jeff1evesque / ist-664

Syracuse IST-664 Final Project with Chris Wilson (team member)

Aggregate returned mapreduce for tokenization #32

Closed jeff1evesque closed 5 years ago

jeff1evesque commented 5 years ago

We need to collapse the result returned by the mapreduce into Python lists. This is a necessary step before applying tokenization-related techniques.

jeff1evesque commented 5 years ago

448d894: this commit implements the MongoDB finalize step. It returns None when the reducer is never executed, which happens when the corresponding link_id occurs only once:

{'_id': 't3_3vjv', 'value': None}
{'_id': 't3_3vl9', 'value': None}
{'_id': 't3_3vlu', 'value': None}
{
    '_id': 't3_3vlz',
    'value': {
        'score': [2.0],
        'match_id': ['c3vph'],
        'comments': ["why do you think it's a fake? what's so hard in doing a vnc loopback?"],
        'posts': ['A fake, I suppose. Still worth a look. And a smile ;)']
    }
}
{'_id': 't3_3vml', 'value': None}
{'_id': 't3_3vmq', 'value': None}
{'_id': 't3_3vn3', 'value': None}
{'_id': 't3_3vox', 'value': None}
{'_id': 't3_3vpb', 'value': None}
{
    '_id': 't3_3vqd',
    'value': {
        'score': [2.0],
        'match_id': ['c3wvv'],
        'comments': ["(I'm in the business world where skills don't matter, unlike techie land)"],
        'posts': ['Or: "I have the skills and I need the money, but you\'re going to hire the person that went to the same college as you anyway so I might as well get a head start on waiting those tables."']
    }
}
{'_id': 't3_3vrj', 'value': None}
{'_id': 't3_3vso', 'value': None}
{'_id': 't3_3vt0', 'value': None}
{'_id': 't3_3vt1', 'value': None}
{'_id': 't3_3vto', 'value': None}
{'_id': 't3_3vu0', 'value': None}
{'_id': 't3_3vu9', 'value': None}
{'_id': 't3_3vuh', 'value': None}
{
    '_id': 't3_3vut',
    'value': {
        'score': [1.0],
        'match_id': ['c3vxm'],
        'comments': ["That will get you really far...\r\n\r\nSaying that marketing like this cannot be studied is like saying that math cannot be because it is to complex and there will always be unsolved problems.\r\n\r\nI was referring to methods along the lines of what's discussed in: `The Anatomy of Generating Buzz` by Emanuel Rosen. Unfortunately I do not have this book and a reddit search on buzz turned up nothing but bees.."],
        'posts': ["Don't. Just make it good enough and it will generate Buzz for itself."]
    }
}
{'_id': 't3_3vwd', 'value': None}
{'_id': 't3_3vwm', 'value': None}
{'_id': 't3_3vx9', 'value': None}
jeff1evesque commented 5 years ago

The following shows that merging lists key by key preserves the element order of each list:

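>>> x = {1: ['a', 'b', 'c'], 2: ['a', 'i']}
>>> y = {1: ['d', 'e', 'f'], 2: ['g']}
>>>
>>> # append each list in x onto the list stored under the same key in y
>>> for k, v in x.items():
...     if k in y:
...         y[k] += v
...     else:
...         y[k] = list(v)  # copy, so y does not share the list with x
...
>>> print(y)
{1: ['d', 'e', 'f', 'a', 'b', 'c'], 2: ['g', 'a', 'i']}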

Therefore, we'll aggregate each field of the mapreduced values in the same way.
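The same merge can be wrapped in a small helper that leaves its inputs untouched. This is a minimal sketch; the name `merge_list_dicts` is hypothetical and not part of the repository.

```python
def merge_list_dicts(*dicts):
    """Merge dicts of lists key by key, concatenating lists in argument order.

    Returns a new dict; none of the input dicts or their lists is modified.
    """
    merged = {}
    for d in dicts:
        for key, values in d.items():
            # setdefault creates a fresh list the first time a key is seen
            merged.setdefault(key, []).extend(values)
    return merged


x = {1: ['a', 'b', 'c'], 2: ['a', 'i']}
y = {1: ['d', 'e', 'f'], 2: ['g']}

result = merge_list_dicts(y, x)
print(result)  # {1: ['d', 'e', 'f', 'a', 'b', 'c'], 2: ['g', 'a', 'i']}
```

Passing `y` first reproduces the ordering from the session above, and building a new dict avoids the in-place mutation of `y`.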