jeff1evesque / ist-664

Syracuse IST-664 Final Project with Chris Wilson (team member)

Fix and improve mapreduce #29

Closed jeff1evesque closed 5 years ago

jeff1evesque commented 5 years ago

We need to fix and further improve our mapreduce before implementing bag of words and any further analysis.

jeff1evesque commented 5 years ago

The following is an example of an aggregated instance by link_id:

    'value': {
        'values': [{
                    'values': [{
                            'id': 'c255d',
                            'body': "Didn't we have a diamond story just recently?",
                        }, {
                            'id': 'c255q',
                            'body': 'yea, and the majority response was, "well, we all agree that its absolute b.nd see what happens."',
                            'score': 9.0,
                            'comment': 'yea, and the majority response was, "well, we all agree that its sary time and see what happens."',
                            'parent_id': 't1_c255d'
                        }, {
                            'id': 'c255w',
                            'body': 'Try getting a $2000 gift cardhe geopolitical injustices perpetuated by the diamond tyrants, and also my philosophical disagreements with using a gift card worth the same amount as that engagement ring."',
                            'score': 11.0,
                            'comment': 'Try getting a $2000 gift ca the geopolitical injustices perpetuated by the diamond tyrants, and also my philosophical disagreements with using a gift card worth the same amount as that engagement ring."',
                            'parent_id': 't1_c02529'
                        }, {
                            'id': 'c256y',
                            'body': ' earrings. Both were priced about the same. Which do you think she chose?',
                            'score': 9.0,
                            'comment': 'Gave the wifeth were priced about the same. Which do you think she chose?',
                            'parent_id': 't3_24xr'
                        }, {
                            'id': 'c257t',
                            'body': '[dd': 'c257u',
                            'body': 'And I stopped playing Falling sand for this? The article is from 1982!',
                            'score': 4.0,
                            'comme, '
                            parent_id ': '
                            t3_24xr '}, {'
                            id ': '
                            c258h ', '
                            body ': "Why do you even need an engagement ring? If I'
                            m correct that 's 0, '
                            comment ': "Why do you even need an engagement ring? If I'
                            m correct that 's one of those modern things that they : '
                            The sad thing is,
                            everyone is still being fooled.
                            ', '
                            score ': 11.0, '
                            comment ': '
                            The sad thing is,
                            everyone is stis no "real value" of\ 's all about how much people are willing to pay.',
                            'score': 28.0,
                            'comment': 'thing to pay.',
                            'parent_id': 't1_c02529'
                        }, {
                            'id': 'c25av',
                            'body': 'Mod this guy up. Fundamental to the discussion ofit. Nothing else.',
                            'score': -1.0,
                            'comment': 'Mod this guy up. Fundamental to the discussion of markets is the obs '
                            parent_id ': '
                            t1_c259u '}, {'
                            id ': '
                            c25az ', '
                            body ': ' & gt;If I\ 'm correct that\'s one of those modern things\r\n\r\ng itself can be tied to the Fourth Lateran Council presided over by Pope Innocent III in 1215."\r\nhttp://en.wikipe correct that\'s one of those modern things\r\n\r\nOnly if you count 1215 as modern.\r\n\r\n"The inception of the e by Pope Innocent III in 1215."\r\n',
                            'parent_id': 't1_c258h'
                        }, rrings would be bigger diamonds, or in my wife 's case a perhaps even more ridiculously priced item, a Chanel bag.",arrings would be bigger diamonds, or in my wife'
                        case a perhaps even more ridiculously priced item, a Chanel bag.
                        ", 'comment': '[deleted]', 'parent_id': 't1_c256y'}, {'id': 'c25bu', 'body': '[deleted]', 'score': 6.0, 'comment': 'P said it was the engagement ring per se that was modern - which is not true.', 'score': -3.0, 'comment': 'Yeah but true.', 'parent_id': 't1_c25bu'}, {'id': 'c25d8', 'body': "
                        No offense by I
                        for one wouldn 't want the kind of wife se by I for one wouldn'
                        t want the kind of wife who would prefer a computer to earings.
                        ", 'parent_id': 't1_c256y'},  'my friend buys diamonds', 'parent_id': 't3_24xr'}, {'id': 'c25dr', 'body': "
                        I dunno.I like a woman who knows howo ? \r\ n\ r\ nThere 's a line where extravagant becomes absurd. I'
                        m not sure I 'd want a woman with tastes that expensiven who knows how to preen herself, but a single pair of earrings the cost of a new Macbook Pro?\r\n\r\nThere'
                        s a lin that expensive.Not
                        for a wife, anyways.
                        ", 'parent_id': 't1_c25d8'}, {'id': 'c25du', 'body': 'Actually fundamentalt is a monopoly with a single organization throttling the supply of diamonds in the world. The reason that diamondso DeBeers.\r\nBut otherwise, in most other areas where there is healthy competition, the prices of items are close e discussion is an understanding of the nature of the diamond market. It is a monopoly with a single organization tte hugely from their "
                        real value " is because there are no competitors to DeBeers.\r\nBut otherwise, in most other air real values.', 'parent_id': 't1_c25av'}, {'id': 'c25e3', 'body': '> the prices of items are close to their reing. Humans place a value on something by deciding how much to pay for it, and whether to buy it or not. This is norinciple.\r\n\r\nHave a read of this:', 'score': -6.0, 'comment' get it do you. There\'s no "
                        real value " of anything. Humans place a value on something by deciding how much to paystionable tactics - just a fundamental economic principle.\r\n\r\nHave a read of this: since I read the '[Diamond Maker](' (HG Wells) I have tended to discount thell of the Russian empire of the eighties, stories of synthetic diamonds have been filtering out.  Now while the Swediamond) since the '50's - it has only recently become a threat to natural diamonds as Russian crafted synthetic diere](\r\n\r\nSo much so that DeBeers have spent [consideration]( of diamonds. You could almost add another C to the *4C's*....\er since I read the '[Diamond Maker](' (HG Wells) I have tended to discount the l of the Russian empire of the eighties, stories of synthetic diamonds have been filtering out.  Now while the Sweeiamond) since the '50's - it has only recently become a threat to natural diamonds as Russian crafted synthetic diare](\r\n\r\nSo much so that DeBeers have spent [consideration]( of diamonds. You could almost add another C to the *4C's*....\rd': 'c25em', 'body': "
                        Judging from the fact that you 're a redditer, I assumed the answer was that she chose a MacBoguy who knows I'
                        d rather have a sweet new computer or an amazing trip than jewelry.
                        ", 'score': 10.0, 'comment': "
                        Ju chose a MacBook Pro, but I now realize that 's not what you meant.  I hope I end up with a guy who knows I'
                        d rather1_c256y '}, {'
                        id ': '
                        c25f7 ', '
                        body ': '
                        From the link that you posted : \r\ n\ r\ nIn neoclassical economics, the value of aan open and competitive market.This is determined primarily by the demand
                        for the object relative to supply.\r\ n\ rk!', '
                        score ': 0.0, '
                        comment ': '
                        From the link that you posted: \r\ n\ r\ nIn neoclassical economics, the value of an objen and competitive market.This is determined primarily by the demand
                        for the object relative to supply.\r\ n\ r\ nTha 'parent_id': 't1_c25e3'
                        'id': 'c25fh',
                        'body': '<i>\r\nit has only recently become a threat to natural dia market.\r\n</i>\r\n\r\nMy point precisely.. So it does seem like there were no real competitors to DeBeers ubecome a threat to natural diamonds as Russian crafted synthetic diamonds of gem quality have permiated the market. real competitors to DeBeers until recently.',
                        'parent_id': 't1_c25el'
                        'id': 'c25fo',
                        'body': 'So we are in agre them.',
                        'score': -4.0,
                        'comment': 'So we are in agreement then? Diamonds have no "real value" other than what peophad seen the documentary about the synthetic diamonds being produced cheaply in Russia. Apparently, the synthetic d DeBeers spin machine was busy trying  to prove to people that the flaws were a sign of it being natural and therefhad seen the documentary about the synthetic diamonds being produced cheaply in Russia. Apparently, the synthetic d DeBeers spin machine was busy trying  to prove to people that the flaws were a sign of it being natural and therefd': 'c25gf',
                        'body': 'I am neither in agreement nor disagreement. :-) My point was just that if the market were com definition that I quoted.\r\n\r\nIMHO, your definition of "value", i.e., what people will pay for them, is more apaintings are expensive. But only those painters that people take a "fancy" to are  expensive. But even in such a sithe price  for such paintings would be drastically lower than if they were rarer to obtain.',
                        'score': 0.0,
                        'commenf the market were competitive then the price of diamonds would be closer to the "value" as per the definition that  for them, is more appropriate in the field of something like say the art industry. Not all rare paintings are expeBut even in such a situation, if Van Gogh were able to paint a million paintings in his lifetime, the price  for su'
                        parent_id ': '
                        t1_c25fo '}, {'
                        id ': '
                        c25gj ', '
                        body ': '
                        http: //', 'score': -1.0, 'comment': 'htt': "Nobody is being fooled. Nobody disputes that its value on the street far exceeds it true rarity. The point is tPeople were being fooled if they invested in a diamond ring thinking it would turn out to be some grand return lateh than pawned or sold. \r\n\r\nYeah it would be nice if things didn't work like this, but it's been the practice foell of a long time. It's simply a physical adjunct to an entirely abstract idea. I don't see the commotion.", 'scorn the street far exceeds it true rarity. The point is that the value is not inherent in the item itself, but ratherthinking it would turn out to be some grand return later in life, when fact is that far more are held onto by theirgs didn't work like this, but it's been the practice for numerous years. It's a modern dowry, in a sense and that'sirely abstract idea. I don't see the commotion.", 'parent_id': 't1_c259h'}, {'id': 'c25jd', 'body': '*People were bbe some grand return later in life, when fact is that far more are held onto by their owners into death than pawnednd up in pawn shops due to break-ups and aren\'t all worn for a lifetime doesn\'t diminish the force of your argumeo when they were new.  I\'d rather not assume that a pawned diamond indicates somebody was "fooled"\'s the lawnce the relationship ends people may not want to be reminded of the person the ring was meant to signify.  \r\n\r\nuy a new diamond for a new girlfriend 2 years later.  It\'s not a matter of learning a lesson but rather deciding tituation are, is worth spending a lot of money on.\r\n\r\nIf you want to pick up a good quality, cheap diamond, loo advise telling your beloved that you bought her ring at a pawn broker any more than I would advise telling her tha any better.  
The value of the item proceeds from its symbolic meaning linked to one woman and the *perception* tha spend large amounts of cash on squashed carbon that they KNOW they\'ve been culturally conditioned to buy, the monnI know some women who don\'t care where the diamond came from.  But other women consider that a diamond can only bdrive used cars or wear thrift store clothing.  I admit it\'s bizarre, but who here can really say that they undershey invested in a diamond ring thinking it would turn out to be some grand return later in life, when fact is that nEven if you conceded that tons of diamond engagement rings end up in pawn shops due to break-ups and aren\'t all wecause those pawned diamonds don\'t have any value compared to when they were new.  I\'d rather not assume that a preturns at work.  Rings are sentimental, personal gifts and once the relationship ends people may not want to be rean the same guy who pawned a ring wouldn\'t turn around and buy a new diamond for a new girlfriend 2 years later.   the ring, no matter what your view of the economics of the situation are, is worth spending a lot of money on.\r\nnear military bases, the closer, the better.  But I wouldn\'t advise telling your beloved that you bought her ring  you found for her came from a thrift store that didn\'t know any better.  The value of the item proceeds from its are, and if hundreds of thousands of men in Western countries spend large amounts of cash on squashed carbon that the item rises according to those market forces anyway.\r\n\r\nI know some women who don\'t care where the diamond cse the symbiosis will be meaningless to them, even women who drive used cars or wear thrift store clothing.  
I admiy?', 'parent_id': 't1_c25go'}, {'id': 'c25kv', 'body': 'Yeesh.', 'score': 4.0, 'comment': 'Yeesh.', 'parent_id': 'ttes that its value on the street far exceeds it true rarity.\r\n\r\nNobody disputes it, but almost nobody knows it  Just like the chimp who won't take good food as a reward when he sees that you've got better food to share. Many pgt; Nobody is being fooled. Nobody disputes that its value on the street far exceeds it true rarity.\r\n\r\nNobody atural diamonds were, demand would fall. Just like the chimp who won't take good food as a reward when he sees thats charged.", 'parent_id': 't1_c25go'}, {'id': 'c25n1', 'body': 'But Diamonds *are* rare. DeBeers sees to it that th it that they are.', 'parent_id': 't1_c25l6'}, {'id': 'c25n3', 'body': '> IMHO, your definition of "value", i.e.onomics\' definition.', 'score': -2.0, 'comment': '> IMHO, your definition of "value", i.e., what people will pa', 'parent_id': 't1_c25gf'}, {'id': 'c25nv', 'body': 'I understand women. Ask away.', 'score': -6.0, 'comment': 'I ': 'I know that. What does that have to do with my point?', 'score': 1.0, 'comment': 'I know that. What does that h'This whole discussion is kind of missing the point. Yes, value is determined by humans. But in most markets, it\'s the price it would bring in an *open and competitive* market. In an open and competitive market, diamonds would nojewels, and so competitors would undercut one another until they\'d reached a price proportional to the cost of manf when they say "real value." The reason that diamonds are so expensive is not that they\'re difficult to make or fat they cannot be undercut. \r\n\r\nAnd that\'s what\'s illegitimate about the whole operation, which is what elicired product, but by closing the market to competition.', 'score': 11.0, 'comment': 'This whole discussion is kind o it\'s also determined by how easy it is to get. 
As noted in the link, the value is the price it would bring in an uld not bring a very high price, because they\'re very cheap to find and turn into jewels, and so competitors wouldof manufacture. This, I think, is what the other posters are intuitively thinking of when they say "real value." The or find, but because De Beers has closed the market to competition in this way that they cannot be undercut. \r\n elicits distaste from readers; De Beers mainly makes money not by providing a desired product, but by closing the iamonds aren't that commonly distributed. How many diamond mines are there in the world? What is common about them  indirectly by DeBeers.", 'score': -6.0, 'comment': "Diamonds aren't that commonly distributed. How many diamond mises from one mine. Of which most are owned directly or indirectly by DeBeers.", 'parent_id': 't1_c25p3'}, {'id': 'c misconception that something has an intrinsic value - it doesn't.", 'score': -6.0, 'comment': "We all know DeBeersn intrinsic value - it doesn't.", 'parent_id': 't1_c25p3'}, {'id': 'c25rx', 'body': 'Cool. I\'ll take \'em all inst financial value of something she can *actually use*, and who didn\'t demand an overpriced rock as a "symbol" of myr\n\r\nI\'d *much* rather have a wife who understood the practical and financial value of something she can *actualon.', 'parent_id': 't1_c25d8'}]}, {'values': [{'id': 'c25sq', 'body': 'As of me reading this, there are 32 commentsnd is, more often than not, actually paying for the right to have sex with a specific woman. The value of *that* maemantics). Being happily married for 13 years, I hereby declare that diamonds are for... other people.', 'score': -e mention of the word "sex". So be it: overpaying for a diamond is, more often than not, actually paying for the riigher than the intrinsic value of the diamond (whatever the semantics). 
Being happily married for 13 years, I hereb'id': 'c25tv', 'body': "Apparently, back when God created man, he not only made sex outside of marriage possible, bave sex, and didn't need a marriage proposal to persuade them.  I know, I didn't believe it at first either, but I oes without saying that God'll strike down any woman who lets a man touch her before he's given her a diamond ring.back when God created man, he not only made sex outside of marriage possible, but he also gave women a sex drive soproposal to persuade them.  I know, I didn't believe it at first either, but I read it on the internet somewhere, se down any woman who lets a man touch her before he's given her a diamond ring.  I mean, that's just common sense."ssue has been inconsistent these last years. Are you implying that He is in cahoots with DeBeers?\r\n", 'score': 7.years. Are you implying that He is in cahoots with DeBeers?\r\n", 'parent_id': 't1_c25tv'}, {'id': 'c262q', 'body':e': -2.0, 'comment': "Why does this guy get -7 points? I think that's an amazing comment.", 'parent_id': 't1_c25df' 'parent_id': 't1_c02529'}, {'id': 'c26ag', 'body': "Wow, a company uses advertising to sway people to buy something.", 'score': 0.0, 'comment': "Wow, a company uses advertising to sway people to buy something they don't need? Fas24xr'}, {'id': 'c26nf', 'body': '[deleted]', 'score': 1.0, 'comment': '[deleted]', 'parent_id': 't1_c25em'}, {'id': not yet born.', 'score': 3.0, 'comment': 'Feb. 1982  --> probably over half the reddit users were not yet born.or generations.  It's the scale of the operation that is impressive.  Something Dr. Evil could do.", 'score': 0.0, ration that is impressive.  Something Dr. Evil could do.", 'parent_id': 't1_c26ag'}, {'id': 'c275h', 'body': 'The " instance, you valued the look of the dollar coin above the paper bill, in theory, without a a law attached, you wocoin was indeed more valuable. 
It is only with a combination of law and general public acceptance that we "value" b say that De Beers with the help of their marketing company could sell a "ketchup popcile to a lady with white glov to educate the public.', 'score': 0.0, 'comment': 'The "real" price of anything is only the value that YOU and I pe paper bill, in theory, without a a law attached, you would be able to buy more with it so long as the person agreon of law and general public acceptance that we "value" both as one and the same. It\'s not so with the diamond indy could sell a "ketchup popcile to a lady with white gloves." Unbelievable. I\'m giving this article to every woman'c275l', 'body': '[deleted]', 'score': 2.0, 'comment': '[deleted]', 'parent_id': 't1_c25nv'}, {'id': 'c275m', 'bodythe demise of the DeBeers cartel and the devaluation of diamonds due to a handful of factors (investment diamonds, e diamond market (and DeBeers, I presume) is alive and flourishing... evident from displays of public consumption bsted phones and ipods.', 'score': 2.0, 'comment': 'The most significant aspect of the date, is that the article for to a handful of factors (investment diamonds, new sources, loss of control over African mines, etc).\r\n\r\nIf any. evident from displays of public consumption by rap artists (ice/bling) and airheaded hollywood socialites with dibody': 'Oh yes, he\'s gotta know how to "take me places I\'ve never gone before."', 'score': 1.0, 'comment': 'Oh yet_id': 't1_c26nf'}]}, {'id': 'c3cle', 'body': 'Interesting that both the diamond and the computer depreciate in val least the diamond has a practical floor price.  The computer may eventually *cost* you to dispose of.', 'score': 2 in value so precipitously.  Sure, the computer does it a bit more gradually, but at least the diamond has a practirent_id': 't1_c25rx'}]}, '_id': 't3_24xr'}

jeff1evesque commented 5 years ago

c72000b: the following is an instance with more than one reduced aggregation:

    'value': {
        'results': {
            'comments': ['[ah, but you can](', 'Reddit is a huge time-sink. I heard they hired a guy whose only job is to tell them to quit reading reddit and get back to work.', 'Yeah, yeah, yeah. I\'m sure the next subdomain will be -- for fanboys and fangirls and faneverythinginbetweens of all manner and description.\r\n\r\nAlso, I\'m completely okay with reddit-centric articles; the site\'s young enough that by the time everyone\'s heard of it, most of the people commenting now will be able to plausibly claim that they were "There from the beginning." So a little community spirit isn\'t a bad thing by any means.', 'Indeed.\r\n\r\nWeb 2.0 for me means constant down time for sites built on unstable but "cool" tech. I live in Australia, and due to my prime time being the small hours, often my "web 2.0" apps that I use regularly are not working.\r\n\r\nHere is the dirty laundry list:\r\n\r\n* Reddit (actually no, its quite reliable now)\r\n* Google reader (less so GMail)\r\n* Backpack (37 signals) - either down or very very slow\r\n* - often down. Its terrible.', 'I think that you guys are doing great, love, Mom', "I am proud of you boys too.  I know you work hard because you don't have time to eat well.  Keep up the good work. Love, Mom.\r\n\r\nSprechen Sie nicht ueber Alexis und Steve, wenn Sie wissen wie schwer sie arbeiten, wuerden Sie nicht schlechtes kommentieren.", 'LOL, thanks mom(s)...'],
            'score': ['[ah, but you can](', 'Reddit is a huge time-sink. I heard they hired a guy whose only job is to tell them to quit reading reddit and get back to work.', 'Yeah, yeah, yeah. I\'m sure the next subdomain will be -- for fanboys and fangirls and faneverythinginbetweens of all manner and description.\r\n\r\nAlso, I\'m completely okay with reddit-centric articles; the site\'s young enough that by the time everyone\'s heard of it, most of the people commenting now will be able to plausibly claim that they were "There from the beginning." So a little community spirit isn\'t a bad thing by any means.', 'Indeed.\r\n\r\nWeb 2.0 for me means constant down time for sites built on unstable but "cool" tech. I live in Australia, and due to my prime time being the small hours, often my "web 2.0" apps that I use regularly are not working.\r\n\r\nHere is the dirty laundry list:\r\n\r\n* Reddit (actually no, its quite reliable now)\r\n* Google reader (less so GMail)\r\n* Backpack (37 signals) - either down or very very slow\r\n* - often down. Its terrible.', 'I think that you guys are doing great, love, Mom', "I am proud of you boys too.  I know you work hard because you don't have time to eat well.  Keep up the good work. Love, Mom.\r\n\r\nSprechen Sie nicht ueber Alexis und Steve, wenn Sie wissen wie schwer sie arbeiten, wuerden Sie nicht schlechtes kommentieren.", 'LOL, thanks mom(s)...'],
            'posts': ["I don't mean to be rude, but I would expect more improvements if they are actually working 16-hour days on the site.\r\n\r\nThat's a hell of a lot of time. Heck, you could probably write Reddit in 16-hours.", "I don't mean to be rude, but I would expect more improvements if they are actually working 16-hour days on the site.\r\n\r\nThat's a hell of a lot of time. Heck, you could probably write Reddit in 16-hours.", "Hey, I have an idea! Let's be just like and stroke our egos by voting ANY STORY that has ANYTHIN AT ALL to do with our site, way the hell up! Because it's really interesting! No seriously I'm not kidding! Why do you think I'm being sarcastic! I love DIGG and I especially love that one out of every two articles is about how awesome and big it is and how it's going to take on slashdot! We need to be just like that because it makes for great reading material! I want each and every one of you to go out RIGHT NOW and find every single story you can about reddit and submit it! Then the whole front page can be about reddit! That would be totally radical to the maximum!\r\n</sarcasm>", "I don't mean to be rude, but I would expect more improvements if they are actually working 16-hour days on the site.\r\n\r\nThat's a hell of a lot of time. Heck, you could probably write Reddit in 16-hours.", '[ah, but you can](', 'I think that you guys are doing great, love, Mom', "I am proud of you boys too.  I know you work hard because you don't have time to eat well.  Keep up the good work. Love, Mom.\r\n\r\nSprechen Sie nicht ueber Alexis und Steve, wenn Sie wissen wie schwer sie arbeiten, wuerden Sie nicht schlechtes kommentieren."],
            'match_id': ['c101j', 'c101j', 'c100z', 'c101j', 'c101p', 'c1036', 'c1042']
        }
    },
    '_id': 't3_zn1'

The following is a trivial example, where the reducer did not operate, either because the link_id had a trivial element, or there were no post-comment pairs:

    'value': {
        'results': {
            'comments': [],
            'score': [],
            'posts': [],
            'match_id': []
        }
    },
    '_id': 't3_zll'

We can now return the aggregated results to Python and perform a bag-of-words analysis.
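As a rough illustration of that next step, the sketch below counts word tokens in the `comments` and `posts` lists of one aggregated document. The sample document, the `bag_of_words` helper, and the tokenizing regex are assumptions made for this example, not the project's actual implementation.

```python
import re
from collections import Counter

# Hypothetical aggregated document, shaped like the results shown above
# (texts shortened for the example).
doc = {
    "_id": "t3_zn1",
    "value": {
        "results": {
            "comments": ["Reddit is a huge time-sink.", "LOL, thanks mom(s)..."],
            "posts": ["You could probably write Reddit in 16-hours."],
        }
    },
}


def bag_of_words(texts):
    """Count lowercase word tokens across a list of strings."""
    counts = Counter()
    for text in texts:
        # Assumed tokenizer: runs of letters, digits, and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts


results = doc["value"]["results"]
comment_bow = bag_of_words(results["comments"])
post_bow = bag_of_words(results["posts"])
```

Keeping separate counters for comments and posts preserves the post-comment distinction that the reducer produces; they could be merged with `comment_bow + post_bow` if only the thread-level vocabulary matters.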

jeff1evesque commented 5 years ago

cc5f3b6: we forgot to return the score for the original post. The score is the number of upvotes minus the number of downvotes. With it returned, we can further filter for posts that attracted sufficient interest.
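A minimal sketch of such a filter follows. The field name `post_score`, the threshold, and the sample documents are all hypothetical; the thread does not show how the score is stored in the reduced output.

```python
# Arbitrary threshold chosen for illustration only.
MIN_SCORE = 5


def has_sufficient_interest(doc, min_score=MIN_SCORE):
    """Return True when a reduced document carries a high enough post score."""
    value = doc.get("value")
    if not value:
        # No aggregated value for this link_id.
        return False
    score = value.get("post_score")  # assumed field name
    return score is not None and score >= min_score


# Hypothetical reduced documents keyed by link_id.
docs = [
    {"_id": "t3_a", "value": {"post_score": 28.0}},
    {"_id": "t3_b", "value": {"post_score": -3.0}},
    {"_id": "t3_c", "value": None},
]

popular = [doc["_id"] for doc in docs if has_sufficient_interest(doc)]
```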

jeff1evesque commented 5 years ago

Our earlier trivial result now reduces to the following:

{'value': None, '_id': 't3_9413'}
{'value': None, '_id': 't3_9510'}
{'value': None, '_id': 't3_9684'}

Note: in this case _id corresponds to the mapreduce key, which represents the link_id.
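On the Python side, the empty results can be separated from the usable ones before any analysis. The sketch below assumes the results arrive as a list of dictionaries; the `t3_example` entry is invented for illustration, while the two `None` entries are copied from the output above.

```python
# Results as they might be returned to Python.
raw = [
    {"value": None, "_id": "t3_9413"},
    {"value": None, "_id": "t3_9510"},
    # Hypothetical non-trivial result, not taken from the real data.
    {
        "value": {"results": {"comments": ["hi"], "posts": ["hello"]}},
        "_id": "t3_example",
    },
]

# Index usable results by link_id (the mapreduce key stored in _id).
by_link_id = {doc["_id"]: doc["value"] for doc in raw if doc["value"] is not None}

# Keep track of the link_ids that produced no aggregated value.
skipped = [doc["_id"] for doc in raw if doc["value"] is None]
```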

jeff1evesque commented 5 years ago

Additionally, returning the results to Python and then splitting them with a pandas DataFrame (more efficient than native lists) appears to be considerably faster. A benchmark would need to be performed before claiming a specific figure such as "at least 15x faster". Also, although the native aggregation pipeline is supposedly faster than mapreduce, mapreduce reportedly copes better once sharding is taken into account. Furthermore, since we didn't use MongoDB's finalize in conjunction with the mapreduce, the corresponding JavaScript code can be reused in other NoSQL database systems.
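One way to run that benchmark is a small `timeit` harness like the sketch below. Only the native-list baseline is implemented here; `split_native`, `benchmark`, and the synthetic data are assumptions for illustration, and a pandas-based splitter (a hypothetical `split_pandas`) would be timed through the same `benchmark` function.

```python
import timeit


def split_native(pairs):
    """Split (post, comment) pairs into two lists using plain Python."""
    posts, comments = [], []
    for post, comment in pairs:
        posts.append(post)
        comments.append(comment)
    return posts, comments


def benchmark(func, data, repeat=5, number=10):
    """Return the best observed time per call, in seconds."""
    timer = timeit.Timer(lambda: func(data))
    return min(timer.repeat(repeat=repeat, number=number)) / number


# Synthetic post-comment pairs standing in for the real aggregated results.
pairs = [(f"post {i}", f"comment {i}") for i in range(10_000)]

native_seconds = benchmark(split_native, pairs)
# With a pandas variant available, the speedup would be:
#   native_seconds / benchmark(split_pandas, pairs)
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the claim under test is a ratio such as 15x.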

jeff1evesque commented 5 years ago

We need to make minor comment changes.