gt-big-data / QDoc

Quick & Dirty Operating Crawler

Crawler grabs bad content for this Reuters article. #15

Closed (supersam654 closed this issue 8 years ago)

supersam654 commented 9 years ago

http://reuters.us.feedsportal.com/c/35217/f/654198/s/4a5d7276/sc/7/l/0L0Sreuters0N0Carticle0C20A150C10A0C0A20Cus0Eiran0Enuclear0Ekerry0Ezarif0EidUSKCN0ARW2JA20A1510A0A20DfeedType0FRSS0GfeedName0FworldNews/story01.htm

{
    "feed": "reuters_world", 
    "img": "http://s3.reutersmedia.net/resources/r/?m=02&d=20151002&t=2&i=1084104572&w=644&fh=&fw=&ll=&pl=&sq=&r=LYNXNPEB9111N", 
    "title": "Kerry, Iran's Zarif discuss nuclear deal: State Department", 
    "url": "http://reuters.us.feedsportal.com/c/35217/f/654198/s/4a5d7276/sc/7/l/0L0Sreuters0N0Carticle0C20A150C10A0C0A20Cus0Eiran0Enuclear0Ekerry0Ezarif0EidUSKCN0ARW2JA20A1510A0A20DfeedType0FRSS0GfeedName0FworldNews/story01.htm", 
    "timestamp": 1443826946.0, 
    "content": "World\nRelated: \nWorld\n\r\n                                \t\r\n                                \t\tReuters/Stephanie Keith\nEric Beech\n in Washington and \nParisa Hafezi\n in New York; Editing by \nMohammad Zargham\nState of Innovation\nPremier Content\nFind out what\u2019s in store for our digital-everything lives.\n Finding Cures in Early Research\n Saving Species, By the Numbers", 
    "source": "reuters", 
    "keywords": [], 
    "guid": "us-iran-nuclear-kerry-zarif-idUSKCN0RW2JA"
}

mersted commented 9 years ago

I believe the content is in the span with class "focusParagraph", but the HTML is super messy.
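
A minimal sketch of pulling text out of such a span, using only the Python standard library. It assumes the article body really does sit inside `<span class="focusParagraph">`, which is only a guess from reading the page. `FocusParagraphParser` and `extract_focus_paragraph` are hypothetical names for illustration, not part of the crawler.

```python
from html.parser import HTMLParser


class FocusParagraphParser(HTMLParser):
    """Collect text found inside <span class="focusParagraph"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # span nesting depth; 0 means outside the target span
        self.chunks = []  # text fragments collected from inside the target span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            # A nested span inside the target: track it so its closing
            # tag is not mistaken for the end of the target span.
            self.depth += 1
        elif "focusParagraph" in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_focus_paragraph(html):
    """Return the whitespace-normalized text of all focusParagraph spans."""
    parser = FocusParagraphParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Because the parser only tracks `<span>` nesting, stray or unbalanced tags of other kinds in the messy HTML should not throw it off, though unbalanced spans still could.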

tingofurro commented 8 years ago

This is now fixed. (It was the same problem as issue #20.) To see if it works now, you can run recrawl.py and give it the parameter 560f0fc2a6b867b094aa343d (that's the ID of the article). newContent.txt contains the article, while oldContent.txt contains the wrong content.
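
If it helps, the two files can be compared with a short script. This is only a sketch: it assumes both files are plain UTF-8 text in the current directory, and `diff_contents` and `diff_files` are hypothetical helpers, not something in the repo.

```python
import difflib


def diff_contents(old_text, new_text):
    """Return unified-diff lines describing how new_text differs from old_text."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="oldContent.txt",
            tofile="newContent.txt",
            lineterm="",
        )
    )


def diff_files(old_path="oldContent.txt", new_path="newContent.txt"):
    """Read both files (assumed to be UTF-8) and return their unified diff."""
    with open(old_path, encoding="utf-8") as old_file:
        old_text = old_file.read()
    with open(new_path, encoding="utf-8") as new_file:
        new_text = new_file.read()
    return diff_contents(old_text, new_text)
```

After running recrawl.py, `print("\n".join(diff_files()))` would show the old navigation text as removed lines and the article text as added lines.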