gt-big-data / QDoc

Quick & Dirty Operating Crawler

Remove "N hours ago by PERSON NAME" for TechCrunch articles. #11

Closed supersam654 closed 9 years ago

supersam654 commented 9 years ago

This isn't part of article content and we don't want it. It's currently only showing up for TC but the fix will ideally prevent it everywhere.

mersted commented 9 years ago

I added 'byline' to classList for TechCrunch; that element contains "Posted [datetime] by [author]". It's not a global fix, but it should eliminate the "hours ago" keyword issue for TC articles.

mersted commented 9 years ago

Added a few more classes to classList, so the crawler won't catch "hours ago" anymore. Also updated the 'content' and 'keywords' fields in the MongoDB qdoc collection.
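
For anyone following along, here is a rough sketch of the class-based filtering idea described above. It is not the actual QDoc code: the names `CLASS_LIST`, `ContentExtractor`, and `extract_text` are made up for illustration, the class names other than 'byline' are guesses, and it uses only the standard-library `html.parser` rather than whatever parser the crawler really uses.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the crawler's classList: any element carrying
# one of these CSS classes is dropped, together with everything inside it.
CLASS_LIST = {"byline", "timestamp"}

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collects text while skipping elements whose class is in skip_classes."""

    def __init__(self, skip_classes):
        super().__init__()
        self.skip_classes = set(skip_classes)
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already inside a skipped element: just track nesting.
            self.skip_depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_classes.intersection(classes):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_text(html, skip_classes=CLASS_LIST):
    """Return the visible text of html with the skipped elements removed."""
    parser = ContentExtractor(skip_classes)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    '<article>'
    '<div class="byline">Posted 3 hours ago by <a href="#">Jane Doe</a></div>'
    '<p>Actual article text.</p>'
    '</article>'
)
print(extract_text(sample))
```

Because this filters by class name per site, it shares the limitation mentioned above: each new site with a different byline class needs its own entry in the list.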

tingofurro commented 9 years ago

There's a new problem of this type: articles are showing up with the keywords "frog", "species", and "horned". They are Business Insider articles.