Closed by supersam654 9 years ago
I added 'byline' to classList for TechCrunch; that element contains "Posted [datetime] by [author]". This isn't a global fix, but it should eliminate the "hours ago" keyword issue with TC articles.
Added a few more classes to classList, so the crawler won't pick up "hours ago" anymore. Also updated the MongoDB qdoc collection to refresh the 'content' and 'keywords' fields.
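For anyone following along, here is a minimal sketch of the kind of class-based filtering described above. It is not the crawler's actual code: `CLASS_LIST`, `ContentExtractor`, and `extract_content` are hypothetical names, the `timestamp` class is made up for illustration, and the sketch assumes reasonably well-formed markup (every non-void element is closed). It only uses Python's standard `html.parser`.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the crawler's classList: elements carrying any
# of these class names are dropped before content/keyword extraction.
CLASS_LIST = {"byline", "timestamp"}

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collects text, skipping everything inside excluded-class elements."""

    def __init__(self, excluded_classes):
        super().__init__()
        self.excluded = set(excluded_classes)
        self.skip_depth = 0  # > 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already skipping: track nesting so we know when the block ends.
            self.skip_depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.excluded.intersection(classes):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_content(html, excluded_classes=CLASS_LIST):
    """Return the page text with excluded-class elements removed."""
    parser = ContentExtractor(excluded_classes)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    '<article>'
    '<div class="byline">Posted 3 hours ago by Jane Doe</div>'
    '<p>The startup raised a new round.</p>'
    '</article>'
)
print(extract_content(sample))
```

With the sample above, the byline text never reaches the extracted content, so "hours ago" cannot end up among the keywords. The limitation is the same as in the comments above: this only works for sites whose markup uses one of the listed class names, so each new site may need its own entries.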
There's a new problem of this type: articles that appear with the keywords "frog", "species", and "horned". They are Business Insider articles.
This isn't part of the article content and we don't want it. It's currently only showing up for TC, but the fix should ideally prevent it everywhere.