gt-big-data / QDoc

Quick & Dirty Operating Crawler

Gettext method testing #20

Closed tingofurro closed 8 years ago

tingofurro commented 8 years ago

When going from a soup to the "content" of the article, some elements are omitted. The issue seems to be with paragraphs that have links in them. I isolated a case where it happens with VentureBeat (http://venturebeat.com/2015/11/06/activision-blizzards-strategy-for-world-conquest/), but I also noticed that it happens with other sources. In the VentureBeat case, the entire first paragraph is missing from the crawled content.

tingofurro commented 8 years ago

Fixed it. The bug was due to the removeComments() method in crawlContent removing any text that contained an "if" substring (for instance, any text with the word "California").
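
For anyone hitting something similar, here is a rough sketch of how a check like this can go wrong. It is not the actual QDoc code: the function names, the sample strings, and the guess that the "if" test was meant to catch conditional-comment residue such as `[if lt IE 9]>` are all assumptions made for illustration.

```python
import re

# Hypothetical reconstruction of the faulty check: any text node that
# contains the substring "if" anywhere is treated as comment residue.
def remove_comments_buggy(text_nodes):
    return [t for t in text_nodes if "if" not in t]

# One possible narrower check (an assumption, not the fix that was committed):
# drop a node only when it starts with a conditional-comment marker.
CONDITIONAL_COMMENT = re.compile(r"\s*\[if\s[^\]]*\]>")

def remove_comments_fixed(text_nodes):
    return [t for t in text_nodes if not CONDITIONAL_COMMENT.match(t)]

# Made-up sample text nodes, not taken from the VentureBeat page.
nodes = [
    "Activision Blizzard is based in California.",
    '[if lt IE 9]><script src="shim.js"></script><![endif]',
    "A paragraph with no trigger.",
]

buggy = remove_comments_buggy(nodes)
fixed = remove_comments_fixed(nodes)
```

With these sample nodes, the buggy version drops the California sentence along with the conditional comment, while the narrower check keeps both ordinary sentences.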