grangier / python-goose

HTML Content / Article Extractor, web scraping lib in Python
Apache License 2.0
3.98k stars, 788 forks

Forbes.com text extraction gives redundant data in some cases #236

Open ethan-hunt-007 opened 9 years ago

ethan-hunt-007 commented 9 years ago

When extracting articles from Forbes.com, I often do not get the article text and get unrelated text instead. Here is the code:

>>> from goose import Goose
>>> g = Goose()
>>> art=g.extract(url='http://www.forbes.com/2009/03/18/federal-funds-commerce-ibm-markets-transcript-aig.html')
>>> art.title
u"Full Text: Edward Liddy's Testimony Before Congress"
>>> art.cleaned_text
u'Katy Perry earned $135 million this year--more than any other entertainer on Earth.'

I get this same text for many links of this type. What could be causing this, and how can I fix it?

mhamann commented 8 years ago

@ethan-hunt-007 Forbes is largely incompatible with text extractors like goose, newspaper, etc., because its current site uses JavaScript to render most of the page. That means you would have to load the page in a headless browser of some sort, let the JavaScript run, and then extract the text/data from the rendered HTML. That is a lot more work and would probably be much slower than fetching the page directly.
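
For illustration, here is a minimal sketch of the headless-browser approach described above. It is not something goose does for you. It assumes Selenium with a local Chrome/chromedriver install for rendering, a fixed wait as a crude stand-in for "let the JS run", and goose's `extract(raw_html=...)` parameter for the extraction step. The helper names `render_html` and `extract_rendered` are hypothetical, and whether Forbes actually serves usable article markup this way is untested.

```python
import time


def render_html(url, wait_seconds=5.0):
    """Load `url` in headless Chrome and return the DOM after scripts have run.

    Assumes Selenium and a matching chromedriver are installed; the import is
    lazy so the rest of this sketch can be used without them.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Crude wait for client-side rendering; an explicit wait on a known
        # element would be more reliable, but the right selector is site-specific.
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        driver.quit()


def extract_rendered(url, render=render_html, extractor=None):
    """Render the page first, then hand the resulting HTML to the extractor.

    `extractor` is any object with an `extract(raw_html=...)` method that
    returns an article with `title` and `cleaned_text`; it defaults to Goose().
    """
    if extractor is None:
        from goose import Goose  # on Python 3, the goose3 fork is the assumed equivalent
        extractor = Goose()
    html = render(url)
    article = extractor.extract(raw_html=html)
    return article.title, article.cleaned_text


# Hypothetical usage (needs Chrome, chromedriver, Selenium, and goose installed):
# title, text = extract_rendered('http://www.forbes.com/2009/03/18/...')
```

Starting a browser for every URL is what makes this slow, so if you try it, reuse one driver across pages rather than creating a new one per call.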