Forbes.com text extraction gives redundant data in some cases

grangier / python-goose

HTML Content / Article Extractor, web scraping library in Python

Apache License 2.0

When extracting articles from Forbes.com, goose often does not return the article text and returns unrelated text instead. Here is the code:

>>> from goose import Goose
>>> g = Goose()
>>> art = g.extract(url='http://www.forbes.com/2009/03/18/federal-funds-commerce-ibm-markets-transcript-aig.html')
>>> art.title
u"Full Text: Edward Liddy's Testimony Before Congress"
>>> art.cleaned_text
u'Katy Perry earned $135 million this year--more than any other entertainer on Earth.'

I am getting the same text for many links of this type. What could be the issue, and how can I correct it?
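
To show how widespread the problem is, it may help to list exactly which URLs come back with identical text. The following is only a minimal diagnostic sketch, not part of goose: the helper name `group_urls_by_text` is hypothetical, and it assumes you have already collected the extracted text for each URL into a dict. If many unrelated Forbes URLs share one text, a plausible (but unconfirmed) explanation is that the site serves the same intermediate page for each request instead of the article itself.

```python
from collections import defaultdict


def group_urls_by_text(texts_by_url):
    """Return {text: [urls]} for every extracted text shared by 2+ URLs.

    `texts_by_url` maps each URL to the text extracted from it.
    This helper is hypothetical and is not part of goose.
    """
    groups = defaultdict(list)
    for url, text in texts_by_url.items():
        # Collapse runs of whitespace so trivial spacing differences
        # do not hide otherwise identical results.
        normalized = " ".join(text.split())
        groups[normalized].append(url)
    # Keep only the texts that more than one URL produced.
    return {text: sorted(urls) for text, urls in groups.items() if len(urls) > 1}
```

The input dict could be built from the session above, for example with `{u: g.extract(url=u).cleaned_text for u in urls}`, where `urls` is your own list of Forbes links. An empty result means every URL produced distinct text.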

Forbes.com text extraction gives redundant data in some cases #236