grangier / python-goose

HTML Content / Article Extractor, web scraping lib in Python
Apache License 2.0
3.97k stars 787 forks

Text extraction failed for http://sports.ndtv.com/ #93

Open kambanthemaker opened 10 years ago

kambanthemaker commented 10 years ago

I tried the latest dev version on http://sports.ndtv.com/football/news/223101-cristiano-ronaldo-to-miss-copa-del-rey-final-against-barcelona, and text extraction failed, while a similar article from the same site, http://sports.ndtv.com/indian-premier-league-2014/news/223541-chennai-super-kings-skipper-ms-dhoni-is-ipls-unbeaten-hero, worked well. Could someone help me with this?

kambanthemaker commented 10 years ago

The VentureBeat URL http://venturebeat.com/2014/04/29/why-culture-integration-is-just-as-important-as-spreadsheets-for-ma/ also failed.

grangier commented 10 years ago

Hello,

For sports.ndtv.com, I guess the content text is nested too deep after the heading title.

Regarding venturebeat, you need to use the "soup" parser, as lxml doesn't seem to handle the tag well.

kambanthemaker commented 10 years ago

Hi,

Thanks for the quick response. I will try the approach you mentioned. Do you think I can always use the 'soup' parser to handle most web pages?

On Fri, May 2, 2014 at 3:08 AM, Xavier Grangier notifications@github.com wrote:

Hello,

For sports.ndtv.com, I guess the content text is nested too deep after the heading title.

Regarding venturebeat, you need to use the "soup" parser, as lxml doesn't seem to handle the tag well.

>>> url = "http://venturebeat.com/2014/04/29/why-culture-integration-is-just-as-important-as-spreadsheets-for-ma/"
>>> from goose import Goose
>>> g = Goose({'parser_class':'soup'})
>>> article = g.extract(url=url)
>>> article.cleaned_text[:150]
u'With nearly $3\xa0trillion\xa0in play last year, the land of mergers and acquisitions\xa0would be the world\u2019s fifth largest country, nestling in comfortably be'


Regards, Selvam
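
On the question of whether one parser can be used everywhere: rather than committing to a single parser, one option is to try each parser in turn and keep the first result with non-empty text. The following is only a minimal sketch of that idea, not something the library provides. `extract_with_fallback` and `make_extractor` are hypothetical names introduced here, and the sketch assumes `'lxml'` is accepted as a `parser_class` value alongside the `'soup'` value shown above.

```python
def extract_with_fallback(url, make_extractor=None, parsers=("lxml", "soup")):
    """Return (parser_name, cleaned_text) for the first parser that yields text.

    `make_extractor` is a hypothetical hook: a callable that takes a parser
    name and returns an object with an `extract(url=...)` method.
    """
    if make_extractor is None:
        # Imported lazily so the helper can be exercised without goose installed.
        from goose import Goose

        def make_extractor(parser_name):
            # Assumption: 'lxml' is a valid parser_class value, like 'soup'.
            return Goose({'parser_class': parser_name})

    for parser_name in parsers:
        try:
            article = make_extractor(parser_name).extract(url=url)
        except Exception:
            # Treat a parser crash the same as an empty result and move on.
            continue
        text = (article.cleaned_text or "").strip()
        if text:
            return parser_name, text

    # No parser produced any text.
    return None, ""
```

Calling `extract_with_fallback(url)` with the VentureBeat URL above would try the first parser and, if it returns no text, retry with `'soup'`. This costs a second fetch for pages where the first parser fails, so it is a trade-off rather than a general answer.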