grangier / python-goose

Html Content / Article Extractor, web scraping lib in Python
Apache License 2.0

Problems Parsing Titles #262

Open grantdelozier opened 8 years ago

grantdelozier commented 8 years ago

Title extraction raises an IndexError on certain websites, even though the pages have titles:

  File "/usr/local/lib/python2.7/site-packages/ContentAnalysis-0.1.1-py2.7.egg/ContentAnalysis/document.py", line 53, in parse
    ginfo = g.extract(url=self.link)
  File "/usr/local/lib/python2.7/site-packages/goose/__init__.py", line 56, in extract
    return self.crawl(cc)
  File "/usr/local/lib/python2.7/site-packages/goose/__init__.py", line 66, in crawl
    article = crawler.crawl(crawl_candiate)
  File "/usr/local/lib/python2.7/site-packages/goose/crawler.py", line 154, in crawl
    self.article.title = self.title_extractor.extract()
  File "/usr/local/lib/python2.7/site-packages/goose/extractors/title.py", line 99, in extract
    return self.get_title()
  File "/usr/local/lib/python2.7/site-packages/goose/extractors/title.py", line 78, in get_title
    return self.clean_title(title)
  File "/usr/local/lib/python2.7/site-packages/goose/extractors/title.py", line 56, in clean_title
    if title_words[0] in TITLE_SPLITTERS:
IndexError: list index out of range

To reproduce, run goose's extract on a site such as http://daydreamingfoodie.com/

grantdelozier commented 8 years ago

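The issue on this site, and plenty of others, occurs when the page title is identical to the Open Graph site name. In that case, removing the site name from the title apparently leaves an empty string, so title_words is empty and title_words[0] raises the IndexError.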

Fixed the issue in this commit of my fork
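
For readers who cannot follow the commit link, here is a minimal sketch of the failure and one possible guard. It is a simplified, hypothetical stand-in for clean_title, not goose's actual code and not necessarily what the fork's commit does. The standalone function signature, the TITLE_SPLITTERS values and the example site name are all assumptions made for illustration.

```python
# Hypothetical, simplified stand-in for goose's title cleaning.
# The splitter values below are assumed for illustration only.
TITLE_SPLITTERS = ["|", "-", "»", ":"]


def clean_title(title, site_name=None):
    """Remove the site name from a title and trim stray splitter tokens."""
    if site_name:
        # When the title equals the site name, this leaves an empty string.
        title = title.replace(site_name, "").strip()

    title_words = title.split()

    # Guard: an empty list means there is nothing to index, so return early
    # instead of evaluating title_words[0] and raising IndexError.
    if not title_words:
        return ""

    # Drop a leading splitter, e.g. "| my article".
    if title_words[0] in TITLE_SPLITTERS:
        title_words.pop(0)

    # Drop a trailing splitter, re-checking that the list is not empty.
    if title_words and title_words[-1] in TITLE_SPLITTERS:
        title_words.pop()

    return " ".join(title_words)


# Title identical to the (assumed) site name: returns "" instead of crashing.
print(repr(clean_title("Example Site", site_name="Example Site")))
# Ordinary case: the site name and the splitter are removed.
print(repr(clean_title("My article | Example Site", site_name="Example Site")))
```

Returning an empty title is only one option. A caller might prefer to fall back to the original, uncleaned title when cleaning removes everything.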