grangier / python-goose

HTML Content / Article Extractor, web scraping lib in Python
Apache License 2.0

Goose fails to extract content from the Kissmetrics blog, which has some meta tags present. #245

Open jijoy opened 9 years ago

jijoy commented 9 years ago

I am trying to extract content from http://feedproxy.google.com/~r/KISSmetrics/~3/cmb43Q4Mzak/, which redirects to https://blog.kissmetrics.com/optimize-your-social-media-ad-spend-with-advanced-targeting-options/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+KISSmetrics+%28KISSmetrics+Marketing+Blog%29

I get the following error:

  File "D:\env\lib\site-packages\goose\__init__.py", line 56, in extract
    return self.crawl(cc)
  File "D:\env\lib\site-packages\goose\__init__.py", line 66, in crawl
    article = crawler.crawl(crawl_candiate)
  File "D:\env\lib\site-packages\goose\crawler.py", line 154, in crawl
    self.article.title = self.title_extractor.extract()
  File "D:\env\lib\site-packages\goose\extractors\title.py", line 99, in extract
    return self.get_title()
  File "D:\env\lib\site-packages\goose\extractors\title.py", line 78, in get_title
    return self.clean_title(title)
  File "D:\env\lib\site-packages\goose\extractors\title.py", line 42, in clean_title
    title = title.replace(site_name, '').strip()
TypeError: expected a character buffer object

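I think it is caused by the site_name OpenGraph tag on the website: the traceback fails in title.replace(site_name, ''), so site_name is probably not a plain string.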
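
To illustrate the suspected cause, here is a minimal sketch. It assumes the OpenGraph site_name value ends up as something other than a string (for example None), which is not confirmed for this page. The clean_title function and the opengraph dict below are hypothetical stand-ins, not Goose's actual code; they only show how str.replace fails on a non-string argument and how a type guard avoids it.

```python
def clean_title(title, opengraph):
    """Strip the OpenGraph site name from a title, tolerating bad metadata.

    Hypothetical helper for illustration; not part of Goose.
    """
    site_name = opengraph.get("site_name")
    # Only call str.replace when site_name is a non-empty string.
    # Passing None or any other non-string raises TypeError.
    # (On Python 2, check against basestring instead of str.)
    if isinstance(site_name, str) and site_name:
        title = title.replace(site_name, "")
    return title.strip()


# The unguarded call shows the same class of failure as the traceback.
try:
    "Some title".replace(None, "")
except TypeError as exc:
    print("unguarded replace failed:", exc)

# The guarded helper handles both a valid and a missing site name.
print(clean_title("Post title KISSmetrics", {"site_name": "KISSmetrics"}))
print(clean_title("Post title", {"site_name": None}))
```

If the assumption holds, a similar guard around the replace call in clean_title would let extraction continue with the unmodified title instead of raising.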

jijoy commented 9 years ago

Please help me out.