grangier / python-goose

Html Content / Article Extractor, web scraping lib in Python
Apache License 2.0
3.98k stars, 788 forks

Not getting any extracted text #210

Open peterswang opened 9 years ago

peterswang commented 9 years ago

I tried the following, but got only the title and no text:

    >>> from goose import Goose
    >>> url = 'http://householdproducts.nlm.nih.gov/cgi-bin/household/list?tbl=TblBrands&alpha=0'
    >>> g = Goose()
    >>> article = g.extract(url=url)
    >>> article.title
    u'Household Products Database - Health and Safety Information on Household Products'
    >>> article.meta_description
    ''
    >>> article.cleaned_text[:2000]
    u''

I then downloaded the page and tried extracting from the raw HTML as follows, and got the same result:

    >>> raw_html = html_file.read()
    >>> a = g.extract(raw_html=raw_html)
    >>> a.title
    u'Household Products Database - Health and Safety Information on Household Products'
    >>> a.meta_description
    ''
    >>> a.cleaned_text
    u''
    >>> raw_html
    '\n\nHousehold Products Database - Health and Safety Information on Household Products\n\n\n\n\n
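One way to narrow this down is to check whether the downloaded HTML contains any visible text at all, independent of Goose. The sketch below is only a diagnostic, not part of Goose's API: it uses the standard library's `html.parser` (Python 3), and the names `VisibleTextParser`, `visible_text`, and `sample_html` are hypothetical, with a small stand-in page in place of the real one. If a check like this returns text while `cleaned_text` stays empty, the page probably has content that the article extractor does not treat as an article body; that reading is an assumption and is not confirmed in this thread.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(raw_html):
    """Return the non-empty text nodes of raw_html, one per line."""
    parser = VisibleTextParser()
    parser.feed(raw_html)
    parser.close()
    return "\n".join(parser.chunks)


# Stand-in for the downloaded page: a title plus a list of links.
sample_html = """<html><head><title>Sample page</title>
<style>body { color: black; }</style></head>
<body><ul><li><a href="/a">Brand A</a></li><li><a href="/b">Brand B</a></li></ul>
<script>var x = 1;</script></body></html>"""

print(visible_text(sample_html))
```

To run it against the real page, pass the `raw_html` string from the session above to `visible_text` in place of `sample_html`.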
