lironesamoun opened this issue 1 year ago
I'm facing the same issue for CNN articles (e.g. https://edition.cnn.com/2023/08/09/politics/georgia-medicaid-eligibility-work-requirements/index.html). It seems that lxml.etree.fromstring(resp.text, parser=lxml.html.HTMLParser()) returns None for some reason. I haven't investigated it any further, but it looks like an issue with lxml.
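For reference, a minimal sketch of the failing call, using the CNN URL from above and assuming the page is fetched with requests:

import lxml.etree
import lxml.html
import requests

url = "https://edition.cnn.com/2023/08/09/politics/georgia-medicaid-eligibility-work-requirements/index.html"
resp = requests.get(url)

# Parsing the decoded text with lxml's HTML parser reportedly yields None for this page
tree = lxml.etree.fromstring(resp.text, parser=lxml.html.HTMLParser())
print(tree)  # None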
I did some more analysis. For the same article, goose3 works fine while extruct crashes, even though both libraries use lxml. The only difference is that goose3 applies its goose3.utils.encoding.smart_str function first (https://github.com/goose3/goose3/blob/d3c404a79e0e15b7957355083bd5a7590d4103ba/goose3/parsers.py#L59). I've tried it manually and it seems to do the trick for me.
Additionally, there is an lxml.html.soupparser module that can also be used.
To summarize, either of the two approaches worked for me:
import extruct
from goose3.utils.encoding import smart_str

html = '...'
extruct.extract(smart_str(html), syntaxes=['json-ld'])
import extruct
from lxml.html import soupparser

html = '...'
extruct.extract(soupparser.fromstring(html), syntaxes=['json-ld'])
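Note that lxml.html.soupparser delegates parsing to BeautifulSoup, so the second variant additionally requires the beautifulsoup4 package to be installed.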
Interesting! Thanks for the info. I'll definitely try that!
Another option is to parse the HTML on your end and pass an already parsed tree (in lxml.html format) to the extruct library; most syntaxes support that as of the latest release. For example, we internally use the HTML5 parser https://github.com/kovidgoyal/html5-parser/ with treebuilder='lxml_html' (happy to share more details), which is more forgiving than the default lxml.html parser.
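A minimal sketch of that approach, assuming extruct.extract accepts a pre-parsed lxml tree as described above:

import extruct
from html5_parser import parse

html = '...'
# Build an lxml.html tree with the more lenient HTML5 parser
tree = parse(html, treebuilder='lxml_html')
extruct.extract(tree, syntaxes=['json-ld'])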
Hi everyone, I believe this is fundamentally an encoding issue, as vasniktel suggested. Feed extruct bytes directly, instead of strings that were (possibly incorrectly) decoded as UTF-8, to prevent it from happening.
Example:
import requests
import extruct
u = "https://edition.cnn.com/2023/08/09/politics/georgia-medicaid-eligibility-work-requirements/index.html"
r = requests.get(u) # note that r.content is a bytes object
# crashes
extruct.extract(r.content.decode("utf-8"))
# works
extruct.extract(r.content)
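The likely reason (my assumption, not verified against extruct internals): when given bytes, lxml can detect the document's encoding from the document itself, whereas a string decoded with the wrong charset may already be corrupted before the parser ever sees it.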
I made a small script in order to try the scraping process.
I have a case where, if I use extruct as a CLI, I get lots of information about the extracted schema:
extruct [url]
However, if I use the following for the same URL:
schema = extruct.extract(html_content, base_url=url)
I get the error "lxml.etree.ParserError: Document is empty".
The URL is valid, and the content of html_content (response.text) is valid and complete. I also tried a fresh Python environment where I installed only extruct, and I still get the error.
Any insights into why it fails when called from Python code?
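For context, a minimal sketch of what such a script might look like, assuming the page is fetched with requests (the comment mentions response.text); per the earlier comments, passing response.content (bytes) instead may avoid the error:

import requests
import extruct

url = '...'  # the URL in question (not given above)
response = requests.get(url)
html_content = response.text  # decoded str; reported to trigger the ParserError
schema = extruct.extract(html_content, base_url=url)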