scrapinghub / extruct

Extract embedded metadata from HTML markup
BSD 3-Clause "New" or "Revised" License

lxml.etree.ParserError: Document is empty #207

Open lironesamoun opened 1 year ago

lironesamoun commented 1 year ago

I made a small script to try the scraping process.

When I use extruct as a CLI (extruct [url]), I get lots of information about the extracted schema.

However, if I call schema = extruct.extract(html_content, base_url=url) on the same URL, I get the error "lxml.etree.ParserError: Document is empty". The URL is valid, and the content of html_content (response.text) is valid and complete.

I also tried a fresh Python environment with only extruct installed, and I still get the error.

import requests
import sys
import extruct

def get_html(url):
    # Fetch the page and return its decoded text, or None on a non-200 status
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        return None

# Check if URL is provided as a command-line argument
if len(sys.argv) < 2:
    print("Please provide a URL as a command-line argument.")
    sys.exit(1)

url = sys.argv[1]  # Get the URL from the command-line argument
html_content = get_html(url)
if html_content:
    #print(html_content)
    schema = extruct.extract(html_content, base_url=url)
    print(schema)
else:
    print("Failed to retrieve HTML.")

Any insights into why it fails when called from Python code?

Vasniktel commented 1 year ago

I'm facing the same issue for CNN articles (e.g. https://edition.cnn.com/2023/08/09/politics/georgia-medicaid-eligibility-work-requirements/index.html). It seems that lxml.etree.fromstring(resp.text, parser=lxml.html.HTMLParser()) returns None for some reason. I haven't investigated further, but this looks like an issue in lxml itself.
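A minimal reproduction of that check (a rough sketch, assuming requests and lxml are installed):

import requests
import lxml.etree
import lxml.html

resp = requests.get("https://edition.cnn.com/2023/08/09/politics/georgia-medicaid-eligibility-work-requirements/index.html")
# Parse the decoded text the same way extruct does internally
root = lxml.etree.fromstring(resp.text, parser=lxml.html.HTMLParser())
print(root)  # None here, which is what later trips extruct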

Vasniktel commented 1 year ago

I did some more analysis. For the same article, goose3 works fine while extruct crashes, even though both libraries use lxml. The only difference is that goose3 applies its goose3.utils.encoding.smart_str function (https://github.com/goose3/goose3/blob/d3c404a79e0e15b7957355083bd5a7590d4103ba/goose3/parsers.py#L59). I checked it manually, and it does the trick for me.

Additionally, there is an lxml.html.soupparser module that can also be used.

To summarize, either of the two worked for me:

from goose3.utils.encoding import smart_str

html = '...'
extruct.extract(smart_str(html), syntaxes=['json-ld'])

from lxml.html import soupparser

html = '...'
extruct.extract(soupparser.fromstring(html), syntaxes=['json-ld'])

lironesamoun commented 1 year ago

Interesting! Thanks for the info. I'll definitely try that!

lopuhin commented 1 year ago

Another option is to parse the HTML on your end and pass an already-parsed tree (in lxml.html format) to the extruct library; most syntaxes support that as of the latest release. For example, we internally use the HTML5 parser https://github.com/kovidgoyal/html5-parser/ with treebuilder='lxml_html' (happy to share more details), which is more compatible than the default lxml.html parser.
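A minimal sketch of that approach (assuming the html5-parser package is installed; the URL is a placeholder and the exact internal setup may differ):

import requests
import extruct
from html5_parser import parse

url = "https://example.com/article"  # placeholder URL
r = requests.get(url)
# Build an lxml.html tree with the HTML5 parser, then hand it to extruct
tree = parse(r.content, treebuilder='lxml_html')
data = extruct.extract(tree, base_url=url, syntaxes=['json-ld'])
print(data)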

trifle commented 1 year ago

Hi everyone, I believe this is fundamentally an encoding issue, as Vasniktel suggested. To prevent it, feed extruct bytes directly instead of (mistakenly) UTF-8-decoded strings.

Example:

import requests
import extruct

u = "https://edition.cnn.com/2023/08/09/politics/georgia-medicaid-eligibility-work-requirements/index.html"
r = requests.get(u) # note that r.content is a bytes object

# crashes
extruct.extract(r.content.decode("utf-8"))
# works
extruct.extract(r.content)
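This likely works because lxml can run its own encoding detection on bytes input (e.g. via the page's meta charset), whereas a string decoded with the wrong codec may leave the parser with nothing it recognizes as a document.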