scrapinghub / extruct

Extract embedded metadata from HTML markup
BSD 3-Clause "New" or "Revised" License

Crash on JSONDecodeError from body of YouTube page #171

Open wjdp opened 3 years ago

wjdp commented 3 years ago

I have some code to pull metadata from YouTube:

import extruct
import requests

response = requests.get(video_url)
metadata = extruct.extract(response.text, base_url="https://youtube.com")

I have noticed it crashing recently, but only on some videos.

No crash:
https://www.youtube.com/watch?v=ZY48KUAZKhM
https://www.youtube.com/watch?v=ZlVI7YJGHq0

Crash:
https://www.youtube.com/watch?v=987wzJ2NHBE
https://www.youtube.com/watch?v=0-EF60neguk

Common factor among those that crash is apostrophes in the channel name!

Traceback (most recent call last):
  File "/home/will/local/breda/src/dredger/ingest/tests/test_youtube.py", line 72, in test_one
    youtube.get_video_data("https://www.youtube.com/watch?v=987wzJ2NHBE")
  File "/home/will/local/breda/src/dredger/ingest/youtube.py", line 46, in get_video_data
    metadata = extruct.extract(response.text, base_url="https://youtube.com")
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/_extruct.py", line 108, in extract
    output[syntax] = list(extract(document, base_url=base_url))
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/jsonld.py", line 25, in extract_items
    return [
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/jsonld.py", line 25, in <listcomp>
    return [
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/jsonld.py", line 38, in _extract_items
    data = jstyleson.loads(HTML_OR_JS_COMMENTLINE.sub('', script),strict=False)
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/jstyleson.py", line 123, in loads
    return json.loads(dispose(text), **kwargs)
  File "/usr/lib/python3.8/json/__init__.py", line 370, in loads
    return cls(**kw).decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 211 (char 210)

Haven't had a chance today to dig into much beyond triaging the above.
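
The last frame of the traceback can be reproduced with the standard library alone. The sketch below assumes the page's JSON-LD contains a JavaScript-style hex escape such as \x27 in place of the apostrophe; the payload is made up for illustration and is not copied from a YouTube page. It also shows that strict=False, which extruct passes through to json.loads, does not relax the rules about escape sequences.

```python
import json

# Hypothetical JSON-LD payload: the channel name carries a JavaScript-style
# hex escape (\x27 for an apostrophe), which the JSON grammar does not allow.
payload = r'{"@type": "VideoObject", "author": "Will\x27s Channel"}'

error = None
try:
    # strict=False only permits raw control characters inside strings;
    # it does not make unknown backslash escapes acceptable.
    json.loads(payload, strict=False)
except json.JSONDecodeError as exc:
    error = exc
    # Report the message and the character offset the decoder stopped at.
    print(exc.msg, "at char", exc.pos)
```

The offset reported by the decoder points at or next to the offending backslash, so slicing the script text around that position is a quick way to see which escape triggered the failure.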

udit19281 commented 2 years ago

I haven't been able to replicate the issue. Your crash links point to videos that have been removed, which may be why you are getting this error. I suggest checking the video links before passing them to extract. Here is the code that I used:

Code:

import extruct
import requests
from w3lib.html import get_base_url

crash_links = ['https://www.youtube.com/watch?v=987wzJ2NHBE', 'https://www.youtube.com/watch?v=0-EF60neguk']

for video_url in crash_links:
    response = requests.get(video_url)
    base_url = get_base_url(response.text, response.url)
    metadata = extruct.extract(response.text, base_url=base_url, uniform=True, syntaxes=['json-ld', 'microdata', 'opengraph'])
    print(metadata)

Output:

{'microdata': [], 'json-ld': [], 'opengraph': []}
{'microdata': [], 'json-ld': [], 'opengraph': []}

AbhinavSE commented 2 years ago

I replicated the issue using these YouTube links:

https://www.youtube.com/watch?v=-J2e8OlBdPs
https://www.youtube.com/watch?v=qP07oyFTRXc
https://www.youtube.com/watch?v=BUrnfkxwozM

As @wjdp suggested, the cause is the apostrophe in the channel name. json.loads() raises an error when the input contains hex escapes such as "\x27" (the apostrophe), which are valid in JavaScript string literals but not in JSON. I created pull request #195, which replaces these hex escapes with the characters they represent before the text is passed to json.loads().
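
A minimal sketch of this kind of preprocessing follows. It is not the code from pull request #195: the helper name loads_tolerating_hex_escapes and the sample input are hypothetical, and instead of substituting the character itself it rewrites each \xNN escape to the equivalent \u00NN escape, so that an escaped quote (\x22) cannot terminate the string early. Escaped backslashes are skipped so that a literal backslash followed by "x27" is left alone.

```python
import json
import re

# Matches any backslash escape; group 1 is set only for \xNN hex escapes.
_ESCAPE = re.compile(r'\\(?:x([0-9a-fA-F]{2})|.)', re.DOTALL)


def _to_unicode_escape(match):
    hex_digits = match.group(1)
    if hex_digits is None:
        # Not a hex escape (for example \\ or \n): leave it untouched.
        return match.group(0)
    # \xNN is not valid JSON, but \u00NN denotes the same code point.
    return '\\u00' + hex_digits


def loads_tolerating_hex_escapes(text):
    """Hypothetical helper: parse JSON text that may contain \\xNN escapes."""
    return json.loads(_ESCAPE.sub(_to_unicode_escape, text), strict=False)


# Made-up input: one real hex escape and one escaped backslash before "x27".
raw = r'{"author": "Will\x27s Channel", "path": "C:\\x27"}'
data = loads_tolerating_hex_escapes(raw)
print(data["author"])
print(data["path"])
```

Matching every backslash escape and rewriting only the hex ones keeps the substitution from misreading the second backslash of an escaped pair as the start of a new escape.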