Open wjdp opened 3 years ago
I haven't been able to replicate the issue. The links in your Crash list point to videos that have been removed, which may be why you are getting this error. I suggest checking the video links before passing them to extruct. Here is the code that I used:
Code:
import extruct
import requests
from w3lib.html import get_base_url

crash_links = [
    'https://www.youtube.com/watch?v=987wzJ2NHBE',
    'https://www.youtube.com/watch?v=0-EF60neguk',
]

for video_url in crash_links:
    response = requests.get(video_url)
    base_url = get_base_url(response.text, response.url)
    metadata = extruct.extract(
        response.text,
        base_url=base_url,
        uniform=True,
        syntaxes=['json-ld', 'microdata', 'opengraph'],
    )
    print(metadata)
Output:
{'microdata': [], 'json-ld': [], 'opengraph': []}
{'microdata': [], 'json-ld': [], 'opengraph': []}
I replicated the issue using these YouTube links: https://www.youtube.com/watch?v=-J2e8OlBdPs, https://www.youtube.com/watch?v=qP07oyFTRXc and https://www.youtube.com/watch?v=BUrnfkxwozM
As @wjdp suggested, the cause is the apostrophe in the channel name. json.loads() raises an error when the input contains hex escapes such as "\x27" (the apostrophe), because JSON does not allow \x escape sequences. I created pull request #195, which replaces the hex codes with the characters they represent before passing the text to json.loads().
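For anyone who wants to see the failure in isolation, here is a minimal sketch using only the standard library. It is not the code from the pull request. The sample string and the helper name `hex_to_unicode_escapes` are made up for illustration. The helper also takes a slightly different route from the one described above: it rewrites each `\xNN` as the equivalent `\u00NN` escape, which JSON does accept, instead of inserting the raw character. That way a sequence such as `\x22` (a double quote) cannot break the surrounding string.

```python
import json
import re

# Hypothetical JSON-LD fragment; the raw string keeps the backslash literal,
# so the text really contains the four characters \ x 2 7.
raw = r'{"name": "Tom\x27s Channel"}'


def hex_to_unicode_escapes(text):
    """Rewrite \\xNN sequences as \\u00NN, which is a valid JSON escape.

    Simplification: a backslash that is itself escaped (\\\\x27) is not
    treated specially here.
    """
    return re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda m: "\\u00" + m.group(1),
        text,
    )


# The unmodified text is rejected because \x is not a JSON escape.
try:
    json.loads(raw)
    failed = False
except json.JSONDecodeError:
    failed = True

# After the rewrite the same text parses and the apostrophe is restored.
fixed = json.loads(hex_to_unicode_escapes(raw))
print(failed, fixed)
```

Whether extruct should apply a rewrite like this before parsing, or only as a fallback after the first parse fails, is a design choice this sketch does not settle.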
I have some code to pull metadata from YouTube.
I have noticed some recent crashing, but only on some videos.
No crash:
https://www.youtube.com/watch?v=ZY48KUAZKhM
https://www.youtube.com/watch?v=ZlVI7YJGHq0
Crash:
https://www.youtube.com/watch?v=987wzJ2NHBE
https://www.youtube.com/watch?v=0-EF60neguk
Common factor among those that crash is apostrophes in the channel name!
Haven't had a chance today to dig into much beyond triaging the above.