openzim / youtube

Create a ZIM file from a Youtube channel/username/playlist
GNU General Public License v3.0

New crash scenario pattern with mindfield scraping #138

Closed kelson42 closed 3 years ago

kelson42 commented 3 years ago

See https://farm.openzim.org/pipeline/b2e47e70159a9be6d08f6ff5/debug

[youtube2zim::2021-01-10 13:20:01,636] INFO:download all author's profile pictures
[youtube2zim::2021-01-10 13:20:01,636] DEBUG:query youtube-api for Channel #UC6nSFpj9HTCZ5t-N3Rm3-HA
[youtube2zim::2021-01-10 13:20:01,832] INFO:update general metadata
[youtube2zim::2021-01-10 13:20:01,832] DEBUG:query youtube-api for Channel #CSANnvayMtMLizmxt8koEHH5nj1ZR7xq6Pnu-VMYg
[youtube2zim::2021-01-10 13:20:01,959] ERROR:FAILED. An error occurred: 'items'
[youtube2zim::2021-01-10 13:20:01,959] ERROR:'items'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/youtube2zim-2.1.12-py3.8.egg/youtube2zim/entrypoint.py", line 202, in main
    scraper.run()
  File "/usr/local/lib/python3.8/site-packages/youtube2zim-2.1.12-py3.8.egg/youtube2zim/scraper.py", line 310, in run
    self.update_metadata()
  File "/usr/local/lib/python3.8/site-packages/youtube2zim-2.1.12-py3.8.egg/youtube2zim/scraper.py", line 727, in update_metadata
    main_channel_json = get_channel_json(self.main_channel_id)
  File "/usr/local/lib/python3.8/site-packages/youtube2zim-2.1.12-py3.8.egg/youtube2zim/youtube.py", line 88, in get_channel_json
    channel_json = req.json()["items"][0]
KeyError: 'items'

rgaudin commented 3 years ago

This is happening because of inconsistent data returned by the Youtube API:

When querying playlist PLZRRxQcaEjA4qyEuYfAMCazlL0vQDkIj2, the API tells us the creator_id is Channel CSANnvayMtMLizmxt8koEHH5nj1ZR7xq6Pnu-VMYg.

As you can see, this channel doesn't raise a 404 as incorrect IDs do, but the page is blank. If you click on the playlist link above, it says the channel is Channel vsauce1 (which apparently is the same as User Vsauce).

So we have no way, from the API, to know about Vsauce. We can only query the channel ID it returned, which fails because the API returns no item for that ID.
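To illustrate the failure mode, here is a minimal sketch of how the lookup could report the problem explicitly instead of raising a bare KeyError. It would not recover the real channel, it would only make the error clearer. The names `ChannelNotFoundError` and `first_channel_item` are hypothetical and not part of youtube2zim, and the response shape (a successful reply with no "items" key) is an assumption based on the traceback above.

```python
from typing import Any, Dict


class ChannelNotFoundError(Exception):
    """Hypothetical error: the API answered but returned no channel."""


def first_channel_item(payload: Dict[str, Any], channel_id: str) -> Dict[str, Any]:
    """Return the first channel entry from an already-parsed API response.

    Assumption: when the channel ID matches nothing, the response either
    omits the "items" key or carries an empty list.
    """
    items = payload.get("items") or []
    if not items:
        raise ChannelNotFoundError(
            f"YouTube API returned no item for channel {channel_id}"
        )
    return items[0]
```

With a guard like this, the log would name the offending channel ID rather than just 'items'.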

I don't want to fix this because:

I guess this is because it was Youtube Originals content (reserved for paying members) that was opened up later.

I suggest we close this and disable the recipe.