openzim / youtube

Create a ZIM file from a Youtube channel/username/playlist
GNU General Public License v3.0

New crash scenario in Zimfarm with UniversScience #76

Closed kelson42 closed 4 years ago

kelson42 commented 4 years ago

Not sure if this is a problem with zimfarm or youtube2zim:

[2020-04-10 06:44:28,732] DEBUG:query youtube-api for Channel #UCS_7tplUgzJG4DhA16re5Yg
[2020-04-10 06:44:28,953] DEBUG:query youtube-api for Channel #UCfxwT02Bu5R7l21uMAu8H1w
[2020-04-10 06:44:29,319] DEBUG:query youtube-api for Channel #UC9NmlZnGn7fwnKBS4mT7xYQ
[2020-04-10 06:44:29,835] DEBUG:query youtube-api for Channel #UC-4t_pSIeSY-dPSmYEvxi8Q
[2020-04-10 06:44:30,198] INFO:update general metadata
[2020-04-10 06:44:30,600] INFO:creating HTML files
[2020-04-10 06:44:34,517] ERROR:FAILED. An error occurred: 'rZe3t2qKSqg'
[2020-04-10 06:44:34,518] ERROR:'rZe3t2qKSqg'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/youtube2zim-2.1.3-py3.8.egg/youtube2zim/entrypoint.py", line 182, in main
    scraper.run()
  File "/usr/local/lib/python3.8/site-packages/youtube2zim-2.1.3-py3.8.egg/youtube2zim/scraper.py", line 310, in run
    self.make_html_files(succeeded)
  File "/usr/local/lib/python3.8/site-packages/youtube2zim-2.1.3-py3.8.egg/youtube2zim/scraper.py", line 744, in make_html_files
    author = videos_channels[video_id]
KeyError: 'rZe3t2qKSqg'

More at https://farm.openzim.org/pipeline/5e75046fb4859e124e89572e/debug

rgaudin commented 4 years ago

It's a scraper bug. Will have to look at logs in detail (417MB of logs) to find out why we don't have that channel's details at this stage.

rgaudin commented 4 years ago

@satyamtg could you take a look at that next? We'll release a new version once we have this fixed.

satyamtg commented 4 years ago

@satyamtg Yup sure. On it.

satyamtg commented 4 years ago

@kelson42 @rgaudin I looked into it. The video for which we get the KeyError violated YouTube's terms and has been removed (https://www.youtube.com/watch?v=rZe3t2qKSqg). However, it seems that this video ID somehow isn't filtered out in make_html_files().

videos = load_json(self.cache_dir, "videos").values()
# filter videos so we only include the ones we could retrieve
videos = list(filter(is_present, videos))
videos_channels = load_json(self.cache_dir, "videos_channels")

It seems that videos contains this video ID but videos_channels does not. I think this is somehow related to the video being removed from YouTube. Also, the logs show that the video was indeed downloaded:

{"log":"[youtube] rZe3t2qKSqg: Downloading webpage\n","stream":"stdout","time":"2020-04-01T21:09:56.069698497Z"}
{"log":"[youtube] rZe3t2qKSqg: Downloading MPD manifest\n","stream":"stdout","time":"2020-04-01T21:09:56.66793603Z"}
{"log":"[youtube] rZe3t2qKSqg: Downloading thumbnail ...\n","stream":"stdout","time":"2020-04-01T21:09:57.032787488Z"}
{"log":"[youtube] rZe3t2qKSqg: Writing thumbnail to: /output/build/videos/rZe3t2qKSqg/video.jpg\n","stream":"stdout","time":"2020-04-01T21:09:57.141971031Z"}
{"log":"[download] Destination: /output/build/videos/rZe3t2qKSqg/video.f243.webm\n","stream":"stdout","time":"2020-04-01T21:09:57.359422841Z"}
{"log":"[download] Destination: /output/build/videos/rZe3t2qKSqg/video.f251.webm\n","stream":"stdout","time":"2020-04-01T21:10:34.888337369Z"}
{"log":"[ffmpeg] Merging formats into \"/output/build/videos/rZe3t2qKSqg/video.webm\"\n","stream":"stdout","time":"2020-04-01T21:10:36.008388842Z"}
{"log":"Deleting original file /output/build/videos/rZe3t2qKSqg/video.f243.webm (pass -k to keep)\n","stream":"stdout","time":"2020-04-01T21:10:36.161601286Z"}
{"log":"Deleting original file /output/build/videos/rZe3t2qKSqg/video.f251.webm (pass -k to keep)\n","stream":"stdout","time":"2020-04-01T21:10:36.161980877Z"}
{"log":"[2020-04-01 21:10:36,196] INFO:recompress /output/build/videos/rZe3t2qKSqg/video.webm -\u003e /output/build/videos/rZe3t2qKSqg/video.webm video_format='webm' low_quality=True\n","stream":"stdout","time":"2020-04-01T21:10:36.196585183Z"}
{"log":"[2020-04-01 21:10:36,196] DEBUG:ffmpeg -y -i \"file:/output/build/videos/rZe3t2qKSqg/video.webm\" -codec:v \"libvpx\" -quality \"best\" -cpu-used \"0\" -b:v \"300k\" -qmin \"30\" -qmax \"42\" -maxrate \"300k\" -bufsize \"1000k\" -threads \"8\" -vf \"scale='480:trunc(ow/a/2)*2'\" -codec:a \"libvorbis\" -b:a \"128k\" \"file:/output/build/videos/rZe3t2qKSqg/video.tmp.webm\"\n","stream":"stdout","time":"2020-04-01T21:10:36.196675419Z"}
{"log":"Input #0, matroska,webm, from 'file:/output/build/videos/rZe3t2qKSqg/video.webm':\n","stream":"stderr","time":"2020-04-01T21:10:36.306819569Z"}
{"log":"Output #0, webm, to 'file:/output/build/videos/rZe3t2qKSqg/video.tmp.webm':\n","stream":"stderr","time":"2020-04-01T21:10:36.379676229Z"}
{"log":"DEBUG:https://www.googleapis.com:443 \"GET /youtube/v3/videos?id=CdmdtVFPIqg%2CUUqPFVGlHBU%2Cpoc9J0HFyT4%2CseBV4Se8ktw%2CGlshX1dLCvA%2CAIRQWGSJomE%2CZ96X4HoQg1E%2C-gxCYfYBMI4%2ChQagX1I65Ik%2CwGAVu7FwL1E%2CdmhhMse-asI%2CnueZ1f_eCrE%2CUOfHyJVAtSg%2C3AHkPpyfShc%2CpnR--xoVr7s%2ChYdeIvdxIhI%2Chk0HZksVBB8%2CsKRPzAVD8Yg%2CFhKs9LKzgqE%2CisSFVNjOhoU%2C5T680uxoL5c%2C0GXibzFkJW0%2CP-Cgoe3_UmE%2Ct2PtAuoOWSo%2CnWBYoU3KLkU%2CU5DDbZ_Vk8U%2C-8egj9vh6oI%2CmyCPk6IAGRE%2Cn6F0QTnunV0%2CsxL5jSgYmxU%2C7xsyV7x38uI%2CPAfb0wcMKF8%2CZIN_n1Nz6TM%2CeQN4oJO6AJY%2CLgXe7DLePg4%2CzlEF6QC55_0%2CENrbwhh5uQw%2CkiKN5Um4K3M%2CelFm0_7BWHo%2CvtdINjLU3_4%2C26oQiEFIgig%2CrZe3t2qKSqg%2CmzHp3t9VdZ0%2CpH7rUt6cVlk%2CXESA69zjzIk%2C3-MtAeC-kSE%2CAzcvUpfDFvA%2CHFXsUk0_8_4%2CvHROj30Srqo%2CMiqw3V5Qzf8\u0026part=snippet\u0026key=AIzaSyCLVbSyvEPhTUzP8XO8VcUSHzhyDtmUqQA\u0026maxResults=50 HTTP/1.1\" 200 None\n","stream":"stderr","time":"2020-04-10T06:43:51.948381267Z"}
{"log":"[2020-04-10 06:44:34,517] ERROR:FAILED. An error occurred: 'rZe3t2qKSqg'\n","stream":"stdout","time":"2020-04-10T06:44:34.518180052Z"}
{"log":"[2020-04-10 06:44:34,518] ERROR:'rZe3t2qKSqg'\n","stream":"stdout","time":"2020-04-10T06:44:34.520728026Z"}
{"log":"KeyError: 'rZe3t2qKSqg'\n","stream":"stdout","time":"2020-04-10T06:44:34.520829366Z"}
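The mismatch between the two cache files can be checked with a simple set difference. This is only a rough sketch: it assumes that both cached JSON objects are dictionaries keyed by video ID (the snippet above only shows that videos is a dictionary, not what its keys are), and find_videos_without_channel is a hypothetical helper, not part of youtube2zim.

```python
def find_videos_without_channel(videos, videos_channels):
    """Return the video IDs present in `videos` but absent from `videos_channels`.

    Assumption: both arguments are dicts keyed by video ID, e.g. as loaded
    from the cached "videos" and "videos_channels" JSON files.
    """
    # Iterating a dict yields its keys, so set() collects the video IDs.
    missing = set(videos) - set(videos_channels)
    return sorted(missing)


if __name__ == "__main__":
    # Toy data standing in for the cache contents (not real cache entries).
    videos = {"CdmdtVFPIqg": {}, "rZe3t2qKSqg": {}}
    videos_channels = {"CdmdtVFPIqg": {"channelId": "hypothetical-channel"}}
    print(find_videos_without_channel(videos, videos_channels))
```

With the toy data above, only the removed video's ID is reported as missing.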

It seems that the video was taken down after it was downloaded but before its channel info was retrieved by get_videos_authors_info(), so it never made it into the videos_channels JSON. We can avoid this error by checking that the video is present in the videos_channels JSON before using it to prepare the HTML. Let me know your thoughts. I don't think reproducing this error is practical, since it was a coincidence that the video was taken down from YouTube at that very moment, but we can surely prevent it from happening again. A good solution would be to catch the KeyError, skip HTML creation for that video, and clean up its files. Also, we should check that a video still exists on YouTube before trying to use it from the cache.
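A minimal sketch of the proposed guard, assuming videos_channels is a dictionary keyed by video ID as the traceback suggests. The helper name split_by_known_channel and the logger are hypothetical; the actual fix inside make_html_files() may look different, and cleaning up the already downloaded files is left out here.

```python
import logging

logger = logging.getLogger(__name__)


def split_by_known_channel(video_ids, videos_channels):
    """Partition video IDs by whether channel details are available.

    Returns (usable, skipped): `usable` is a list of (video_id, author)
    pairs, `skipped` is a list of video IDs missing from `videos_channels`.
    """
    usable, skipped = [], []
    for video_id in video_ids:
        try:
            author = videos_channels[video_id]
        except KeyError:
            # Channel details were never fetched (e.g. the video was removed
            # in between), so skip HTML creation instead of crashing.
            logger.warning("no channel details for %s, skipping", video_id)
            skipped.append(video_id)
        else:
            usable.append((video_id, author))
    return usable, skipped
```

The caller would then build HTML only for the usable pairs and could remove the downloaded files of the skipped IDs.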