openzim / openedx

Open edX (to zim) scraper
GNU General Public License v3.0

Missing videos on greek phzh #145

Closed Popolechien closed 4 years ago

Popolechien commented 4 years ago

It's all in the title, but the page is here: http://tmp.kiwix.org:9991/phzh_core-greek-one_el_2020-08/A/course/core-ellenika-01/kalosorises-ksekina-edo/eisagoge-sto-programma/binteo-skhetika-sumboules-kai-mikra-mustika/index.html

rgaudin commented 4 years ago

Looking at the creation logs, it looks like there were errors converting the youtube-downloaded video to webm.

satyamtg commented 4 years ago

This is indeed an ffmpeg/webm issue. The video apparently fails to convert once youtube_dl has downloaded it in parts. Here's the output of running youtube_dl on the video for both the Mp4 and WebM formats:

For Mp4

(venv) ➜  image_test python3 test.py
[youtube] wABgK72_QzU: Downloading webpage
[youtube] wABgK72_QzU: Downloading MPD manifest
[download] Destination: Καλωσόρισες – Ξεκίνα εδώ-wABgK72_QzU.mp4
[download] 100% of 8.63MiB in 00:09

For WebM

(venv) ➜  image_test python3 test.py
[youtube] wABgK72_QzU: Downloading webpage
[youtube] wABgK72_QzU: Downloading MPD manifest
[dashsegments] Total fragments: 25
[download] Destination: Καλωσόρισες – Ξεκίνα εδώ-wABgK72_QzU.f303.webm
[download] 100% of 11.67MiB in 01:04
[dashsegments] Total fragments: 15
[download] Destination: Καλωσόρισες – Ξεκίνα εδώ-wABgK72_QzU.f251.webm
[download] 100% of 1.82MiB in 00:03
[ffmpeg] Merging formats into "Καλωσόρισες – Ξεκίνα εδώ-wABgK72_QzU.webm"
ERROR: Conversion failed!

The ffmpeg debug output gave:

[webm @ 0x7fe4a182d800] Application provided invalid, non monotonically increasing dts to muxer in stream 0: 5620 >= 5600

rgaudin commented 4 years ago

This is an FFmpeg issue that's apparently due to the video files sent by youtube (it occurs during the muxing of the audio-only and video-only streams).
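For reference, the check behind the ffmpeg message quoted above can be sketched roughly as follows. This is only an illustration, not FFmpeg's code: the function name and the sample timestamps are made up, and it assumes the muxer requires strictly increasing decoding timestamps (dts) within a stream, with the message printing the previous dts followed by the offending one.

```python
from typing import Optional, Sequence, Tuple


def first_non_monotonic_dts(
    dts_values: Sequence[int],
) -> Optional[Tuple[int, int, int]]:
    """Return (index, previous_dts, current_dts) for the first packet whose
    dts does not strictly increase, or None if the sequence is fine.

    Hypothetical helper for illustration; it only mimics the kind of
    comparison reported as "5620 >= 5600" in the log.
    """
    for index in range(1, len(dts_values)):
        previous, current = dts_values[index - 1], dts_values[index]
        if previous >= current:
            return index, previous, current
    return None


# Made-up timestamps ending with the pair seen in the error message.
problem = first_non_monotonic_dts([5580, 5620, 5600])
if problem is not None:
    index, previous, current = problem
    print(f"packet {index}: non monotonically increasing dts: {previous} >= {current}")
```

Under that assumption, a single out-of-order packet in the downloaded video stream is enough for the merge to be rejected.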

Unfortunately, there's no easy way to fix this, as we have no way of knowing in advance that it could happen. I suggest that in this scraper (because we know this happens here but we haven't seen it elsewhere), we change the format to best[ext={ext}]/best. It will fix this particular situation.
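A minimal sketch of what the suggested format string could look like as youtube_dl options. The helper name and the output template are assumptions for illustration, not the scraper's actual code; only the "format" value comes from the suggestion above. In youtube_dl's format selection, "best" picks a single file that already contains both video and audio, so no ffmpeg merge step is involved.

```python
def build_ydl_options(ext: str) -> dict:
    """Build a youtube_dl options dict preferring a single pre-muxed file.

    Hypothetical helper: the real scraper may assemble its options
    differently. "format" and "outtmpl" are standard youtube_dl options.
    """
    return {
        # Best single file in the requested container, else best single
        # file in any container (which would then need re-encoding).
        "format": f"best[ext={ext}]/best",
        # Assumed output naming, for illustration only.
        "outtmpl": "%(id)s.%(ext)s",
    }


options = build_ydl_options("webm")
print(options["format"])

# Usage, assuming youtube_dl is installed (not run here):
#   import youtube_dl
#   with youtube_dl.YoutubeDL(options) as ydl:
#       ydl.download(["https://www.youtube.com/watch?v=wABgK72_QzU"])
```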

The drawback I see is that when requesting webm without low-quality (not our zimfarm use case), all videos would get re-encoded from mp4 to webm instead of just being downloaded and muxed.

@satyamtg what do you think?

satyamtg commented 4 years ago

I think we should add a high-quality template in scraperlib that keeps the quality as consistent as it can be. Otherwise, we can just make the suggested change in this scraper and always have low-quality videos when using WebM.

rgaudin commented 4 years ago

Online demo updated with the latest ZIM; the problem is still present.

rgaudin commented 4 years ago

@Popolechien finally working; I've updated the demo.