openzim / kolibri

Convert a Kolibri channel in ZIM file(s)
GNU General Public License v3.0

Extremely high CPU use by ffmpeg #75

Closed kevinmcmurtrie closed 7 months ago

kevinmcmurtrie commented 10 months ago

MPEG4 to WebM transcoding is extremely slow. https://farm.openzim.org/pipeline/c06a8148-7d9a-422c-b5b4-abfe93d51168 has been crawling along for two weeks while using 100% of all CPUs.

kevinmcmurtrie commented 10 months ago

Notes: I can't reproduce this slowness outside of the docker container.

1) There may be conflicting FFmpeg options that are causing excessive backtracking and retries. The command below sets both minimum and maximum bitrates and both minimum and maximum quantizer values (-qmin/-qmax), which may force the encoder to retry.

'ffmpeg', '-y', '-i', 'file:/output/tmpmf3sut4d/afe1380b936363857ce244eb5eda4019.mp4', '-max_muxing_queue_size', '9999', '-codec:v', 'libvpx', '-quality', 'best', '-b:v', '300k', '-maxrate', '300k', '-minrate', '300k', '-qmin', '30', '-qmax', '42', '-vf', "scale='480:trunc(ow/a/2)*2'", '-codec:a', 'libvorbis', '-ar', '44100', '-b:a', '48k', 'file:/tmp/tmph5ib_ryv/video.tmp.webm'

Something simpler may help. This targets a bitrate of 300 kbit/s with a 512 kbit buffer and gives a wider quality range:

'ffmpeg', '-y', '-i', 'file:/output/tmpmf3sut4d/afe1380b936363857ce244eb5eda4019.mp4', '-max_muxing_queue_size', '9999', '-codec:v', 'libvpx', '-quality', 'best', '-b:v', '300k', '-bufsize', '512k', '-qmin', '20', '-vf', "scale='480:trunc(ow/a/2)*2'", '-codec:a', 'libvorbis', '-ar', '44100', '-b:a', '48k', 'file:/tmp/tmph5ib_ryv/video.tmp.webm'

At least for me, the second one encodes faster, looks better, and consumes about 1/3 the bandwidth. There's no minimum bitrate and no maximum q, so it can fly past all of those motionless whiteboard images.
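
For a side-by-side comparison, both command lines above can be generated from one helper. This is a minimal sketch, not the scraper's actual code: `build_webm_args`, its `constrained` parameter, and `time_encode` are hypothetical names, and the paths are whatever the caller supplies. The flags themselves are copied unchanged from the two commands above.

```python
import subprocess
import time


def build_webm_args(src, dst, constrained):
    """Return the ffmpeg argument list for a low-quality VP8/WebM encode.

    constrained=True reproduces the current settings quoted above
    (bitrate pinned to 300k by -minrate/-maxrate, q clamped to 30..42).
    constrained=False reproduces the simpler proposal
    (300k target, 512k buffer, only a lower q bound).
    """
    if constrained:
        rate_control = [
            "-b:v", "300k", "-maxrate", "300k", "-minrate", "300k",
            "-qmin", "30", "-qmax", "42",
        ]
    else:
        rate_control = ["-b:v", "300k", "-bufsize", "512k", "-qmin", "20"]
    return [
        "ffmpeg", "-y", "-i", f"file:{src}",
        "-max_muxing_queue_size", "9999",
        "-codec:v", "libvpx", "-quality", "best",
        *rate_control,
        "-vf", "scale='480:trunc(ow/a/2)*2'",
        "-codec:a", "libvorbis", "-ar", "44100", "-b:a", "48k",
        f"file:{dst}",
    ]


def time_encode(args):
    """Run one encode and return wall-clock seconds (needs ffmpeg on PATH)."""
    start = time.monotonic()
    subprocess.run(args, check=True, capture_output=True)
    return time.monotonic() - start
```

Calling `time_encode(build_webm_args(src, "old.webm", True))` and then the same with `False` on one input, inside and outside the container, would give comparable timings and output sizes for both variants.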

2) The FFmpeg encoder may be old?

benoit74 commented 10 months ago

Thank you for reporting this and doing some tests. I will have a look into it.

benoit74 commented 10 months ago

Regarding versions, image ghcr.io/openzim/kolibri:1.1.0 is using:

benoit74 commented 10 months ago

Regarding ffmpeg settings, @rgaudin @kelson42 do we have any past issue which discusses why these settings were chosen? I imagine finding one preset to more or less rule them all is not an easy feat.

Encoding logic is coming from python-scraperlib and presets (we use the low quality webm version for Khan Academy recipe)

Webm low quality presets are coming from https://github.com/openzim/python-scraperlib/issues/14 (https://github.com/openzim/python-scraperlib/commit/78e2bb0e562b58f240436efb7b8700fa15deaa39#diff-2cc68edde814805fe24114313acdde91ae832adef02f7d0576675d74db3f7b58 more precisely), but I did not find any discussion there. They have probably been ported from the ted/youtube scrapers, but I failed to find any discussion over there either.

rgaudin commented 10 months ago

benoit74 commented 10 months ago

I tested suggested settings on https://studio.learningequality.org/content/storage/b/7/b71ca7f102ae16e4023c9f49b015d6b7.mp4

I do not find a significant visual difference in the resulting file (but this is obviously very personal).

I confirm that processing is a little faster (from 10 s to 8 s) and that the file is more than 3 times smaller (from 2.7MB to 768KB, while the original mp4 is 690KB).

I do not find any difference in processing time between running in Docker and running on the host directly (same machine), so there is probably something strange/unusual in the Docker setup on your machine.
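
The figures above can be turned into ratios with a few lines of arithmetic. This is only a sketch over the numbers reported in this comment; the constant names are made up for illustration, and treating 1 MB as 1000 KB is an assumption.

```python
# Figures reported above. Decimal units (1 MB = 1000 KB) are an assumption.
OLD_SIZE_KB = 2.7 * 1000   # output with the current settings
NEW_SIZE_KB = 768          # output with the suggested settings
SOURCE_MP4_KB = 690        # original mp4
OLD_SECONDS = 10
NEW_SECONDS = 8

# How many times smaller the new output is than the old one.
size_reduction = OLD_SIZE_KB / NEW_SIZE_KB
# Fraction of encoding time saved.
time_saving = (OLD_SECONDS - NEW_SECONDS) / OLD_SECONDS
# Size of the new output relative to the source mp4.
size_vs_source = NEW_SIZE_KB / SOURCE_MP4_KB

print(f"{size_reduction:.2f}x smaller, {time_saving:.0%} less time, "
      f"{size_vs_source:.2f}x the source size")
```

On this one sample the size gain is much larger than the time gain, which is consistent with the test not reproducing the extreme slowness reported at the top of the thread.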

benoit74 commented 10 months ago

> we chose to go with webm/vp9

vp9 or vp8? looking at the setting I believe we use vp8

rgaudin commented 10 months ago

Sorry it's a slip, vp8 of course

benoit74 commented 10 months ago

Progressing towards a merged PR on this will obviously need significant testing with many kinds of videos, and we (@Kiwix) probably won't have sufficient bandwidth for this in the coming months.

Contributions are of course more than welcome.

Note however that this effort might conflict with another initiative we might start around choosing a different video codec (and JS libs to fall back on when the reader/browser does not support this codec). The test set (and testing procedure) will nevertheless be very useful and will most probably be reused.

kevinmcmurtrie commented 10 months ago

I was going to kill this task on pixelmemory because it has built up over 161 GB of files...but not really. The filesystem compression ratio is over 3:1 so it's only 52GB on disk. That should not be happening for video files.

kelson42 commented 10 months ago

At this stage it looks like we might move from webm/vp8 to mpg4/h264. If we go that direction, we should reassess our ffmpeg command line (in particular for low quality).

benoit74 commented 10 months ago

> I was going to kill this task on pixelmemory because it has built up over 161 GB of files...but not really. The filesystem compression ratio is over 3:1 so it's only 52GB on disk. That should not be happening for video files.

I can only agree, and it matches the 3:1 ratio we both observed when changing the ffmpeg settings. I compressed (with default Zip settings on Mac) the "big" video I previously obtained with the current scraper ffmpeg settings, and I confirm it compresses very well (again a 3:1 ratio, going from 2.7MB to 868MB), which shouldn't be possible for a video file.

benoit74 commented 10 months ago

Edit: 868KB, not 868MB
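
The compressibility check described in the last few comments can be repeated without a Zip tool. This is a rough sketch using Python's standard `zlib` module; the function names are hypothetical, and DEFLATE at level 6 is assumed to be close enough to default Zip settings for a sanity check. A properly encoded video should give a ratio near 1, so a ratio around 3 points at padding or other redundant data in the stream.

```python
import os
import zlib


def compression_ratio(data, level=6):
    """Return original size divided by zlib-compressed size for a bytes object."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(data) / len(zlib.compress(data, level))


def file_compression_ratio(path, chunk_size=1 << 20, level=6):
    """Stream a file through zlib and return its original:compressed size ratio."""
    compressor = zlib.compressobj(level)
    packed = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            packed += len(compressor.compress(chunk))
    packed += len(compressor.flush())
    # The zlib stream always has a header and checksum, so packed is never 0.
    return os.path.getsize(path) / packed
```

Running `file_compression_ratio` on an output of the current settings and on an output of the suggested settings would show whether the second one is closer to 1.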