openzim / kolibri

Convert a Kolibri channel in ZIM file(s)
GNU General Public License v3.0

The scraper is forking too many FFmpeg instances #83

benoit74 closed 4 months ago

benoit74 commented 5 months ago

It looks like the --threads parameter is not doing its intended job, or at least the scraper is starting way too many threads.

At least one of our worker contributors is reporting that khan_academy jobs are using 400 encoding threads on their machine because:

benoit74 commented 5 months ago

The number of video conversion processes is controlled by the --processes parameter.

When this parameter is not set (which is the case in khanacademy recipes) it defaults to:

The problem is that the detection of "scraper is running inside a Docker container" is done by the is_running_inside_container function, and this is not working (probably not anymore).

(kolibri2zim) ➜  kolibri git:(fix_video_caching) ✗ docker run --rm -it ghcr.io/openzim/kolibri:1.1.0 /bin/bash
root@2af773717bff:/output# python
Python 3.11.4 (main, Jul  4 2023, 05:25:16) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from kolibri2zim.constants import is_running_inside_container
>>> is_running_inside_container()
False
>>> 
root@2af773717bff:/output# cat /proc/self/cgroup
0::/
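To illustrate why a cgroup-based check can come back False here, below is a minimal sketch. It assumes the detection looks for a container runtime name in /proc/self/cgroup; the thread does not show the actual implementation of is_running_inside_container, so the function names and marker strings are hypothetical. With a single `0::/` line (which is what a cgroup v2 hierarchy typically reports), there is nothing to match, so the sketch adds a fallback on the /.dockerenv file that Docker creates inside its containers. That fallback is Docker-specific and would not cover other runtimes.

```python
from pathlib import Path

# Hypothetical marker strings; the real function may look for something else.
CGROUP_MARKERS = ("docker", "kubepods", "containerd")


def cgroup_mentions_container(cgroup_text: str) -> bool:
    """Return True if any cgroup path mentions a known container runtime."""
    return any(marker in cgroup_text for marker in CGROUP_MARKERS)


def looks_like_container(cgroup_text: str, dockerenv_exists: bool) -> bool:
    """Combine the cgroup heuristic with the /.dockerenv marker file."""
    return dockerenv_exists or cgroup_mentions_container(cgroup_text)


if __name__ == "__main__":
    cgroup_file = Path("/proc/self/cgroup")
    text = cgroup_file.read_text() if cgroup_file.exists() else ""
    print(looks_like_container(text, Path("/.dockerenv").exists()))
```

Fed the `0::/` content shown above, the cgroup heuristic alone finds no marker, which matches the False result in the session.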

As far as I've understood, we have two parallel processing queues working together:

The idea was probably (and it makes sense) that video processing is mostly CPU-bound while node processing (without video conversion) is mostly IO-bound; it hence makes sense to decouple the two.

However, since we know that ffmpeg is already forking as many threads as possible (at least as far as we've managed to constrain it; to be tested again) and since we are usually running this scraper in the Zimfarm, I see no reason to have a --processes default other than 1.

Advanced users running the scraper on their own infrastructure and wanting more video conversion speed could tweak the --processes parameter if they wish.
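As a rough sketch of that proposal, the snippet below declares a --processes option that defaults to 1 and shows how the number of encoding threads multiplies when several conversion processes each start their own ffmpeg. The parser name and the helper are hypothetical and are not taken from the kolibri2zim code base; only the --processes flag name comes from this thread.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: only --processes is modelled here."""
    parser = argparse.ArgumentParser(prog="kolibri2zim-sketch")
    parser.add_argument(
        "--processes",
        type=int,
        default=1,  # proposed default: a single video conversion process
        help="number of parallel video conversion processes",
    )
    return parser


def concurrent_encoding_threads(processes: int, threads_per_ffmpeg: int) -> int:
    """Each conversion process runs one ffmpeg, each with its own threads."""
    return processes * threads_per_ffmpeg


if __name__ == "__main__":
    args = build_parser().parse_args()
    # threads_per_ffmpeg=8 is an arbitrary illustrative value.
    print(concurrent_encoding_threads(args.processes, threads_per_ffmpeg=8))
```

With the default of 1, the total stays at whatever a single ffmpeg instance spawns; passing a larger --processes value scales it linearly.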

rgaudin commented 5 months ago

Makes sense. I'd love if we could have an explicit ffmpeg flag that indicates how we expect it to behave. Cause we're relying on default behavior here AFAIK
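One way to stop relying on the default could be to pass ffmpeg's -threads option explicitly. The sketch below only builds the command line; the helper name and file names are hypothetical, this is not how python-scraperlib invokes ffmpeg, and whether -threads fully constrains every encoder would still need testing.

```python
def build_ffmpeg_command(src: str, dst: str, threads: int) -> list[str]:
    """Build an ffmpeg invocation with an explicit thread count (sketch)."""
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return [
        "ffmpeg",
        "-y",  # overwrite the output file without prompting
        "-i", src,
        "-threads", str(threads),  # placed after -i, so it applies to the output
        dst,
    ]


if __name__ == "__main__":
    # File names are placeholders for illustration only.
    print(" ".join(build_ffmpeg_command("input.mp4", "output.webm", 1)))
```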

benoit74 commented 5 months ago

I opened an issue about this ffmpeg threading issue: https://github.com/openzim/python-scraperlib/issues/123