benoit74 opened this issue 3 weeks ago
What's the size delta of the biggest XML file between the previously successful run and the now-crashing one?
It is the same XML due to https://github.com/openzim/zimfarm/issues/1041
Oh no, maybe it is not the same XML. The recipes ran on the 7th of May and the dumps in S3 are dated the 15th of May. How do I get the previous size, since the files are gone, and the logs as well?
I feel we should not look for an external cause, although one will most likely present itself.
The following appears in the logs:

429 Client Error: Too Many Requests for url: https://i.sstatic.net

A ticket should probably be opened about that. Exceptions in threads can lead to terrible consequences; probably an area to check.

Thank you, I've opened https://github.com/openzim/sotoki/issues/326
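To illustrate the point above about exceptions in threads, here is a minimal sketch (not sotoki code, just the general failure mode): an exception raised in a worker thread dies with that thread and never reaches the main flow unless something like `threading.excepthook` is installed.

```python
import threading

def flaky_worker():
    # Stand-in for a worker hitting an HTTP 429: without special handling,
    # this exception kills only this thread and is easy to miss.
    raise RuntimeError("429 Client Error: Too Many Requests")

def log_thread_failure(args):
    # Called for any unhandled exception in a thread (Python 3.8+).
    print(f"thread {args.thread.name} died: {args.exc_type.__name__}: {args.exc_value}")

threading.excepthook = log_thread_failure

worker = threading.Thread(target=flaky_worker, name="image-worker")
worker.start()
worker.join()
print("main thread keeps going as if nothing happened")
```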
The RAM-hungry step was already passed.
Which is the RAM-hungry step?
That's preparation.py; it's mostly (purposely) done via other tools started as subprocesses, manipulating the XML files.
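For illustration only (the actual tools and file names used by preparation.py differ), the idea is to hand the heavy XML work to an external tool started as a subprocess and stream its output, so the Python process itself keeps a small footprint:

```python
import subprocess

# Hypothetical example: count <row elements in an XML dump with an external
# tool (plain grep as a stand-in) instead of parsing the file in Python.
dump_path = "Posts.xml"  # assumed file name, for illustration

with subprocess.Popen(
    ["grep", "-c", "<row", dump_path],
    stdout=subprocess.PIPE,
    text=True,
) as proc:
    output, _ = proc.communicate()

print(f"rows in {dump_path}: {output.strip()}")
```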
Running 3dprinting.stackexchange.com_en with the old 2.1.2 (instead of 2.1.3) succeeds with the old amount of RAM: https://farm.openzim.org/pipeline/395cdbf7-612e-4be9-b514-200f842c76a1/debug
That being said, dependencies are not pinned in sotoki, so lots of things might have changed between 2.1.2 and 2.1.3.
I ran or.stackexchange.com_en on my machine and it is very strange.
I started by running the 2.1.2 image, with top running inside the container.
While memory usage at the beginning was quite moderate, when the scraper started to process Questions I saw very high memory usage, up to 2.2G at the end of the crawl (or maybe even higher, but I didn't see it):
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  356 root      20   0    6.2g   2.2g   0.0g S   0.9   3.5   4:01.28 sotoki
    1 root      20   0    0.0g   0.0g   0.0g S   0.0   0.0   0:00.02 bash
    9 root      20   0    0.1g   0.0g   0.0g S   0.0   0.0   0:01.30 redis-server
  364 root      20   0    0.0g   0.0g   0.0g R   0.0   0.0   0:00.04 top
```
I then set up instrumentation on my machine and ran 2.1.2 again, then 2.1.3. In both cases, memory usage was very comparable, about 450M at peak. Below, 2.1.2 is in green and 2.1.3 is in blue.
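For reference, the instrumentation is nothing fancy: it boils down to sampling the scraper's RSS at a regular interval and plotting the resulting CSV afterwards. A minimal sketch with psutil (the actual setup may differ):

```python
import csv
import time

import psutil  # third-party: pip install psutil

def sample_rss(pid: int, out_csv: str, interval: float = 5.0) -> None:
    """Append (timestamp, RSS in MiB) samples for `pid` until it exits."""
    proc = psutil.Process(pid)
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["timestamp", "rss_mib"])
        while True:
            try:
                rss_mib = proc.memory_info().rss / (1024 * 1024)
            except psutil.NoSuchProcess:
                break  # the monitored process has exited
            writer.writerow([round(time.time(), 1), round(rss_mib, 1)])
            fh.flush()
            time.sleep(interval)

# Usage, with a hypothetical PID of the sotoki process:
# sample_rss(356, "sotoki-2.1.2-rss.csv")
```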
The only remark is that 2.1.3 seems to run a little bit faster, but that's probably not a big deal.
This means I did not reproduce what I saw the first time I ran 2.1.2 and observed the container processes with top.
I started 2.1.2 again with top running inside the container, and I got the same result as the benchmarking graph.
I started the recipe again on the same worker athena18 on which it failed before, with the unmodified recipe, and it worked well: https://farm.openzim.org/pipeline/74c50440-841c-40bf-acd2-a75652bcc4c0
So it looks like there is some environmental factor in the expression of this issue. I will continue investigating.
The problem of the first run leaking memory is reproduced with two successive runs of 2.1.2 on windowsphone.stackexchange.com_en on my machine:
This confirms that the leak happens on the first run and is not specific to 2.1.3.
Thank you @benoit74; investigating a memory leak with Python is difficult. From my experience, it requires extreme rigor and documentation so that apples can be compared to apples as much as possible.
You're lucky you have both working and leaking scenarios in different images. I suggest you bisect the changes and test to find the culprit change(s). I'd start by reverting the dependencies update.
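Since dependencies are not pinned, a cheap first step for that bisection could be to dump the installed package versions inside each image and diff the two outputs; a small sketch, assuming Python is run inside the 2.1.2 and then the 2.1.3 image:

```python
from importlib import metadata

# Print name==version for every installed distribution; run once per image
# and diff the outputs to spot which (transitive) dependencies changed.
for dist in sorted(metadata.distributions(), key=lambda d: d.metadata["Name"].lower()):
    print(f"{dist.metadata['Name']}=={dist.version}")
```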
I probably nailed down the problem: in the first run, we download / resize / upload many pictures to the S3 cache. In subsequent runs, we only download the pictures from the cache.
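Schematically, the flow is something like the sketch below (a simplified illustration, not the actual sotoki code; the real scraper uses an S3 bucket rather than a dict, and its own image optimization):

```python
import hashlib
import io

import requests
from PIL import Image  # Pillow

def get_image(url: str, cache: dict, target_size: int = 540) -> bytes:
    """First run: download, resize, upload to cache. Later runs: cache hit only."""
    key = hashlib.sha256(url.encode()).hexdigest()

    if key in cache:
        # Subsequent runs: a single download from the cache, no processing.
        return cache[key]

    # First run: fetch the original, resize it, then store it in the cache.
    original = requests.get(url, timeout=30).content
    with Image.open(io.BytesIO(original)) as img:
        img.thumbnail((target_size, target_size))
        buffer = io.BytesIO()
        img.save(buffer, format="WEBP")
    resized = buffer.getvalue()
    cache[key] = resized
    return resized
```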
I just ran vegetarianism.stackexchange.com again; the log shows:

Resize Error for ...: 'Image is too small, Image size : xxx, Required size : 540'

So something is leaking memory in this async execution.
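One way to narrow down what is retained during that first-run image processing would be to snapshot allocations with tracemalloc around the suspicious phase and compare the snapshots; a minimal sketch (where exactly to place it in sotoki is left open):

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation

before = tracemalloc.take_snapshot()

# ... run the suspicious phase here, e.g. the image download/resize/upload ...

after = tracemalloc.take_snapshot()

# Show the 10 call sites whose retained memory grew the most.
for stat in after.compare_to(before, "traceback")[:10]:
    print(stat)
    for line in stat.traceback.format()[-3:]:
        print("   ", line)
```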
What I've found so far:
Not sure there is much to do; maybe this is linked to the move to the "non-slim" Docker image? Or has anything else changed in the environment?
See e.g. https://farm.openzim.org/pipeline/b04c3e6f-ded2-47e7-84f7-bbac8def6a8e and https://farm.openzim.org/pipeline/6e227685-1dbf-4399-90b5-10d73abb81cb and https://farm.openzim.org/pipeline/2b6fdba8-1f72-4802-9b86-13eede679968