openzim / sotoki

StackExchange websites to ZIM scraper
https://library.kiwix.org/?category=stack_exchange
GNU General Public License v3.0

Since 2.1.3 / October 2024, many recipes are exiting with Docker exit code 137 (memory exhausted) #325

Open benoit74 opened 3 weeks ago

benoit74 commented 3 weeks ago

Not sure there is much we can do; maybe this is linked to the move to the "non-slim" Docker image? Or did anything else change in the environment?

See e.g. https://farm.openzim.org/pipeline/b04c3e6f-ded2-47e7-84f7-bbac8def6a8e and https://farm.openzim.org/pipeline/6e227685-1dbf-4399-90b5-10d73abb81cb and https://farm.openzim.org/pipeline/2b6fdba8-1f72-4802-9b86-13eede679968

rgaudin commented 3 weeks ago

What's the size delta in the biggest XML file between the previously successful runs and the now-crashing ones?

benoit74 commented 3 weeks ago

It is the same XML due to https://github.com/openzim/zimfarm/issues/1041

benoit74 commented 3 weeks ago

Oh no, maybe it is not the same XML. Recipes ran on the 7th of May and the dumps in S3 are dated the 15th of May. How do I get the previous size, since the files are gone, and the logs as well?

rgaudin commented 3 weeks ago

I feel we should not look for an external cause, although one will most likely present itself.

benoit74 commented 3 weeks ago

Thank you, I've opened https://github.com/openzim/sotoki/issues/326

benoit74 commented 3 weeks ago

> RAM-hungry step was passed already

Which is the RAM-hungry step?

rgaudin commented 3 weeks ago

That's preparation.py; it's mostly (purposely) done via other tools started as subprocesses, manipulating the XML files.
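
For readers unfamiliar with that pattern, here is a minimal sketch (not sotoki's actual code) of delegating heavy XML work to an external tool via a subprocess so the Python process itself keeps a small footprint; the `xmlstarlet` invocation and the `/posts/row` path are illustrative assumptions about a StackExchange Posts.xml dump:

```python
# Minimal sketch (not sotoki's actual code): delegate heavy XML work to an
# external tool through a subprocess so the Python process stays small.
import subprocess
from pathlib import Path

def count_posts(dump_path: Path) -> int:
    result = subprocess.run(
        ["xmlstarlet", "sel", "-t", "-v", "count(/posts/row)", str(dump_path)],
        check=True,
        capture_output=True,
        text=True,
    )
    # The heavy parsing happened in the child process; only a tiny string
    # result crosses back into the Python interpreter.
    return int(result.stdout.strip())
```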

benoit74 commented 3 weeks ago

Running 3dprinting.stackexchange.com_en with the old 2.1.2 (instead of 2.1.3) succeeds with the old amount of RAM: https://farm.openzim.org/pipeline/395cdbf7-612e-4be9-b514-200f842c76a1/debug

That being said, dependencies are not pinned on sotoki, so lots of things might have changed between 2.1.2 and 2.1.3.
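
Since the images are unpinned, one low-effort way to see what actually changed is to dump the installed package versions from inside each image and diff the two lists; a minimal sketch (run once in each container):

```python
# Minimal sketch: list installed package versions so the 2.1.2 and 2.1.3
# images can be compared with a plain `diff` of the two outputs.
from importlib.metadata import distributions

for dist in sorted(distributions(), key=lambda d: (d.metadata["Name"] or "").lower()):
    print(f"{dist.metadata['Name']}=={dist.version}")
```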

benoit74 commented 3 weeks ago

I ran or.stackexchange.com_en on my machine and it is very strange.

I started by running the 2.1.2 image, with top running inside the container.

While memory usage at the beginning was quite moderate, once the scraper started to process Questions I saw very high memory usage, up to 2.2G at the end of the crawl (or maybe even higher, but I didn't see that):

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    356 root      20   0    6.2g   2.2g   0.0g S   0.9   3.5   4:01.28 sotoki
      1 root      20   0    0.0g   0.0g   0.0g S   0.0   0.0   0:00.02 bash
      9 root      20   0    0.1g   0.0g   0.0g S   0.0   0.0   0:01.30 redis-server
    364 root      20   0    0.0g   0.0g   0.0g R   0.0   0.0   0:00.04 top

I then set up instrumentation on my machine, and ran 2.1.2 again and then 2.1.3. In both cases, memory usage was very comparable, about 450M at peak. Below, 2.1.2 is in green and 2.1.3 is in blue.

[Image: memory-usage graph comparing 2.1.2 (green) and 2.1.3 (blue)]
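
The instrumentation itself isn't shown in the thread; a minimal sketch of one way to produce such a graph, assuming psutil is available in the container, is to sample the scraper's RSS periodically and plot the resulting CSV:

```python
# Minimal sketch (assumes psutil is installed): sample the RSS of a target
# process every few seconds and write a CSV that can be plotted afterwards.
import csv
import sys
import time

import psutil

def sample_rss(pid: int, out_path: str, interval: float = 5.0) -> None:
    proc = psutil.Process(pid)
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["timestamp", "rss_bytes"])
        while True:
            try:
                rss = proc.memory_info().rss
            except psutil.NoSuchProcess:
                break  # target exited, stop sampling
            writer.writerow([time.time(), rss])
            fh.flush()
            time.sleep(interval)

if __name__ == "__main__":
    # usage: python sample_rss.py <pid> <output.csv>
    sample_rss(int(sys.argv[1]), sys.argv[2])
```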

The only remark is that 2.1.3 seems to run a little bit faster, but that's probably not a big deal.

This means I did not reproduce what I saw the first time I ran 2.1.2 and observed the container processes with top.

I started 2.1.2 again with top running inside the container, and I got the same result as the benchmarking graph.

I started the recipe again on the same worker (athena18) on which it failed before, with the recipe unmodified, and it worked well: https://farm.openzim.org/pipeline/74c50440-841c-40bf-acd2-a75652bcc4c0

So it looks like there is some environmental factor in the expression of this issue. I will continue investigating.

benoit74 commented 3 weeks ago

The problem of the first run leaking memory was reproduced with 2 successive runs of 2.1.2 on windowsphone.stackexchange.com_en on my machine:

[Image: memory-usage graph of two successive 2.1.2 runs of windowsphone.stackexchange.com_en]

This confirms that:

rgaudin commented 3 weeks ago

Thank you @benoit74; investigating a memory leak with Python is difficult. From my experience, it requires extreme rigor and documentation so that apples can be compared to apples as much as possible.

You're lucky you have both working and leaking scenarios in different images. I suggest you bisect the changes and test to find the culprit change(s). I'd start with reverting the dependencies update.
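
Alongside bisecting, a generic way to localize Python-level allocation growth between two points of a run is tracemalloc snapshot diffing (it will not show native or subprocess memory); a minimal sketch, not specific to sotoki:

```python
# Minimal generic sketch: diff two tracemalloc snapshots to see which source
# lines grew the most. Only Python-level allocations are visible here; native
# extensions and subprocesses will not show up.
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation

before = tracemalloc.take_snapshot()

# ... run the suspected-leaky phase here (e.g. process a batch of questions) ...

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:15]:
    print(stat)  # top allocation growth, grouped by source line
```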

benoit74 commented 3 weeks ago

I probably nailed down the problem: in the first run, we download / resize / upload to the S3 cache many pictures. In subsequent runs, we only download the pictures from the cache.

I just ran vegetarianism.stackexchange.com again:

So something is leaking memory in this async execution.
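
As a purely hypothetical illustration (this is not sotoki's implementation), a classic failure mode in that kind of pipeline is holding too many image payloads in flight at once; bounding concurrency with a semaphore keeps peak memory proportional to the limit rather than to the number of pictures. The helpers below are stubs standing in for the real network/S3 work:

```python
# Hypothetical asyncio sketch (not sotoki's code): bound how many image
# payloads are in flight at once, so a first-run download/resize/upload path
# cannot hold every picture in memory simultaneously.
import asyncio

async def download(url: str) -> bytes:
    await asyncio.sleep(0)          # stub for the HTTP fetch
    return b"original image bytes"

async def resize(data: bytes) -> bytes:
    await asyncio.sleep(0)          # stub for the resize step
    return data[: len(data) // 2]

async def upload_to_cache(url: str, data: bytes) -> None:
    await asyncio.sleep(0)          # stub for the S3 upload

async def process_image(url: str, sem: asyncio.Semaphore) -> None:
    async with sem:
        data = await download(url)
        await upload_to_cache(url, await resize(data))
        # both payloads go out of scope here, so only a bounded number of
        # pictures are ever held in memory at the same time

async def process_all(urls: list[str], limit: int = 10) -> None:
    sem = asyncio.Semaphore(limit)  # assumed limit; tune to available RAM
    await asyncio.gather(*(process_image(u, sem) for u in urls))

if __name__ == "__main__":
    asyncio.run(process_all([f"https://example.com/img{i}.png" for i in range(100)]))
```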

benoit74 commented 3 weeks ago

What I've found so far: