benoit74 opened this issue 3 weeks ago
What's the size delta of the biggest XML file between the previously successful run and the now-crashing one?
It is the same XML due to https://github.com/openzim/zimfarm/issues/1041
Oh no, maybe it is not the same XML. The recipes ran on the 7th of May and the dumps in S3 are dated the 15th of May. How do I get the previous size, since the files are gone, and the logs as well?
I feel we should not look for an external cause, although one will most likely present itself.
The following appears in the logs:

429 Client Error: Too Many Requests for url: https://i.sstatic.net

A ticket should probably be opened about that. Exceptions in threads can lead to terrible consequences; probably an area to check.

Thank you, I've opened https://github.com/openzim/sotoki/issues/326
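To illustrate the point above about exceptions in threads, here is a minimal sketch (not sotoki code, just the general failure mode): an exception raised in a worker thread dies with that thread and never reaches the main flow unless something like `threading.excepthook` is installed.

```python
import threading

def flaky_worker():
    # Stand-in for a worker hitting an HTTP 429: without special handling,
    # this exception kills only this thread and is easy to miss.
    raise RuntimeError("429 Client Error: Too Many Requests")

def log_thread_failure(args):
    # Called for any unhandled exception in a thread (Python 3.8+).
    print(f"thread {args.thread.name} died: {args.exc_type.__name__}: {args.exc_value}")

threading.excepthook = log_thread_failure

worker = threading.Thread(target=flaky_worker, name="image-worker")
worker.start()
worker.join()
print("main thread keeps going as if nothing happened")
```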
The RAM-hungry step was already passed.
Which is the RAM-hungry step?
That's preparation.py; it's mostly (purposely) done via other tools started as subprocesses, manipulating the XML files.
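For illustration only (the actual tools and file names used by preparation.py differ), the idea is to hand the heavy XML work to an external tool started as a subprocess and stream its output, so the Python process itself keeps a small footprint:

```python
import subprocess

# Hypothetical example: count <row elements in an XML dump with an external
# tool (plain grep as a stand-in) instead of parsing the file in Python.
dump_path = "Posts.xml"  # assumed file name, for illustration

with subprocess.Popen(
    ["grep", "-c", "<row", dump_path],
    stdout=subprocess.PIPE,
    text=True,
) as proc:
    output, _ = proc.communicate()

print(f"rows in {dump_path}: {output.strip()}")
```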
Running 3dprinting.stackexchange.com_en with the old 2.1.2 (instead of 2.1.3) succeeds with the old amount of RAM: https://farm.openzim.org/pipeline/395cdbf7-612e-4be9-b514-200f842c76a1/debug
That being said, dependencies are not pinned in sotoki, so lots of things might have changed between 2.1.2 and 2.1.3.
I ran or.stackexchange.com_en on my machine and it is very strange.
I started by running the 2.1.2 image, with top running inside the container.
While memory usage at the beginning was quite moderate, when the scraper started to process Questions I saw very high memory usage, up to 2.2G at the end of the crawl (or maybe even higher, but I didn't see it):
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  356 root      20   0    6.2g   2.2g   0.0g S   0.9   3.5   4:01.28 sotoki
    1 root      20   0    0.0g   0.0g   0.0g S   0.0   0.0   0:00.02 bash
    9 root      20   0    0.1g   0.0g   0.0g S   0.0   0.0   0:01.30 redis-server
  364 root      20   0    0.0g   0.0g   0.0g R   0.0   0.0   0:00.04 top
```
I then set up instrumentation on my machine and ran 2.1.2 again, then 2.1.3. In both cases, memory usage was very comparable, about 450M at peak. Below, 2.1.2 is in green and 2.1.3 is in blue.
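For reference, the instrumentation is nothing fancy: it boils down to sampling the scraper's RSS at a regular interval and plotting the resulting CSV afterwards. A minimal sketch with psutil (the actual setup may differ):

```python
import csv
import time

import psutil  # third-party: pip install psutil

def sample_rss(pid: int, out_csv: str, interval: float = 5.0) -> None:
    """Append (timestamp, RSS in MiB) samples for `pid` until it exits."""
    proc = psutil.Process(pid)
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["timestamp", "rss_mib"])
        while True:
            try:
                rss_mib = proc.memory_info().rss / (1024 * 1024)
            except psutil.NoSuchProcess:
                break  # the monitored process has exited
            writer.writerow([round(time.time(), 1), round(rss_mib, 1)])
            fh.flush()
            time.sleep(interval)

# Usage, with a hypothetical PID of the sotoki process:
# sample_rss(356, "sotoki-2.1.2-rss.csv")
```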
The only remark is that 2.1.3 seems to run a little bit faster, but that's probably not a big deal.
This means I did not reproduce what I saw the first time I ran 2.1.2 and observed the container processes with top.
I started 2.1.2 again with top running inside the container, and I got the same result as the benchmarking graph.
I started the recipe again on the same worker athena18 on which it failed before, with the unmodified recipe, and it worked well: https://farm.openzim.org/pipeline/74c50440-841c-40bf-acd2-a75652bcc4c0
So it looks like there is some environmental factor in the expression of this issue. I will continue investigating.
The problem of the first run leaking memory is reproduced with two successive runs of 2.1.2 on windowsphone.stackexchange.com_en on my machine:
This confirms that the leak happens on the first run and is not specific to 2.1.3.
Thank you @benoit74; investigating a memory leak with Python is difficult. From my experience, it requires extreme rigor and documentation so that apples can be compared to apples as much as possible.
You're lucky you have both working and leaking scenarios in different images. I suggest you bisect the changes and test to find the culprit change(s). I'd start by reverting the dependencies update.
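Since dependencies are not pinned, a cheap first step for that bisection could be to dump the installed package versions inside each image and diff the two outputs; a small sketch, assuming Python is run inside the 2.1.2 and then the 2.1.3 image:

```python
from importlib import metadata

# Print name==version for every installed distribution; run once per image
# and diff the outputs to spot which (transitive) dependencies changed.
for dist in sorted(metadata.distributions(), key=lambda d: d.metadata["Name"].lower()):
    print(f"{dist.metadata['Name']}=={dist.version}")
```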
I probably nailed down the problem: in the first run, we download / resize / upload many pictures to the S3 cache. In subsequent runs, we only download the pictures from the cache.
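Schematically, the flow is something like the sketch below (a simplified illustration, not the actual sotoki code; the real scraper uses an S3 bucket rather than a dict, and its own image optimization):

```python
import hashlib
import io

import requests
from PIL import Image  # Pillow

def get_image(url: str, cache: dict, target_size: int = 540) -> bytes:
    """First run: download, resize, upload to cache. Later runs: cache hit only."""
    key = hashlib.sha256(url.encode()).hexdigest()

    if key in cache:
        # Subsequent runs: a single download from the cache, no processing.
        return cache[key]

    # First run: fetch the original, resize it, then store it in the cache.
    original = requests.get(url, timeout=30).content
    with Image.open(io.BytesIO(original)) as img:
        img.thumbnail((target_size, target_size))
        buffer = io.BytesIO()
        img.save(buffer, format="WEBP")
    resized = buffer.getvalue()
    cache[key] = resized
    return resized
```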
I just ran vegetarianism.stackexchange.com again; the log shows:

Resize Error for ...: 'Image is too small, Image size : xxx, Required size : 540'

So something is leaking memory in this async execution.
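One way to narrow down what is retained during that first-run image processing would be to snapshot allocations with tracemalloc around the suspicious phase and compare the snapshots; a minimal sketch (where exactly to place it in sotoki is left open):

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation

before = tracemalloc.take_snapshot()

# ... run the suspicious phase here, e.g. the image download/resize/upload ...

after = tracemalloc.take_snapshot()

# Show the 10 call sites whose retained memory grew the most.
for stat in after.compare_to(before, "traceback")[:10]:
    print(stat)
    for line in stat.traceback.format()[-3:]:
        print("   ", line)
```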
What I've found so far:
Not sure there is much to do; maybe this is linked to the move to the "non-slim" Docker image? Or has anything else changed in the environment?
See e.g. https://farm.openzim.org/pipeline/b04c3e6f-ded2-47e7-84f7-bbac8def6a8e and https://farm.openzim.org/pipeline/6e227685-1dbf-4399-90b5-10d73abb81cb and https://farm.openzim.org/pipeline/2b6fdba8-1f72-4802-9b86-13eede679968