openzim / zimfarm

Farm operated by bots to grow and harvest new zim files
https://farm.openzim.org
GNU General Public License v3.0

Does the Zimfarm fail to run maxi recipes? #887

Open Popolechien opened 11 months ago

Popolechien commented 11 months ago

I just closed https://farm.openzim.org/pipeline/256727ae-8829-4525-82bc-53b096a890fc that had been running for 9 days.

I see however that the mini and nopic ZIMs had been completed after a mere 6 to 10 hours. Only the maxi is missing, but 8 more days to produce it seems a bit excessive (ms.wikipedia has only 368,000 articles).

Is there any chance this could be related to https://phabricator.wikimedia.org/T349972, which seems to create massive files?

benoit74 commented 11 months ago

On this occurrence, it looks like the michaelblob worker suddenly stopped communicating with the Zimfarm around
2023-12-04T20:43:29, most probably while it was building the maxi ZIM.

Last messages of michaelblob on 2023-12-04:

2023-12-04T20:37:14.286106329Z stdout F 100.64.6.175 - - [04/Dec/2023:20:37:14 +0000] "GET /v1/requested-tasks/worker?worker=michaelblob&avail_cpu=12&avail_memory=195421011968&avail_disk=28593824202752 HTTP/1.1" 200 79 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:37:53.961086678Z stdout F 100.64.6.175 - - [04/Dec/2023:20:37:53 +0000] "PATCH /v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:37:55.829350404Z stdout F 100.64.6.175 - - [04/Dec/2023:20:37:55 +0000] "PATCH /v1/tasks/2436c25b-d62e-406a-a7c1-cbd082fec32e HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:37:57.613751258Z stdout F 100.64.6.175 - - [04/Dec/2023:20:37:57 +0000] "PATCH /v1/tasks/256727ae-8829-4525-82bc-53b096a890fc HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:38:01.337622114Z stdout F 100.64.6.175 - - [04/Dec/2023:20:38:01 +0000] "PATCH /v1/tasks/602c7953-6b29-4b90-a7f4-90975a111891 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:38:54.27636749Z stdout F 100.64.6.175 - - [04/Dec/2023:20:38:54 +0000] "PATCH /v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:38:56.309537159Z stdout F 100.64.6.175 - - [04/Dec/2023:20:38:56 +0000] "PATCH /v1/tasks/2436c25b-d62e-406a-a7c1-cbd082fec32e HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:38:58.06076615Z stdout F 100.64.6.175 - - [04/Dec/2023:20:38:58 +0000] "PATCH /v1/tasks/256727ae-8829-4525-82bc-53b096a890fc HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:39:01.598918797Z stdout F 100.64.6.175 - - [04/Dec/2023:20:39:01 +0000] "PATCH /v1/tasks/602c7953-6b29-4b90-a7f4-90975a111891 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:39:54.710507537Z stdout F 100.64.6.175 - - [04/Dec/2023:20:39:54 +0000] "PATCH /v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:39:56.639187979Z stdout F 100.64.6.175 - - [04/Dec/2023:20:39:56 +0000] "PATCH /v1/tasks/2436c25b-d62e-406a-a7c1-cbd082fec32e HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:39:58.346728862Z stdout F 100.64.6.175 - - [04/Dec/2023:20:39:58 +0000] "PATCH /v1/tasks/256727ae-8829-4525-82bc-53b096a890fc HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:02.084086982Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:02 +0000] "PATCH /v1/tasks/602c7953-6b29-4b90-a7f4-90975a111891 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:14.818090635Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:14 +0000] "GET /v1/tasks/256727ae-8829-4525-82bc-53b096a890fc HTTP/1.1" 200 39348 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:15.309100132Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:15 +0000] "GET /v1/tasks/602c7953-6b29-4b90-a7f4-90975a111891 HTTP/1.1" 200 107758 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:15.889824038Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:15 +0000] "GET /v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075 HTTP/1.1" 200 73967 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:16.343760886Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:16 +0000] "GET /v1/tasks/2436c25b-d62e-406a-a7c1-cbd082fec32e HTTP/1.1" 200 66146 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:17.370361096Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:17 +0000] "GET /v1/requested-tasks/worker?worker=michaelblob&avail_cpu=12&avail_memory=195421011968&avail_disk=28593298997248 HTTP/1.1" 200 79 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:55.897426554Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:55 +0000] "PATCH /v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:57.833398047Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:57 +0000] "PATCH /v1/tasks/2436c25b-d62e-406a-a7c1-cbd082fec32e HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:41:06.72435628Z stdout F 100.64.6.175 - - [04/Dec/2023:20:41:06 +0000] "PATCH /v1/tasks/602c7953-6b29-4b90-a7f4-90975a111891 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:41:56.611895368Z stdout F 100.64.6.175 - - [04/Dec/2023:20:41:56 +0000] "PATCH /v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:43:18.017879463Z stdout F 100.64.6.175 - - [04/Dec/2023:20:43:18 +0000] "GET /v1/tasks/256727ae-8829-4525-82bc-53b096a890fc HTTP/1.1" 200 39348 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:43:20.62610347Z stdout F 100.64.6.175 - - [04/Dec/2023:20:43:20 +0000] "GET /v1/tasks/602c7953-6b29-4b90-a7f4-90975a111891 HTTP/1.1" 200 107685 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:43:24.914277854Z stdout F 100.64.6.175 - - [04/Dec/2023:20:43:24 +0000] "GET /v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075 HTTP/1.1" 200 73967 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:43:29.269860867Z stdout F 100.64.6.175 - - [04/Dec/2023:20:43:29 +0000] "GET /v1/tasks/2436c25b-d62e-406a-a7c1-cbd082fec32e HTTP/1.1" 200 66146 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:49:00.490534087Z stdout F 100.64.6.175 - - [04/Dec/2023:20:49:00 +0000] "PATCH /v1/tasks/256727ae-8829-4525-82bc-53b096a890fc HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:57:02.29148102Z stdout F 100.64.6.175 - - [04/Dec/2023:20:57:02 +0000] "PATCH /v1/tasks/2436c25b-d62e-406a-a7c1-cbd082fec32e HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"

michaelblob then restarted a bunch of times, sometimes with downtime between a stop and the following restart, and it now looks like it has not been communicating with the Zimfarm since 2023-12-08T17:26:35 (I have details about down and up times, but I don't think we care).

The task stayed running because by design we assume the worker might no longer be able to communicate with the Zimfarm while still running the task. Obviously this design has some limitations, as this occurrence shows, but it is still probably the best design in terms of resilience (it avoids failing a task just because of a transient communication issue between the worker and the Zimfarm).
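For illustration, the pattern visible in the access logs above is roughly the following (a minimal sketch, not the actual zimfarm worker code; the endpoint paths and query parameters are taken from the logs, everything else, such as the base URL, function names and intervals, is assumed):

```python
# Illustration only: a stripped-down worker polling loop matching the access
# logs above. Endpoint paths and query parameters come from the logs; the base
# URL, intervals and function names are assumptions, not zimfarm code.
import time
import requests

API = "https://api.farm.openzim.org/v1"  # assumed base URL

def poll_forever(worker: str, running_task_ids: list[str]) -> None:
    while True:
        try:
            # ask for new work, advertising available resources
            requests.get(
                f"{API}/requested-tasks/worker",
                params={"worker": worker, "avail_cpu": 12,
                        "avail_memory": 195421011968,
                        "avail_disk": 28593824202752},
                timeout=30,
            )
            # report progress on every running task (the PATCH calls in the logs)
            for task_id in running_task_ids:
                requests.patch(f"{API}/tasks/{task_id}", json={}, timeout=30)
        except requests.RequestException:
            # by design, a failed round-trip does not fail the task:
            # the worker may still be scraping even if the API is unreachable
            pass
        time.sleep(60)
```

When these calls stop arriving, the Zimfarm keeps the task in "running" state rather than failing it, which is exactly what happened here.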

benoit74 commented 11 months ago

@Popolechien did you experience the issue on another task, or can you confirm that other tasks have succeeded in creating the maxi ZIM? (It looks like many have, from my PoV, but I'm not sure I'm looking at the right place.)

Popolechien commented 11 months ago

@benoit74 there were a couple of other tasks IIRC, but I suspect/kind of remember that they were also running on michaelblob (though there may have been an mwoffliner in the mix, I can't recall for sure). Sorry I can't be more helpful.

benoit74 commented 11 months ago

Without more info, let's close it for now

Popolechien commented 11 months ago

@benoit74 You may want to look at this task: https://farm.openzim.org/pipeline/c6d98afd-4066-459b-aef4-ee61b3e1ea7c

wikipedia_en_history_nopic_2023-12.zim was completed in about an hour on mwoffliner3, yet the scraper has been running for two extra days and has yet to produce the maxi version.

benoit74 commented 11 months ago

I don't know whether this is a Zimfarm, a scraper, or a Wikimedia server issue.

Only two days ago we had this task: https://farm.openzim.org/pipeline/c566920b-5a01-4aab-a2c5-2ea0a3339733, on the same worker mwoffliner3, where the maxi was made in only 3 days while it took 7 days one month ago. And the final ZIM is 28 GB, way more than what we could expect for the maxi of wikipedia_en_history.

I really do not get where this variability might come from.

@kelson42 since you know the scraper and the servers a bit better than I do, do you have any idea of an investigation we could run? Is there any way to write timestamps in the logs so we know what is going on (it looks like the logs are not progressing anymore, but I don't know whether it has been like this for only 10 mins, 1 hour or 2 days)?

benoit74 commented 11 months ago

Note: maybe this is one more proof that we need to spend some time centralizing worker logs in Grafana as well (at least for the workers which are "under our control"); probably not a big task in terms of effort, more a question of the volume of logs which will be delivered to Grafana and which we will have to pay for...
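For what it's worth, shipping worker log lines into Grafana should mostly be a matter of batching them into a push API. The sketch below assumes a Loki backend behind Grafana (the comment above only mentions Grafana); the endpoint URL and labels are made up:

```python
# Sketch only: push a batch of worker log lines to a Grafana Loki instance via
# its HTTP push API. Assumes Loki is the log backend; URL and labels are made up.
import time
import requests

LOKI_URL = "https://loki.example.org/loki/api/v1/push"  # hypothetical endpoint

def push_lines(worker: str, lines: list[str]) -> None:
    now_ns = str(time.time_ns())  # Loki expects nanosecond timestamps as strings
    payload = {
        "streams": [{
            "stream": {"job": "zimfarm-worker", "worker": worker},
            "values": [[now_ns, line] for line in lines],
        }]
    }
    requests.post(LOKI_URL, json=payload, timeout=10).raise_for_status()
```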

benoit74 commented 11 months ago

I can now confirm this is "kinda" a Zimfarm issue.

The node process of this task is no longer consuming any CPU, only 887M of RAM, and it is impossible to strace it: strace starts to write epoll_wait(13, and then hangs.

For comparison, the node process of the other task currently running on mwoffliner3 is consuming 100% to 200% CPU (1 to 2 whole CPUs out of 8) and its strace is very chatty.

This looks like a nasty Docker / OS issue as far as I can tell for now.
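For the record, this kind of check can also be done from the host with nothing more than /proc: sample the process CPU time twice and look at the kernel wait channel. This is a hypothetical helper, not a zimfarm tool:

```python
# Hypothetical diagnostic helper (not part of zimfarm): detect a process that
# holds memory but no longer consumes CPU, like the hung node process above.
import time

def cpu_ticks(pid: int) -> int:
    # /proc/<pid>/stat: after the "(comm)" field, indexes 11 and 12 are
    # utime and stime (fields 14 and 15), expressed in clock ticks
    with open(f"/proc/{pid}/stat") as f:
        fields = f.read().rsplit(")", 1)[1].split()
    return int(fields[11]) + int(fields[12])

def looks_hung(pid: int, window: float = 60.0) -> bool:
    before = cpu_ticks(pid)
    time.sleep(window)
    after = cpu_ticks(pid)
    with open(f"/proc/{pid}/wchan") as f:
        wchan = f.read().strip()  # kernel function the process is sleeping in
    # no CPU progress over the window and parked in an epoll wait
    return after == before and "ep_poll" in wchan
```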

benoit74 commented 11 months ago

I forgot to mention that the other node process, the one which works "well", consumes 1.6G of RAM.

benoit74 commented 11 months ago

I just ran stress-ng for a few seconds in the Docker container, and it seems to be in perfect shape: the stress test consumed CPU as intended and finished properly. So maybe this looks more like a node process issue?

benoit74 commented 11 months ago

Since we have no other solution for now, please cancel these tasks and track them in this issue; we would like to better understand the impact. For convenience, please report in this issue the task URL, the recipe name, the worker, the offliner (including version), and, if the offliner is mwoffliner, the requested flavors and the flavors already produced.

I just canceled many tasks which were all in the same "shape" (on all mwoffliner workers; I can confirm that strace produces the same result for all these tasks):

| Task | Recipe Name | Worker | Offliner | Flavors requested | Flavor produced |
| --- | --- | --- | --- | --- | --- |
| c6d98afd-4066-459b-aef4-ee61b3e1ea7c | wikipedia_en_history | mwoffliner3 | mwoffliner 1.13.0 | [ "nopic:nopic", "novid:maxi" ] | nopic |
| f911818e-5048-4cd6-a974-ca698c9686d0 | wikipedia_pl_all | mwoffliner1 | mwoffliner 1.13.0 | [ "nodet,nopic:mini", "nopic:nopic", "novid:maxi" ] | mini, nopic |
| 3c622fc1-82db-45e8-8139-5a16ba6e1b1f | wikipedia_zh_all | mwoffliner2 | mwoffliner 1.13.0 | [ "nodet,nopic:mini", "nopic:nopic", "novid:maxi" ] | mini, nopic |
| 049dffe8-6513-4d79-aed1-84cffa5337fe | wikipedia_af_all | mwoffliner4 | mwoffliner 1.13.0 | [ "nopic:nopic", "novid:maxi" ] | nopic |
| 6cfa93ce-4bd3-4617-81d5-7b80f9bc8055 | wikipedia_ur_top | athena18 | mwoffliner 1.13.0 | [ "nodet,nopic:mini", "nopic:nopic", "novid:maxi" ] | mini, nopic |