Popolechien opened this issue 11 months ago
On this occurrence, it looks like the `michaelblob` worker suddenly stopped communicating with the Zimfarm around 2023-12-04T20:43:29, most probably while it was building the maxi ZIM.
Last messages of `michaelblob` on 2023-12-04:
```
2023-12-04T20:37:14.286106329Z stdout F 100.64.6.175 - - [04/Dec/2023:20:37:14 +0000] "GET /v1/requested-tasks/worker?worker=michaelblob&avail_cpu=12&avail_memory=195421011968&avail_disk=28593824202752 HTTP/1.1" 200 79 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:37:53.961086678Z stdout F 100.64.6.175 - - [04/Dec/2023:20:37:53 +0000] "PATCH /v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:37:55.829350404Z stdout F 100.64.6.175 - - [04/Dec/2023:20:37:55 +0000] "PATCH /v1/tasks/2436c25b-d62e-406a-a7c1-cbd082fec32e HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:37:57.613751258Z stdout F 100.64.6.175 - - [04/Dec/2023:20:37:57 +0000] "PATCH /v1/tasks/256727ae-8829-4525-82bc-53b096a890fc HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:38:01.337622114Z stdout F 100.64.6.175 - - [04/Dec/2023:20:38:01 +0000] "PATCH /v1/tasks/602c7953-6b29-4b90-a7f4-90975a111891 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:38:54.27636749Z stdout F 100.64.6.175 - - [04/Dec/2023:20:38:54 +0000] "PATCH /v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:38:56.309537159Z stdout F 100.64.6.175 - - [04/Dec/2023:20:38:56 +0000] "PATCH /v1/tasks/2436c25b-d62e-406a-a7c1-cbd082fec32e HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:38:58.06076615Z stdout F 100.64.6.175 - - [04/Dec/2023:20:38:58 +0000] "PATCH /v1/tasks/256727ae-8829-4525-82bc-53b096a890fc HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:39:01.598918797Z stdout F 100.64.6.175 - - [04/Dec/2023:20:39:01 +0000] "PATCH /v1/tasks/602c7953-6b29-4b90-a7f4-90975a111891 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:39:54.710507537Z stdout F 100.64.6.175 - - [04/Dec/2023:20:39:54 +0000] "PATCH /v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:39:56.639187979Z stdout F 100.64.6.175 - - [04/Dec/2023:20:39:56 +0000] "PATCH /v1/tasks/2436c25b-d62e-406a-a7c1-cbd082fec32e HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:39:58.346728862Z stdout F 100.64.6.175 - - [04/Dec/2023:20:39:58 +0000] "PATCH /v1/tasks/256727ae-8829-4525-82bc-53b096a890fc HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:02.084086982Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:02 +0000] "PATCH /v1/tasks/602c7953-6b29-4b90-a7f4-90975a111891 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:14.818090635Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:14 +0000] "GET /v1/tasks/256727ae-8829-4525-82bc-53b096a890fc HTTP/1.1" 200 39348 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:15.309100132Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:15 +0000] "GET /v1/tasks/602c7953-6b29-4b90-a7f4-90975a111891 HTTP/1.1" 200 107758 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:15.889824038Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:15 +0000] "GET /v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075 HTTP/1.1" 200 73967 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:16.343760886Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:16 +0000] "GET /v1/tasks/2436c25b-d62e-406a-a7c1-cbd082fec32e HTTP/1.1" 200 66146 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:17.370361096Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:17 +0000] "GET /v1/requested-tasks/worker?worker=michaelblob&avail_cpu=12&avail_memory=195421011968&avail_disk=28593298997248 HTTP/1.1" 200 79 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:55.897426554Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:55 +0000] "PATCH /v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:40:57.833398047Z stdout F 100.64.6.175 - - [04/Dec/2023:20:40:57 +0000] "PATCH /v1/tasks/2436c25b-d62e-406a-a7c1-cbd082fec32e HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:41:06.72435628Z stdout F 100.64.6.175 - - [04/Dec/2023:20:41:06 +0000] "PATCH /v1/tasks/602c7953-6b29-4b90-a7f4-90975a111891 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:41:56.611895368Z stdout F 100.64.6.175 - - [04/Dec/2023:20:41:56 +0000] "PATCH /v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075 HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:43:18.017879463Z stdout F 100.64.6.175 - - [04/Dec/2023:20:43:18 +0000] "GET /v1/tasks/256727ae-8829-4525-82bc-53b096a890fc HTTP/1.1" 200 39348 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:43:20.62610347Z stdout F 100.64.6.175 - - [04/Dec/2023:20:43:20 +0000] "GET /v1/tasks/602c7953-6b29-4b90-a7f4-90975a111891 HTTP/1.1" 200 107685 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:43:24.914277854Z stdout F 100.64.6.175 - - [04/Dec/2023:20:43:24 +0000] "GET /v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075 HTTP/1.1" 200 73967 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:43:29.269860867Z stdout F 100.64.6.175 - - [04/Dec/2023:20:43:29 +0000] "GET /v1/tasks/2436c25b-d62e-406a-a7c1-cbd082fec32e HTTP/1.1" 200 66146 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:49:00.490534087Z stdout F 100.64.6.175 - - [04/Dec/2023:20:49:00 +0000] "PATCH /v1/tasks/256727ae-8829-4525-82bc-53b096a890fc HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
2023-12-04T20:57:02.29148102Z stdout F 100.64.6.175 - - [04/Dec/2023:20:57:02 +0000] "PATCH /v1/tasks/2436c25b-d62e-406a-a7c1-cbd082fec32e HTTP/1.1" 204 0 "-" "python-requests/2.31.0" "70.108.9.176"
```
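For reference, the task documents behind these calls can also be fetched manually to inspect a task's last known state; a minimal sketch, assuming the API behind these logs is reachable at `api.farm.openzim.org` (the task ID is one from the log above):

```sh
# Fetch the task document the worker was repeatedly patching and pretty-print it
curl -s "https://api.farm.openzim.org/v1/tasks/bb471778-7f57-4ca3-89c9-35297098a075" | jq .
```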
`michaelblob` then restarted a bunch of times, sometimes with downtime between a stop and the following restart, and it now looks like it has not been communicating with the Zimfarm since 2023-12-08T17:26:35 (I have details about down and up times, but I don't think we care).
The task stayed running because, by design, we assume the worker might no longer be able to communicate with the Zimfarm while still running the task. Obviously this design has some limitations, as this occurrence shows, but it is still probably the best design in terms of resilience (it avoids failing a task just because of a transient communication issue between the worker and the Zimfarm).
@Popolechien did you experience the issue on another task, or could you confirm that other tasks have succeeded in creating the maxi ZIM? (It looks like many have from my PoV, but I'm not sure I'm looking at the right place.)
@benoit74 there were a couple other tasks IIRC, but I suspect/kind of remember that they were also running on Michaelblob (though there may have been an mwoffliner in the mix, can't recall for sure). Sorry I can't be more helpful.
Without more info, let's close it for now
@benoit74 You may want to look at this task: https://farm.openzim.org/pipeline/c6d98afd-4066-459b-aef4-ee61b3e1ea7c
`wikipedia_en_history_nopic_2023-12.zim` was completed in about an hour on mwoffliner3, yet the scraper has been running for two extra days and has yet to produce the maxi version.
I don't know whether this is a Zimfarm, a scraper, or a Wikimedia server issue.
Only two days ago, we had this task: https://farm.openzim.org/pipeline/c566920b-5a01-4aab-a2c5-2ea0a3339733. On the same worker mwoffliner3, we made the maxi in only 3 days, while it took 7 days one month ago. And the final ZIM is 28 GB, way more than what we would expect for the maxi of wikipedia_en_history.
I really do not get where this variability might come from.
@kelson42 since you know the scraper and the servers a bit better than I do, do you have any idea of an investigation we could run? Is there any way to write timestamps in the logs so we know what is going on? (It looks like the logs are no longer progressing, but I don't know whether this has been the case for only 10 minutes, 1 hour, or 2 days.)
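As a side note, even without scraper-side timestamps, a quick check on the worker host can tell whether the logs are still progressing (a rough sketch; the container name filter is only a guess):

```sh
# Find the task's container (the name filter is a guess)
docker ps --filter "name=mwoffliner"
# Docker prefixes each line with its own ingestion timestamp, independently of what the scraper prints
docker logs --timestamps --since 1h <container_id> | tail -n 20
```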
Note: maybe this is one more proof that we need to spend some time centralizing worker logs in Grafana as well (at least for the ones which are "under our control"); probably not a big task in terms of effort, more a question of the volume of logs that will be delivered to Grafana and that we will have to pay for...
I can now confirm this is "kinda" a Zimfarm issue.
The `node` process of this task is no longer consuming any CPU, uses only 887M of RAM, and it is impossible to `strace` it: strace starts to print `epoll_wait(13,` and then hangs.
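For anyone wanting to reproduce this observation on the worker host, something along these lines should do (a sketch; `<container_id>` and `<node_pid>` are placeholders to look up first):

```sh
docker top <container_id>            # locate the scraper's node process PID as seen on the host
strace -f -p <node_pid>              # on the stuck task this prints "epoll_wait(13," and then hangs
grep State /proc/<node_pid>/status   # kernel-side view of the process state (running, sleeping, ...)
```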
For comparison, the `node` process of the other task currently running on mwoffliner3 is consuming 100% to 200% CPU (1 to 2 whole CPUs out of 8) and its `strace` output is very chatty.
This looks like a nasty Docker / OS issue as far as I can tell for now.
I forgot to mention: the other `node` process, the one which works "well", consumes 1.6G of RAM.
I just ran `stress-ng` for a few seconds in the Docker container, and the container seems to be in perfect shape: the stress test consumed CPU as intended and finished properly. So maybe this is more of a `node` process issue?
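Something of this form is enough for that kind of sanity check (a sketch; these are standard `stress-ng` flags, `<container_id>` is a placeholder, and `stress-ng` has to be available in the image):

```sh
# Burn 2 CPUs for 10 seconds inside the container and report a short summary
docker exec -it <container_id> stress-ng --cpu 2 --timeout 10s --metrics-brief
```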
Since we have no other solution for now, please cancel these tasks and track them in this issue. We would like to better understand the impact. For convenience, please report in this issue the task URL, the recipe name, the worker, the offliner (including version), and, if the offliner is mwoffliner, the requested flavors and the flavors already produced.
I just canceled many tasks which were all in the same "shape" (on all mwoffliner workers, I can confirm `strace` produces the same result for all these tasks):
Task | Recipe Name | Worker | Offliner | Flavors requested | Flavors produced |
---|---|---|---|---|---|
c6d98afd-4066-459b-aef4-ee61b3e1ea7c | wikipedia_en_history | mwoffliner3 | mwoffliner 1.13.0 | [ "nopic:nopic", "novid:maxi" ] | nopic |
f911818e-5048-4cd6-a974-ca698c9686d0 | wikipedia_pl_all | mwoffliner1 | mwoffliner 1.13.0 | [ "nodet,nopic:mini", "nopic:nopic", "novid:maxi" ] | mini, nopic |
3c622fc1-82db-45e8-8139-5a16ba6e1b1f | wikipedia_zh_all | mwoffliner2 | mwoffliner 1.13.0 | [ "nodet,nopic:mini", "nopic:nopic", "novid:maxi" ] | mini, nopic |
049dffe8-6513-4d79-aed1-84cffa5337fe | wikipedia_af_all | mwoffliner4 | mwoffliner 1.13.0 | [ "nopic:nopic", "novid:maxi" ] | nopic |
6cfa93ce-4bd3-4617-81d5-7b80f9bc8055 | wikipedia_ur_top | athena18 | mwoffliner 1.13.0 | [ "nodet,nopic:mini", "nopic:nopic", "novid:maxi" ] | mini, nopic |
I just closed https://farm.openzim.org/pipeline/256727ae-8829-4525-82bc-53b096a890fc that had been running for 9 days.
I see however that the `mini` and `nopic` ZIMs had been completed after a mere 6 to 10 hours. Only the `maxi` is missing, but 8 more days to produce it seems a bit excessive (ms.wikipedia has only 368,000 articles).
Is there any chance that this could be related to https://phabricator.wikimedia.org/T349972, which seems to create massive files?