openzim / zimfarm

Farm operated by bots to grow and harvest new zim files
https://farm.openzim.org
GNU General Public License v3.0
82 stars 25 forks source link

Some Zimfarm task are never finishing while the process is obviously finished #987

Open benoit74 opened 2 months ago

benoit74 commented 2 months ago

@kevinmcmurtrie reported:

Something is amiss with the worker manager. zimscraper_mwoffliner crashed and nothing noticed. zimscraper_ted finished and nothing noticed.

$ zimfarm ps
CONTAINER ID   tid       schedule name           task id                                CREATED       NAMES
38cd889c69cc   0dbbc     voa_learning_en_all     0dbbc1f4-8e60-471b-a158-b7fc77b370d0   6 hours ago   zimscraper_youtube_0dbbc
a711167ec440   0dbbc     voa_learning_en_all     0dbbc1f4-8e60-471b-a158-b7fc77b370d0   6 hours ago   dnscache_0dbbc
0fc91ef16670   0dbbc     voa_learning_en_all     0dbbc1f4-8e60-471b-a158-b7fc77b370d0   6 hours ago   zimtask_0dbbc
dbda16559f71   4436d     ted_topic_activism      4436db27-adcb-49c4-be0f-d74af0164bf0   2 days ago    dnscache_4436d
ad7dfd16fe54   4436d     ted_topic_activism      4436db27-adcb-49c4-be0f-d74af0164bf0   2 days ago    zimtask_4436d
428e1df3e4b3   12c2a     wikipedia_en_all_mini   12c2a80a-f0be-4fed-9d36-1078ab4b440c   2 days ago    dnscache_12c2a
4aed9b3cd658   12c2a     wikipedia_en_all_mini   12c2a80a-f0be-4fed-9d36-1078ab4b440c   2 days ago    zimtask_12c2a
b2bb4f750e7b                                                                            4 days ago    zimfarm-manager

$ zimfarm ps -a
CONTAINER ID   tid       schedule name                task id                                CREATED       NAMES
38cd889c69cc   0dbbc     voa_learning_en_all          0dbbc1f4-8e60-471b-a158-b7fc77b370d0   6 hours ago   zimscraper_youtube_0dbbc
a711167ec440   0dbbc     voa_learning_en_all          0dbbc1f4-8e60-471b-a158-b7fc77b370d0   6 hours ago   dnscache_0dbbc
0fc91ef16670   0dbbc     voa_learning_en_all          0dbbc1f4-8e60-471b-a158-b7fc77b370d0   6 hours ago   zimtask_0dbbc
f506be0a1d31   4436d     ted_topic_activism           4436db27-adcb-49c4-be0f-d74af0164bf0   2 days ago    zimscraper_ted_4436d
dbda16559f71   4436d     ted_topic_activism           4436db27-adcb-49c4-be0f-d74af0164bf0   2 days ago    dnscache_4436d
ad7dfd16fe54   4436d     ted_topic_activism           4436db27-adcb-49c4-be0f-d74af0164bf0   2 days ago    zimtask_4436d
32424149f271   12c2a     wikipedia_en_all_mini        12c2a80a-f0be-4fed-9d36-1078ab4b440c   2 days ago    zimscraper_mwoffliner_12c2a
428e1df3e4b3   12c2a     wikipedia_en_all_mini        12c2a80a-f0be-4fed-9d36-1078ab4b440c   2 days ago    dnscache_12c2a
4aed9b3cd658   12c2a     wikipedia_en_all_mini        12c2a80a-f0be-4fed-9d36-1078ab4b440c   2 days ago    zimtask_12c2a
b2bb4f750e7b                                                                                 4 days ago    zimfarm-manager
d437a06056d8   0c403     wikipedia_fr_all             0c403df2-04ae-4ba9-9ee0-88bc94f0741c   2 weeks ago   zimtask_0c403
572ac764701d   5516f     wikipedia_it_all             5516f87d-e250-4c7d-a327-495480fc87fb   5 weeks ago   zimtask_5516f
ae560bb4246a   4c6cb     math.stackexchange.com_en    4c6cbc3a-8cb8-4428-a539-4d68f8ca7a0e   5 weeks ago   zimtask_4c6cb
b1a893febb23   64eb3     stackoverflow.com_bard       64eb3e74-6581-4cd1-8e6e-d8f4c2ecdb8e   5 weeks ago   zimtask_64eb3
aff86fcb2816   2b216     rpg.stackexchange.com_en     2b21600a-753e-43f4-99da-c3240f197170   6 weeks ago   zimtask_2b216
ef4700145a7b   31b7d     stackoverflow.com_en_nopic   31b7da6f-b8d4-4af6-82f3-ba4461f6d1b0   6 weeks ago   zimtask_31b7d
33e0bf77ba7a   06514     math.stackexchange.com_en    0651425f-2214-44e4-a956-7c920f71bda2   6 weeks ago   zimtask_06514
2d7968791f52   f250d     stackoverflow.com_bard       f250d672-c5f0-43ba-bc63-ae4bbcfef9e1   6 weeks ago   zimtask_f250d

$ docker logs -t -n 25 32424149f271
2024-06-22T22:13:35.463255273Z T:118249; A:1727000; RA:1; CA:1726998; UA:1; FA:0; IA:1726977; C:36122; CC:36122; UC:0; WC:0
2024-06-22T22:14:43.723199506Z T:118317; A:1728000; RA:1; CA:1727998; UA:1; FA:0; IA:1727977; C:36143; CC:36143; UC:0; WC:0
2024-06-22T22:15:52.492638315Z T:118386; A:1729000; RA:1; CA:1728998; UA:1; FA:0; IA:1728977; C:36163; CC:36163; UC:0; WC:0
2024-06-22T22:17:44.308532506Z [error] [2024-06-22T22:17:44.308Z] Error downloading article Baby_O
2024-06-22T22:17:44.310943430Z [error] [2024-06-22T22:17:44.310Z] Failed to run mwoffliner after [130596s]: {
2024-06-22T22:17:44.310966809Z  "name": "AxiosError",
2024-06-22T22:17:44.310978679Z  "message": "Request failed with status code 500",
2024-06-22T22:17:44.310989159Z  "url": "https://en.wikipedia.org/api/rest_v1/page/html/Baby_O",
2024-06-22T22:17:44.310999319Z  "status": 500,
2024-06-22T22:17:44.311010639Z  "responseType": "json",
2024-06-22T22:17:44.311021729Z  "data": {
2024-06-22T22:17:44.311031579Z      "type": "https://mediawiki.org/wiki/HyperSwitch/errors/unknown_error",
2024-06-22T22:17:44.311041599Z      "method": "get",
2024-06-22T22:17:44.311051529Z      "uri": "/en.wikipedia.org/v1/page/html/Baby_O"
2024-06-22T22:17:44.311061349Z  }
2024-06-22T22:17:44.311070999Z }
2024-06-22T22:17:44.311081039Z [error] [2024-06-22T22:17:44.310Z] 
2024-06-22T22:17:44.311092749Z 
2024-06-22T22:17:44.311102549Z **********
2024-06-22T22:17:44.311112199Z 
2024-06-22T22:17:44.311121858Z Request failed with status code 500
2024-06-22T22:17:44.311131538Z 
2024-06-22T22:17:44.311141158Z **********
2024-06-22T22:17:44.311151038Z 
2024-06-22T22:17:44.311162158Z

$ docker logs -t -n 25 f506be0a1d31
2024-06-22T23:52:42.876250271Z  86600K .......... .......... .......... .......... .......... 99% 40.4M 0s
2024-06-22T23:52:42.878288518Z  86650K .......... .......... .......... .......... .......... 99% 24.0M 0s
2024-06-22T23:52:42.879758558Z  86700K .......... .......... .......... .......... .......... 99% 35.9M 0s
2024-06-22T23:52:42.881974034Z  86750K .......... .......... .......... .......... .......... 99% 21.0M 0s
2024-06-22T23:52:42.882443311Z  86800K .......... .......... .......... .......... .......... 99%  103M 0s
2024-06-22T23:52:42.883889681Z  86850K .......... .......... .......... .......... .......... 99% 33.8M 0s
2024-06-22T23:52:42.885778949Z  86900K .......... .......... .......... .......... .......... 99% 26.0M 0s
2024-06-22T23:52:42.887485228Z  86950K .......... .......... .......... .......... .......... 99% 28.4M 0s
2024-06-22T23:52:42.930227427Z  87000K .......... .......... .......... .......... .......... 99% 1.14M 0s
2024-06-22T23:52:42.931261850Z  87050K .......... .......... .......... .......... .......... 99% 46.3M 0s
2024-06-22T23:52:42.932252103Z  87100K .......... .......... .......... .......... .......... 99% 49.2M 0s
2024-06-22T23:52:42.934180861Z  87150K .......... .......... .......... .......... .......... 99% 27.1M 0s
2024-06-22T23:52:42.935385743Z  87200K .......... .......... .......... .......... .......... 99% 36.7M 0s
2024-06-22T23:52:42.935556312Z  87250K .......... .......... .......... .....                100%  205M=24s
2024-06-22T23:52:42.935566082Z 
2024-06-22T23:52:42.935615341Z 2024-06-22 23:52:42 (3.58 MB/s) - '/output/tmpj6khkklu/videos/45969/video.mp4' saved [89380032/89380032]
2024-06-22T23:52:42.935633951Z 
2024-06-22T23:52:43.058433654Z [ted2zim::2024-06-22 23:52:43,058] INFO:downloaded /output/tmpj6khkklu/videos/45969/speaker.webp from cache at speaker_image/45969
2024-06-22T23:52:43.226574449Z [ted2zim::2024-06-22 23:52:43,226] INFO:downloaded /output/tmpj6khkklu/videos/45969/thumbnail.webp from cache at thumbnail/45969
2024-06-22T23:55:19.021629243Z [ted2zim::2024-06-22 23:55:19,021] INFO:uploaded /output/tmpj6khkklu/videos/45969/video.webm to cache at webm/low/45969
2024-06-23T04:22:30.667207280Z [ted2zim::2024-06-23 04:22:30,667] INFO:building ZIM file
2024-06-23T04:22:53.758871509Z Resolve redirect
2024-06-23T04:22:53.759405575Z set index
2024-06-23T04:23:15.899388598Z [ted2zim::2024-06-23 04:23:15,899] INFO:removing temp folder
2024-06-23T04:23:16.299116000Z [ted2zim::2024-06-23 04:23:16,298] INFO:Done Everything
benoit74 commented 2 months ago

What we see is that the scraper container have finished (containers f506be0a1d31 and 32424149f271 gone).

But the task containers which are responsible to manage the whole task execution are still there (containers ad7dfd16fe54 and 4aed9b3cd658).

@kevinmcmurtrie could you please give us the last lines of these two containers as well?

kevinmcmurtrie commented 2 months ago

They don't seem to be doing anything.

$ docker logs -t -n 25 ad7dfd16fe54
2024-06-21T20:53:53.607318797Z  username=pixelmemory
2024-06-21T20:53:53.607327547Z  webapi_uris=['https://api.farm.openzim.org/v1']
2024-06-21T20:53:53.607335687Z  workdir=/data
2024-06-21T20:53:53.607350767Z  task_id=4436db27-adcb-49c4-be0f-d74af0164bf0
2024-06-21T20:53:53.607360177Z [2024-06-21 20:53:53,607: INFO] testing workdir at /data…
2024-06-21T20:53:53.607369237Z [2024-06-21 20:53:53,607: INFO]  workdir is available and writable
2024-06-21T20:53:53.607377407Z [2024-06-21 20:53:53,607: INFO] testing private key at /etc/ssh/keys/zimfarm…
2024-06-21T20:53:53.691983592Z [2024-06-21 20:53:53,691: INFO]  private key is available and readable (a3121794f8e27e40c2f20782fae70eab)
2024-06-21T20:53:53.692023802Z [2024-06-21 20:53:53,691: INFO] testing authentication with https://api.farm.openzim.org/v1…
2024-06-21T20:53:54.752378767Z [2024-06-21 20:53:54,752: INFO]  authentication successful
2024-06-21T20:53:54.752433407Z [2024-06-21 20:53:54,752: INFO] testing docker API on /var/run/docker.sock…
2024-06-21T20:53:54.770756971Z [2024-06-21 20:53:54,770: INFO]  docker API access successful
2024-06-21T20:53:54.782102019Z [2024-06-21 20:53:54,782: INFO] Host hardware resources:
2024-06-21T20:53:54.782126059Z  CPU : 20 (total) ;  11 (avail)
2024-06-21T20:53:54.782144659Z  RAM : 160 GiB (total) ;  134 GiB (avail)
2024-06-21T20:53:54.782151699Z  Disk: 3 TiB (configured) ; 9.59 TiB (avail) ; 203.13 GiB (reserved) ; 2.8 TiB (remain)
2024-06-21T20:53:54.782161509Z [2024-06-21 20:53:54,782: INFO] registering exit signals
2024-06-21T20:53:54.782169919Z [2024-06-21 20:53:54,782: INFO] Fetching task details for 4436db27-adcb-49c4-be0f-d74af0164bf0
2024-06-21T20:53:55.310394299Z [2024-06-21 20:53:55,310: INFO] Updating task-status=started
2024-06-21T20:53:55.846879537Z [2024-06-21 20:53:55,846: INFO] Setting-up workdir
2024-06-21T20:53:55.849309102Z [2024-06-21 20:53:55,849: INFO] Starting DNS cache
2024-06-21T20:53:57.033792153Z [2024-06-21 20:53:57,033: DEBUG] DNS Cache started using IPs: ['172.17.0.13']
2024-06-21T20:53:57.033822502Z [2024-06-21 20:53:57,033: INFO] Starting scraper. Expects files at: /Yosemite/zimfarm/data/4436db27-adcb-49c4-be0f-d74af0164bf0 
2024-06-21T20:53:57.035737170Z [2024-06-21 20:53:57,035: DEBUG] Pulling image ghcr.io/openzim/ted:dev
2024-06-21T20:54:24.173183862Z [2024-06-21 20:54:24,173: INFO] Updating task-status=scraper_started
$ docker logs -t -n 25 4aed9b3cd658
2024-06-21T10:01:00.779871640Z  webapi_uris=['https://api.farm.openzim.org/v1']
2024-06-21T10:01:00.779880010Z  workdir=/data
2024-06-21T10:01:00.779895310Z  task_id=12c2a80a-f0be-4fed-9d36-1078ab4b440c
2024-06-21T10:01:00.779906280Z [2024-06-21 10:01:00,779: INFO] testing workdir at /data…
2024-06-21T10:01:00.779918120Z [2024-06-21 10:01:00,779: INFO]  workdir is available and writable
2024-06-21T10:01:00.779926470Z [2024-06-21 10:01:00,779: INFO] testing private key at /etc/ssh/keys/zimfarm…
2024-06-21T10:01:00.863235667Z [2024-06-21 10:01:00,863: INFO]  private key is available and readable (a3121794f8e27e40c2f20782fae70eab)
2024-06-21T10:01:00.863266247Z [2024-06-21 10:01:00,863: INFO] testing authentication with https://api.farm.openzim.org/v1…
2024-06-21T10:01:01.973726177Z [2024-06-21 10:01:01,973: INFO]  authentication successful
2024-06-21T10:01:01.973772017Z [2024-06-21 10:01:01,973: INFO] testing docker API on /var/run/docker.sock…
2024-06-21T10:01:01.989067831Z [2024-06-21 10:01:01,989: INFO]  docker API access successful
2024-06-21T10:01:01.997389094Z [2024-06-21 10:01:01,997: INFO] Host hardware resources:
2024-06-21T10:01:01.997401284Z  CPU : 20 (total) ;  14 (avail)
2024-06-21T10:01:01.997415134Z  RAM : 160 GiB (total) ;  148 GiB (avail)
2024-06-21T10:01:01.997420814Z  Disk: 3 TiB (configured) ; 9.62 TiB (avail) ; 173.13 GiB (reserved) ; 2.83 TiB (remain)
2024-06-21T10:01:01.997427184Z [2024-06-21 10:01:01,997: INFO] registering exit signals
2024-06-21T10:01:01.997447344Z [2024-06-21 10:01:01,997: INFO] Fetching task details for 12c2a80a-f0be-4fed-9d36-1078ab4b440c
2024-06-21T10:01:02.509530841Z [2024-06-21 10:01:02,509: INFO] Updating task-status=started
2024-06-21T10:01:02.997001147Z [2024-06-21 10:01:02,996: INFO] Setting-up workdir
2024-06-21T10:01:02.999647909Z [2024-06-21 10:01:02,999: INFO] Starting DNS cache
2024-06-21T10:01:03.766344035Z [2024-06-21 10:01:03,766: DEBUG] DNS Cache started using IPs: ['172.17.0.10']
2024-06-21T10:01:03.766374615Z [2024-06-21 10:01:03,766: INFO] Starting scraper. Expects files at: /Yosemite/zimfarm/data/12c2a80a-f0be-4fed-9d36-1078ab4b440c 
2024-06-21T10:01:03.767748795Z [2024-06-21 10:01:03,767: DEBUG] Pulling image ghcr.io/openzim/mwoffliner:1.13.0
2024-06-21T10:01:05.176917791Z [2024-06-21 10:01:05,176: INFO] Updating task-status=scraper_started
2024-06-21T19:15:49.220726258Z [2024-06-21 19:15:49,220: ERROR] ConnectionError (attempt 1) for PATCH https://api.farm.openzim.org/v1/tasks/12c2a80a-f0be-4fed-9d36-1078ab4b440c -- HTTPSConnectionPool(host='api.farm.openzim.org', port=443): Read timed out. (read timeout=None)

Sending them a CTRL-C:

$ docker attach ad7dfd16fe54
^C[2024-06-24 06:49:43,950: INFO] received exit signal (Interrupt), shutting down…
[2024-06-24 06:49:43,950: INFO] Stopping all containers and actions
[2024-06-24 06:49:43,950: INFO] Stopping and removing monitor
[2024-06-24 06:49:43,950: INFO] Stopping and removing dnscache
[2024-06-24 06:49:44,314: INFO] Stopping and removing scraper
[2024-06-24 06:49:44,554: INFO] Stopping and removing log_uploader
[2024-06-24 06:49:44,554: INFO] Stopping and removing artifacts_uploader
[2024-06-24 06:49:44,554: INFO] Stopping and removing uploader
[2024-06-24 06:49:44,554: INFO] Stopping and removing checker
[2024-06-24 06:49:44,554: INFO] Removing task workdir /data
[2024-06-24 06:49:44,554: WARNING] ZIM files exists. removing anyway: [('ted_mul_activism_2024-06.zim', '4.66 GiB')]
[2024-06-24 06:49:44,556: INFO] Updating task-status=canceled
$ docker attach 4aed9b3cd658
^C[2024-06-24 06:50:52,176: INFO] received exit signal (Interrupt), shutting down…
[2024-06-24 06:50:52,176: INFO] Stopping all containers and actions
[2024-06-24 06:50:52,176: INFO] Stopping and removing monitor
[2024-06-24 06:50:52,176: INFO] Stopping and removing dnscache
[2024-06-24 06:50:52,530: INFO] Stopping and removing scraper
[2024-06-24 06:50:52,756: INFO] Stopping and removing log_uploader
[2024-06-24 06:50:52,756: INFO] Stopping and removing artifacts_uploader
[2024-06-24 06:50:52,756: INFO] Stopping and removing uploader
[2024-06-24 06:50:52,756: INFO] Stopping and removing checker
[2024-06-24 06:50:52,756: INFO] Removing task workdir /data
[2024-06-24 06:50:52,764: INFO] Updating task-status=canceled
benoit74 commented 2 months ago

So clearly it looks like we have a concern on the task manager which is not detecting that scraper container has finished / stopped (or not checking anymore maybe ...)

rgaudin commented 2 months ago

Thank you @kevinmcmurtrie ; next time please include the inspect output (at least the State part).

Task manager queries docker API for scraper container status and condiders the following as running:

https://github.com/openzim/zimfarm/blob/dc4a00fa00c0106ee576e00c7a6ed145f15ac464/workers/app/common/docker.py#L39

I'd wait for another occurrence (with the status clue) before taking action

benoit74 commented 2 months ago

Thank you @rgaudin for the wise input!