openzim / zimfarm

Farm operated by bots to grow and harvest new zim files
https://farm.openzim.org
GNU General Public License v3.0
81 stars 25 forks source link

Scheduling logic is not handling nicely the case where multiple web APIs are requested by a single worker #871

Open benoit74 opened 11 months ago

benoit74 commented 11 months ago

Problem

In the logs of youzim.it, we have found multiple "we have no preventing tasks. oops" errors.

Root cause/fix is not yet identified but:

Production context

Logs (time is UTC)

2023-10-21T15:35:16.550200841Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-14T15:27:18.66098005Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-14T15:27:20.761035627Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-14T15:30:20.531044035Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:39:36.334360852Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:40:23.907365077Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:42:38.34628639Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:43:26.144755195Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:45:40.354894661Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:46:28.245339313Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:48:42.343705986Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:49:30.422030294Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:51:44.340916169Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:52:32.598759755Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:54:46.34899812Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops

Probable reproducing steps

benoit74 commented 11 months ago

First analysis above is wrong. Error is not raised by sisyphus but by mwoffliner2 or mwoffliner1 requests. sisyphus request happens few secs later and is ok.

mwoffliner2:

2023-11-17T07:42:38.346035918Z stderr F DEBUG:utils.scheduling:missing cpu:3, mem:0, dsk:0
2023-11-17T07:42:38.34628639Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:42:38.352302534Z stderr F [pid: 24|app: 0|req: 108/547] 100.64.6.175 () {52 vars in 1379 bytes} [Fri Nov 17 07:42:38 2023] GET /v1/requested-tasks/worker?worker=mwoffliner2&avail_cpu=0&avail_memory=10737418240&avail_disk=222104125440 => generated 51 bytes in 48 msecs (HTTP/1.1 200) 3 headers in 103 bytes (1 switches on core 0)
2023-11-17T07:42:38.352618559Z stdout F 100.64.6.175 - - [17/Nov/2023:07:42:38 +0000] "GET /v1/requested-tasks/worker?worker=mwoffliner2&avail_cpu=0&avail_memory=10737418240&avail_disk=222104125440 HTTP/1.1" 200 79 "-" "python-requests/2.30.0" "185.15.56.1"

sisyphus:

2023-11-17T07:42:41.425674513Z stderr F DEBUG:utils.scheduling:no request doable by worker (selfish=True)
2023-11-17T07:42:41.430460519Z stderr F [pid: 26|app: 0|req: 213/548] 100.64.6.175 () {52 vars in 1370 bytes} [Fri Nov 17 07:42:41 2023] GET /v1/requested-tasks/worker?worker=sisyphus&avail_cpu=6&avail_memory=16106127360&avail_disk=222873661440 => generated 51 bytes in 18 msecs (HTTP/1.1 200) 3 headers in 103 bytes (1 switches on core 0)
2023-11-17T07:42:41.430535758Z stdout F 100.64.6.175 - - [17/Nov/2023:07:42:41 +0000] "GET /v1/requested-tasks/worker?worker=sisyphus&avail_cpu=6&avail_memory=16106127360&avail_disk=222873661440 HTTP/1.1" 200 79 "-" "python-requests/2.30.0" "51.159.4.131"

mwoffliner1:

2023-11-17T07:43:26.144710139Z stderr F DEBUG:utils.scheduling:missing cpu:3, mem:0, dsk:0
2023-11-17T07:43:26.144755195Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:43:26.150531633Z stderr F [pid: 26|app: 0|req: 220/561] 100.64.6.175 () {52 vars in 1377 bytes} [Fri Nov 17 07:43:26 2023] GET /v1/requested-tasks/worker?worker=mwoffliner1&avail_cpu=0&avail_memory=4294967296&avail_disk=251886514176 => generated 51 bytes in 53 msecs (HTTP/1.1 200) 3 headers in 103 bytes (1 switches on core 0)
2023-11-17T07:43:26.150865359Z stdout F 100.64.6.175 - - [17/Nov/2023:07:43:26 +0000] "GET /v1/requested-tasks/worker?worker=mwoffliner1&avail_cpu=0&avail_memory=4294967296&avail_disk=251886514176 HTTP/1.1" 200 79 "-" "python-requests/2.30.0" "185.15.56.1"
benoit74 commented 11 months ago

There is probably no preventing tasks because mwoffliner1 and mwoffliner2 are running tasks from both systems (farm.openzim.org and farm.youzim.it) and they are full of task from the other instance (here requests are done on youzim.it but the workers are running tasks from openzim.org).