Scheduling logic is not handling nicely the case where multiple web APIs are requested by a single worker

benoit74 commented 11 months ago

Location: API

Problem

In the logs of youzim.it, we have found multiple "we have no preventing tasks. oops" errors.

Root cause/fix is not yet identified but:

this is not expected
we probably know how to reproduce the error

Production context

youzim.it
sisyphus worker (selfish, running only zimit tasks)
mwoffliner workers (not selfish, running only mwoffliner tasks)
error happen from the moment where a mwoffliner (not assigned to selfish worker) task is requested up to the moment where it is reserved by another worker

Logs (time is UTC)

2023-10-21T15:35:16.550200841Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-14T15:27:18.66098005Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-14T15:27:20.761035627Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-14T15:30:20.531044035Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:39:36.334360852Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:40:23.907365077Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:42:38.34628639Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:43:26.144755195Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:45:40.354894661Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:46:28.245339313Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:48:42.343705986Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:49:30.422030294Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:51:44.340916169Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:52:32.598759755Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:54:46.34899812Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops

Probable reproducing steps

we have a selfish worker (sisyphus in PROD)
a task is requested but not assigned to the selfish worker (i.e. it won't be run by this worker)
- (probably not relevant) the task is assigned to the wikimedia platform and running mwoffliner
when the selfish worker request tasks to be done (GET /v1/requested-tasks/worker), we get the "we have no preventing tasks. oops" error in the API logs

benoit74 commented 11 months ago

First analysis above is wrong. Error is not raised by sisyphus but by mwoffliner2 or mwoffliner1 requests. sisyphus request happens few secs later and is ok.

mwoffliner2:

2023-11-17T07:42:38.346035918Z stderr F DEBUG:utils.scheduling:missing cpu:3, mem:0, dsk:0
2023-11-17T07:42:38.34628639Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:42:38.352302534Z stderr F [pid: 24|app: 0|req: 108/547] 100.64.6.175 () {52 vars in 1379 bytes} [Fri Nov 17 07:42:38 2023] GET /v1/requested-tasks/worker?worker=mwoffliner2&avail_cpu=0&avail_memory=10737418240&avail_disk=222104125440 => generated 51 bytes in 48 msecs (HTTP/1.1 200) 3 headers in 103 bytes (1 switches on core 0)
2023-11-17T07:42:38.352618559Z stdout F 100.64.6.175 - - [17/Nov/2023:07:42:38 +0000] "GET /v1/requested-tasks/worker?worker=mwoffliner2&avail_cpu=0&avail_memory=10737418240&avail_disk=222104125440 HTTP/1.1" 200 79 "-" "python-requests/2.30.0" "185.15.56.1"

sisyphus:

2023-11-17T07:42:41.425674513Z stderr F DEBUG:utils.scheduling:no request doable by worker (selfish=True)
2023-11-17T07:42:41.430460519Z stderr F [pid: 26|app: 0|req: 213/548] 100.64.6.175 () {52 vars in 1370 bytes} [Fri Nov 17 07:42:41 2023] GET /v1/requested-tasks/worker?worker=sisyphus&avail_cpu=6&avail_memory=16106127360&avail_disk=222873661440 => generated 51 bytes in 18 msecs (HTTP/1.1 200) 3 headers in 103 bytes (1 switches on core 0)
2023-11-17T07:42:41.430535758Z stdout F 100.64.6.175 - - [17/Nov/2023:07:42:41 +0000] "GET /v1/requested-tasks/worker?worker=sisyphus&avail_cpu=6&avail_memory=16106127360&avail_disk=222873661440 HTTP/1.1" 200 79 "-" "python-requests/2.30.0" "51.159.4.131"

mwoffliner1:

2023-11-17T07:43:26.144710139Z stderr F DEBUG:utils.scheduling:missing cpu:3, mem:0, dsk:0
2023-11-17T07:43:26.144755195Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:43:26.150531633Z stderr F [pid: 26|app: 0|req: 220/561] 100.64.6.175 () {52 vars in 1377 bytes} [Fri Nov 17 07:43:26 2023] GET /v1/requested-tasks/worker?worker=mwoffliner1&avail_cpu=0&avail_memory=4294967296&avail_disk=251886514176 => generated 51 bytes in 53 msecs (HTTP/1.1 200) 3 headers in 103 bytes (1 switches on core 0)
2023-11-17T07:43:26.150865359Z stdout F 100.64.6.175 - - [17/Nov/2023:07:43:26 +0000] "GET /v1/requested-tasks/worker?worker=mwoffliner1&avail_cpu=0&avail_memory=4294967296&avail_disk=251886514176 HTTP/1.1" 200 79 "-" "python-requests/2.30.0" "185.15.56.1"

benoit74 commented 11 months ago

There is probably no preventing tasks because mwoffliner1 and mwoffliner2 are running tasks from both systems (farm.openzim.org and farm.youzim.it) and they are full of task from the other instance (here requests are done on youzim.it but the workers are running tasks from openzim.org).

openzim / zimfarm