The first analysis above is wrong: the error is not raised by the sisyphus request but by the mwoffliner2 and mwoffliner1 requests. The sisyphus request happens a few seconds later and is fine.
mwoffliner2:

```
2023-11-17T07:42:38.346035918Z stderr F DEBUG:utils.scheduling:missing cpu:3, mem:0, dsk:0
2023-11-17T07:42:38.34628639Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:42:38.352302534Z stderr F [pid: 24|app: 0|req: 108/547] 100.64.6.175 () {52 vars in 1379 bytes} [Fri Nov 17 07:42:38 2023] GET /v1/requested-tasks/worker?worker=mwoffliner2&avail_cpu=0&avail_memory=10737418240&avail_disk=222104125440 => generated 51 bytes in 48 msecs (HTTP/1.1 200) 3 headers in 103 bytes (1 switches on core 0)
2023-11-17T07:42:38.352618559Z stdout F 100.64.6.175 - - [17/Nov/2023:07:42:38 +0000] "GET /v1/requested-tasks/worker?worker=mwoffliner2&avail_cpu=0&avail_memory=10737418240&avail_disk=222104125440 HTTP/1.1" 200 79 "-" "python-requests/2.30.0" "185.15.56.1"
```
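For reference, the poll seen in the access log can be replayed with a short script. The endpoint and query parameters are copied verbatim from the log line above; the base URL is a placeholder, not the actual farm API address.

```python
# Minimal sketch replaying the worker manager's poll, reconstructed from
# the access log above. BASE_URL is hypothetical; endpoint and parameters
# are copied from the logged request (worker mwoffliner2, avail_cpu=0).
import requests

BASE_URL = "https://api.example.farm"  # placeholder for the farm API

resp = requests.get(
    f"{BASE_URL}/v1/requested-tasks/worker",
    params={
        "worker": "mwoffliner2",
        "avail_cpu": 0,                # no CPU left on the worker
        "avail_memory": 10737418240,   # 10 GiB
        "avail_disk": 222104125440,    # ~207 GiB
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # requested tasks the worker may pick up
```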
sisyphus:

```
2023-11-17T07:42:41.425674513Z stderr F DEBUG:utils.scheduling:no request doable by worker (selfish=True)
2023-11-17T07:42:41.430460519Z stderr F [pid: 26|app: 0|req: 213/548] 100.64.6.175 () {52 vars in 1370 bytes} [Fri Nov 17 07:42:41 2023] GET /v1/requested-tasks/worker?worker=sisyphus&avail_cpu=6&avail_memory=16106127360&avail_disk=222873661440 => generated 51 bytes in 18 msecs (HTTP/1.1 200) 3 headers in 103 bytes (1 switches on core 0)
2023-11-17T07:42:41.430535758Z stdout F 100.64.6.175 - - [17/Nov/2023:07:42:41 +0000] "GET /v1/requested-tasks/worker?worker=sisyphus&avail_cpu=6&avail_memory=16106127360&avail_disk=222873661440 HTTP/1.1" 200 79 "-" "python-requests/2.30.0" "51.159.4.131"
```
mwoffliner1:

```
2023-11-17T07:43:26.144710139Z stderr F DEBUG:utils.scheduling:missing cpu:3, mem:0, dsk:0
2023-11-17T07:43:26.144755195Z stderr F ERROR:utils.scheduling:we have no preventing tasks. oops
2023-11-17T07:43:26.150531633Z stderr F [pid: 26|app: 0|req: 220/561] 100.64.6.175 () {52 vars in 1377 bytes} [Fri Nov 17 07:43:26 2023] GET /v1/requested-tasks/worker?worker=mwoffliner1&avail_cpu=0&avail_memory=4294967296&avail_disk=251886514176 => generated 51 bytes in 53 msecs (HTTP/1.1 200) 3 headers in 103 bytes (1 switches on core 0)
2023-11-17T07:43:26.150865359Z stdout F 100.64.6.175 - - [17/Nov/2023:07:43:26 +0000] "GET /v1/requested-tasks/worker?worker=mwoffliner1&avail_cpu=0&avail_memory=4294967296&avail_disk=251886514176 HTTP/1.1" 200 79 "-" "python-requests/2.30.0" "185.15.56.1"
```
There are probably no preventing tasks because mwoffliner1 and mwoffliner2 run tasks from both systems (farm.openzim.org and farm.youzim.it) and are full of tasks from the other instance: here the requests are made to youzim.it, while the workers are busy running tasks from openzim.org. A sketch of this failure mode follows.
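To make it concrete, here is a hedged sketch of the accounting that would produce the log lines above: the scheduler computes which resources are missing on the worker, then looks for "preventing" tasks, i.e. running tasks it knows about that hold those resources. When the worker's load comes from the other farm instance, this instance's running-task list is empty and the lookup comes up short. All function and field names below are illustrative assumptions, not the actual utils.scheduling code.

```python
# Illustrative re-creation of the "no preventing tasks" situation; names
# and structures are assumptions, not the real Zimfarm scheduler code.
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("utils.scheduling")


def find_preventing_tasks(worker_total, worker_avail, running_tasks):
    """Return the tasks that explain the worker's missing resources.

    running_tasks only holds tasks THIS instance started on the worker;
    tasks started by the other farm instance are invisible here.
    """
    missing = {key: worker_total[key] - worker_avail[key] for key in worker_total}
    logger.debug("missing cpu:%s, mem:%s, dsk:%s",
                 missing["cpu"], missing["memory"], missing["disk"])
    preventing = [
        task for task in running_tasks
        if any(task["resources"].get(key, 0) > 0
               for key, value in missing.items() if value > 0)
    ]
    if any(value > 0 for value in missing.values()) and not preventing:
        # Resources are in use, yet no task known to this instance accounts
        # for them: the worker is busy with the other farm's tasks.
        logger.error("we have no preventing tasks. oops")
    return preventing


# Demo matching the mwoffliner2 log: all 3 CPUs are taken by an
# openzim.org task, so youzim.it's running-task list is empty.
find_preventing_tasks(
    worker_total={"cpu": 3, "memory": 10737418240, "disk": 222104125440},
    worker_avail={"cpu": 0, "memory": 10737418240, "disk": 222104125440},
    running_tasks=[],
)
```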
Problem
In the logs of youzim.it, we have found multiple "we have no preventing tasks. oops" errors.
The root cause / fix has not yet been identified, but:
Production context
Logs (time is UTC)
Probable reproducing steps
wikimedia platform and running mwoffliner