official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests
281 stars 130 forks source link

Dynamic updates of connections (to respect connection limit). #2050

Closed vdbergh closed 3 months ago

vdbergh commented 4 months ago

This is a PR on top of #2049

vdbergh commented 4 months ago

I am a bit conflicted about this PR.

There is no mechanism to validate the code. This seems to require a more complicated data structure.

vondele commented 4 months ago

actually, just an idea, do we still need this limit? I don't think we have really meaningfully used it, and we have now some better blocking mechanisms in place for users and workers that could go rogue.

vdbergh commented 4 months ago

I would not be opposed to deleting that code.

vdbergh commented 3 months ago

@ppigazzini What is your opinion?

One advantage of having the restriction inside Fishtest is that we can return a meaningful error message to the user.

vondele commented 3 months ago

my second thought was that actually the check protects a bit against an attack, from a single IP address. But that's no serious hurdle for somebody malicious. So, yeah, I'm not 100% certain removing it is a good idea.

ppigazzini commented 3 months ago

Tested fine on DEV, added to PROD to test it with the fleet load.

vondele commented 3 months ago

no obvious issues on prod for me. request_task average time remained roughly the same time. Sampling gives the following profile:

===============================================================
Collecting 120 samples for pserve 6543 spaced 0.5 seconds apart
Total time needed 60 seconds 
Found pid 24666 for pserve 6543
Showing 10 most frequent samples traces: 
38 
     File "util.py", line 749, in __next_schedule self.__event.wait(delay) 
20 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 837, in request_task with self.request_task_lock: 
9 
     File "api.py", line 308, in beat self.validate_request() 
          File "api.py", line 129, in validate_request run = self.request.rundb.get_run(run_id) 
               File "rundb.py", line 489, in get_run with self.run_cache_lock: 
8 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 838, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 951, in sync_request_task task_name = worker_name(task["worker_info"], short=True) 
                    File "util.py", line 41, in worker_name def worker_name(worker_info, short=False): 
8 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 838, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 940, in sync_request_task (run, task_id, task) 
6 
     File "util.py", line 751, in __next_schedule next_task.do_work() 
          File "util.py", line 634, in do_work self.worker(*self.args, *self.kwargs) 
               File "rundb.py", line 569, in flush_buffers self.runs.replace_one({"_id": oldest_run["_id"]}, oldest_run) 
5 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 838, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 951, in sync_request_task task_name = worker_name(task["worker_info"], short=True) 
                    File "util.py", line 44, in worker_name cores = str(worker_info["concurrency"]) 
3 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 838, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 951, in sync_request_task task_name = worker_name(task["worker_info"], short=True) 
                    File "util.py", line 47, in worker_name name = "{}-{}cores".format(username, cores) 
2 
     File "api.py", line 237, in update_task self.validate_request() 
          File "api.py", line 129, in validate_request run = self.request.rundb.get_run(run_id) 
               File "rundb.py", line 489, in get_run with self.run_cache_lock: 
2 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 838, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 952, in sync_request_task if my_name == task_name: 
vdbergh commented 3 months ago

It seems some time is spent computing the worker_name associated with a task (short and long). This should be only done once, and the saved in the worker_info associated with the task.

vdbergh commented 3 months ago

File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) File "rundb.py", line 837, in request_task with self.request_task_lock:

It would probably be good to relax the synchronization on request_task() (i.e. make it possible for more than one copy to run in parallel). Only the part where the new task is attached to the run should remain synchronized.

EDIT: That synchronization is much finer since it is per run.

vondele commented 3 months ago

not directly sure if this is related to this patch, but image i.e. fleet workers have disconnected about 30min ago, but the worker list on the main page still shows them all.

pserve at 6545 100% loaded and currently:

# logging from [05/Jun/2024:04:26:20 +0000] to [05/Jun/2024:04:27:10 +0000] ########
# duration (seconds)            : 50
# calls in total                : 100
# calls per second              : 2
# calls not reaching the backend: 
# calls handled by the backend  : 100
#                         route    calls      total    average    minimum    maximum
                   /api/actions       12    338.327     28.194      8.318     30.003
                /api/upload_pgn        3     89.496     29.832     29.736     30.002
               /api/active_runs        1     30.000     30.000     30.000     30.000
                         /tests        4      2.237      0.559      0.014      0.781
                       /actions        1      0.298      0.298      0.298      0.298

sampling pserve 6545 doesn't give any samples (i.e. no threads in the python code?)

I've restarted the server.

Edit: which fixes the problem

vdbergh commented 3 months ago

I wonder if this means that mongod had died… with 6543 running off the cache.

vdbergh commented 3 months ago

If we decide to keep the connection limit then in a future PR I will use a slightly more general data structure whose validity can be checked (already prototyped).

Currently at any given instant the connection count is correct (or should be) but we cannot check this since it would require stopping the world (acquiring all run locks) which would be too expensive.

dubslow commented 3 months ago

fwiw, the user pages (e.g. mine but also all others) had died a few hours into the fleet. all user pages 504d. however all other pages and workers appeared to continue to work fine.

perhaps this worker list problem and user pages 504ing are related...?

vdbergh commented 3 months ago

It’s consistent with my theory that mongod was dead. The user pages are pulled from the db, but unlike the main page they are not cached IIRC.

ppigazzini commented 3 months ago

From systemd logs mongod stopped and restart: yesterday, when I added the PR, and this morning, when Joost restarted the server. Checking other logs.

~$ sudo journalctl -u mongod --since "24 hours ago"
-- Logs begin at Mon 2023-11-27 14:51:35 UTC, end at Wed 2024-06-05 11:52:40 UTC. --
Jun 04 21:59:42 tests.stockfishchess.org systemd[1]: Stopping MongoDB Database Server...
Jun 04 21:59:43 tests.stockfishchess.org systemd[1]: Stopped MongoDB Database Server.
Jun 04 21:59:43 tests.stockfishchess.org systemd[1]: Started MongoDB Database Server.
Jun 04 21:59:43 tests.stockfishchess.org mongod[24357]: {"t":{"$date":"2024-06-04T21:59:43.789Z"},"s":"I",  "c":"CONTROL",  "id":5760901, "ctx":"-","msg":"Applied --setParameter options","attr":{"serverParameters":{"tcmallocAggressiveMem
Jun 04 21:59:43 tests.stockfishchess.org mongod[24357]: {"t":{"$date":"2024-06-04T21:59:43.789Z"},"s":"I",  "c":"CONTROL",  "id":7484500, "ctx":"-","msg":"Environment variable MONGODB_CONFIG_OVERRIDE_NOFORK == 1, overriding \"processMana
Jun 05 04:33:21 tests.stockfishchess.org systemd[1]: Stopping MongoDB Database Server...
Jun 05 04:33:21 tests.stockfishchess.org systemd[1]: Stopped MongoDB Database Server.
Jun 05 04:33:21 tests.stockfishchess.org systemd[1]: Started MongoDB Database Server.
Jun 05 04:33:22 tests.stockfishchess.org mongod[3704]: {"t":{"$date":"2024-06-05T04:33:22.040Z"},"s":"I",  "c":"CONTROL",  "id":5760901, "ctx":"-","msg":"Applied --setParameter options","attr":{"serverParameters":{"tcmallocAggressiveMemo
Jun 05 04:33:22 tests.stockfishchess.org mongod[3704]: {"t":{"$date":"2024-06-05T04:33:22.041Z"},"s":"I",  "c":"CONTROL",  "id":7484500, "ctx":"-","msg":"Environment variable MONGODB_CONFIG_OVERRIDE_NOFORK == 1, overriding \"processManag
ppigazzini commented 3 months ago

mongod.log seems to show that the MongoDB was up and running fine, then it was restarted by Joost, keeping working fine.

{"t":{"$date":"2024-06-05T04:28:03.831+00:00"},"s":"I",  "c":"WRITE",    "id":51803,   "ctx":"conn18","msg":"Slow query","attr":{"type":"update","ns":"fishtest_new.runs","command":{"q":{"_id":{"$oid":"665d665cfd45fb0f907c7778"}},"u":{"_i
d":{"$oid":"665d665cfd45fb0f907c7778"},"version":0,"args":{"base_tag":"POnes^","new_tag":"POnes","base_nets":["nn-9b70b2b6fdf3.nnue","nn-37f18f62d772.nnue"],"new_nets":["nn-9b70b2b6fdf3.nnue","nn-37f18f62d772.nnue"],"num_games":800000,"t
c":"10+0.1","new_tc":"10+0.1","book":"UHO_Lichess_4852_v1.epd","book_depth":"8","threads":1,"resolved_base":"6348dc2d6aa0247328557b245a6343e6e689d7ca","resolved_new":"5384ef44638e5f4a5f03fd05c650309392d49f1d","master_sha":"8aaae0367cfed7
ae5da54d330b65d76d4b1b13ae","official_master_sha":"8aaae0367cfed7ae5da54d330b65d76d4b1b13ae","msg_base":"Disable VNNI","msg_new":"Don't call _mm_set1_epi16(1) each time.","base_options":"Hash=16","new_options":"Hash=16","info":"Don't cal

...

{"t":{"$date":"2024-06-05T04:32:22.723+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn9","msg":"Slow query","attr":{"type":"command","ns":"fishtest_new.runs","command":{"aggregate":"runs","pipeline":[{"$match":{"finished":tru
e}},{"$group":{"_id":1,"n":{"$sum":1}}}],"cursor":{},"lsid":{"id":{"$uuid":"dcb07893-68a8-4f36-b340-15e30dd9788c"}},"$db":"fishtest_new"},"planSummary":"COUNT_SCAN { finished: 1, last_updated: -1 }","keysExamined":158850,"docsExamined":0
,"cursorExhausted":true,"numYields":158,"nreturned":1,"queryHash":"E8053644","planCacheKey":"683CFFBD","queryFramework":"classic","reslen":130,"locks":{"FeatureCompatibilityVersion":{"acquireCount":{"r":160}},"Global":{"acquireCount":{"r
":160}},"Mutex":{"acquireCount":{"r":2}}},"storage":{},"remote":"127.0.0.1:36894","protocol":"op_msg","durationMillis":124}}
{"t":{"$date":"2024-06-05T04:33:00.658+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn9","msg":"Slow query","attr":{"type":"command","ns":"fishtest_new.runs","command":{"aggregate":"runs","pipeline":[{"$match":{"finished":tru
e}},{"$group":{"_id":1,"n":{"$sum":1}}}],"cursor":{},"lsid":{"id":{"$uuid":"dcb07893-68a8-4f36-b340-15e30dd9788c"}},"$db":"fishtest_new"},"planSummary":"COUNT_SCAN { finished: 1, last_updated: -1 }","keysExamined":158850,"docsExamined":0
,"cursorExhausted":true,"numYields":158,"nreturned":1,"queryHash":"E8053644","planCacheKey":"683CFFBD","queryFramework":"classic","reslen":130,"locks":{"FeatureCompatibilityVersion":{"acquireCount":{"r":160}},"Global":{"acquireCount":{"r
":160}},"Mutex":{"acquireCount":{"r":2}}},"storage":{},"remote":"127.0.0.1:36894","protocol":"op_msg","durationMillis":111}}
{"t":{"$date":"2024-06-05T04:33:18.089+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn9","msg":"Slow query","attr":{"type":"command","ns":"fishtest_new.runs","command":{"aggregate":"runs","pipeline":[{"$match":{"finished":tru
e}},{"$group":{"_id":1,"n":{"$sum":1}}}],"cursor":{},"lsid":{"id":{"$uuid":"dcb07893-68a8-4f36-b340-15e30dd9788c"}},"$db":"fishtest_new"},"planSummary":"COUNT_SCAN { finished: 1, last_updated: -1 }","keysExamined":158850,"docsExamined":0
,"cursorExhausted":true,"numYields":158,"nreturned":1,"queryHash":"E8053644","planCacheKey":"683CFFBD","queryFramework":"classic","reslen":130,"locks":{"FeatureCompatibilityVersion":{"acquireCount":{"r":160}},"Global":{"acquireCount":{"r
":160}},"Mutex":{"acquireCount":{"r":2}}},"storage":{},"remote":"127.0.0.1:36894","protocol":"op_msg","durationMillis":105}}
{"t":{"$date":"2024-06-05T04:33:19.615+00:00"},"s":"I",  "c":"-",        "id":20883,   "ctx":"conn4","msg":"Interrupted operation as its client disconnected","attr":{"opId":620154}}
{"t":{"$date":"2024-06-05T04:33:19.615+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn4","msg":"Connection ended","attr":{"remote":"127.0.0.1:36866","uuid":"d7def0ef-f88d-4e13-afb6-ea80c946f89b","connectionId":4,"connectionCo
unt":20}}
{"t":{"$date":"2024-06-05T04:33:19.616+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn5","msg":"Connection ended","attr":{"remote":"127.0.0.1:36868","uuid":"58e8cf86-57d2-4ed3-9a8f-8924bdfe7938","connectionId":5,"connectionCo
unt":19}}
{"t":{"$date":"2024-06-05T04:33:19.616+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn6","msg":"Connection ended","attr":{"remote":"127.0.0.1:36870","uuid":"f75f55f0-68a7-46e6-9897-ca7671c35ae0","connectionId":6,"connectionCo
unt":18}}
{"t":{"$date":"2024-06-05T04:33:19.634+00:00"},"s":"I",  "c":"-",        "id":20883,   "ctx":"conn7","msg":"Interrupted operation as its client disconnected","attr":{"opId":620160}}
{"t":{"$date":"2024-06-05T04:33:19.635+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn7","msg":"Connection ended","attr":{"remote":"127.0.0.1:36884","uuid":"1216fdf0-9ece-4c28-af29-db034a2011c2","connectionId":7,"connectionCo
unt":17}}

...

{"t":{"$date":"2024-06-05T04:33:21.535+00:00"},"s":"I",  "c":"CONTROL",  "id":23138,   "ctx":"SignalHandler","msg":"Shutting down","attr":{"exitCode":0}}

{"t":{"$date":"2024-06-05T04:33:22.041+00:00"},"s":"I",  "c":"CONTROL",  "id":20698,   "ctx":"-","msg":"***** SERVER RESTARTED *****"}
{"t":{"$date":"2024-06-05T04:33:22.045+00:00"},"s":"I",  "c":"NETWORK",  "id":4915701, "ctx":"-","msg":"Initialized wire specification","attr":{"spec":{"incomingExternalClient":{"minWireVersion":0,"maxWireVersion":17},"incomingInternalCl
ient":{"minWireVersion":0,"maxWireVersion":17},"outgoing":{"minWireVersion":6,"maxWireVersion":17},"isInternalClient":true}}}
{"t":{"$date":"2024-06-05T04:33:22.046+00:00"},"s":"I",  "c":"CONTROL",  "id":23285,   "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2024-06-05T04:33:22.048+00:00"},"s":"I",  "c":"NETWORK",  "id":4648601, "ctx":"main","msg":"Implicit TCP FastOpen unavailable. If TCP FastOpen is required, set tcpFastOpenServer, tcpFastOpenClient, and tcpFastOpenQueueSize
."}

...

{"t":{"$date":"2024-06-05T04:36:17.685+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn5","msg":"Slow query","attr":{"type":"command","ns":"fishtest_new.runs","command":{"aggregate":"runs","pipeline":[{"$match":{"finished":true}},{"$group":{"_id":1,"n":{"$sum":1}}}],"cursor":{},"lsid":{"id":{"$uuid":"5cac520f-58f1-4458-afb4-46e5152e8c3d"}},"$db":"fishtest_new"},"planSummary":"COUNT_SCAN { finished: 1, last_updated: -1 }","keysExamined":158850,"docsExamined":0,"cursorExhausted":true,"numYields":158,"nreturned":1,"queryHash":"E8053644","planCacheKey":"16E7000D","queryFramework":"classic","reslen":130,"locks":{"FeatureCompatibilityVersion":{"acquireCount":{"r":160}},"Global":{"acquireCount":{"r":160}},"Mutex":{"acquireCount":{"r":2}}},"storage":{},"remote":"127.0.0.1:48022","protocol":"op_msg","durationMillis":100}}
vdbergh commented 3 months ago

mongod.log seems to show that the MongoDB was up and running fine, then it was restarted by Joost, keeping working fine.

Interesting.

pserve was running at 100%. What could cause this? Is it true that both waitress and pyramid run inside the pserve process (and not as separate python processes)?

vondele commented 3 months ago

In the logs:

The run object 66605cb2c340c8eed7756a00 does not validate: run['workers'] (value:-71) is not greater than or equal to

(see also https://github.com/official-stockfish/fishtest/pull/2049#issuecomment-2150259524)

logs look like

Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]: 2024-06-05 14:44:20,635 ERROR [waitress][waitress-1] Exception while serving /api/update_task
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]: Traceback (most recent call last):
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/tweens.py", line 13, in _error_handle
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     response = request.invoke_exception_view(exc_info)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/view.py", line 786, in invoke_excepti
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     raise HTTPNotFound
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]: pyramid.httpexceptions.HTTPNotFound: The resource could not be found.
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]: During handling of the above exception, another exception occurred:
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]: Traceback (most recent call last):
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/waitress/channel.py", line 428, in service
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     task.service()
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/waitress/task.py", line 168, in service
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     self.execute()
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/waitress/task.py", line 436, in execute
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     app_iter = self.channel.server.application(environ, start_response)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/waitress/proxy_headers.py", line 64, in trans
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     return app(environ, start_response)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/router.py", line 270, in __call__
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     response = self.execution_policy(environ, self)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/router.py", line 276, in default_exec
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     return router.invoke_request(request)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/router.py", line 245, in invoke_reque
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     response = handle_request(request)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:                ^^^^^^^^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/tweens.py", line 43, in excview_tween
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     response = _error_handler(request, exc)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/tweens.py", line 17, in _error_handle
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     reraise(*exc_info)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/util.py", line 733, in reraise
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     raise value
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/tweens.py", line 41, in excview_tween
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     response = handler(request)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:                ^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/router.py", line 143, in handle_reque
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     response = _call_view(
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:                ^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/view.py", line 674, in _call_view
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     response = view_callable(context, request)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/config/views.py", line 170, in attr_v
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     return view(context, request)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:            ^^^^^^^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/config/views.py", line 196, in predic
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     return view(context, request)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:            ^^^^^^^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/viewderivers.py", line 427, in render
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     result = view(context, request)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:              ^^^^^^^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/viewderivers.py", line 113, in _class
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     response = getattr(inst, attr)()
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:                ^^^^^^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/fishtest/api.py", line 238, in update_task
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     result = self.request.rundb.update_task(
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/fishtest/rundb.py", line 1240, in update_task
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     return self.sync_update_task(worker_info, run_id, task_id, stats, spsa)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/fishtest/rundb.py", line 1358, in sync_update_task
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     self.stop_run(run_id)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/fishtest/rundb.py", line 1411, in stop_run
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     self.set_inactive_run(run)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/fishtest/rundb.py", line 143, in set_inactive_run
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     self.set_inactive_task(task_id, run)
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:   File "/home/fishtest/fishtest/server/fishtest/rundb.py", line 161, in set_inactive_task
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     self.connections_counter[remote_addr] -= 1
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]:     ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
Jun 05 14:44:20 tests.stockfishchess.org pserve[3751]: KeyError: 'a.b.c.d'
Jun 05 14:44:54 tests.stockfishchess.org pserve[3751]: The run object 66605cb2c340c8eed7756a00 does not validate: run['workers'] (value:-71) is not greater than or equal to
Jun 05 14:44:54 tests.stockfishchess.org pserve[3751]: Request_task: refresh queue

Where KeyError: 'a.b.c.d' is a valid-looking IP address.

This keyError is there a few 100 times in the log, with different IP addresses.

vdbergh commented 3 months ago

Yek. I wonder where that comes from.

vdbergh commented 3 months ago

Ah sorry: a,b,c,d are just numbers.

vondele commented 3 months ago

yes, sorry, I deleted the IP address.

The first instance of the errors in the log is:

Jun 05 00:45:46 tests.stockfishchess.org pserve[24666]: Request_task: the server is currently too busy...
Jun 05 00:45:46 tests.stockfishchess.org pserve[24666]: Request_task: the server is currently too busy...
Jun 05 00:45:48 tests.stockfishchess.org pserve[24666]: Request_task: There is already a worker running with name "technologov-32cores-amd3990x-c5ac" which 82 seconds ago sent an update for task 665f9d57fd11ae7170b484ab/2523 (my name is "technologov-32cores-amd3990x-f1ab")
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]: Request_task: the server is currently too busy...
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]: Request_task: the server is currently too busy...
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]: Request_task: the server is currently too busy...
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]: 2024-06-05 00:45:49,919 ERROR [waitress][waitress-2] Exception while serving /api/update_task
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]: Traceback (most recent call last):
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/tweens.py", line 13, in _error_handler
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]:     response = request.invoke_exception_view(exc_info)
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/pyramid/view.py", line 786, in invoke_exception_view
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]:     raise HTTPNotFound
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]: pyramid.httpexceptions.HTTPNotFound: The resource could not be found.
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]: During handling of the above exception, another exception occurred:
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]: Traceback (most recent call last):
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/waitress/channel.py", line 428, in service
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]:     task.service()
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/waitress/task.py", line 168, in service
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]:     self.execute()
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]:   File "/home/fishtest/fishtest/server/env/lib/python3.12/site-packages/waitress/task.py", line 436, in execute
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]:     app_iter = self.channel.server.application(environ, start_response)
....
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]:   File "/home/fishtest/fishtest/server/fishtest/api.py", line 238, in update_task
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]:     result = self.request.rundb.update_task(
Jun 05 00:45:49 tests.stockfishchess.org pserve[24666]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 05 00:45:50 tests.stockfishchess.org pserve[24666]:   File "/home/fishtest/fishtest/server/fishtest/rundb.py", line 1240, in update_task
Jun 05 00:45:50 tests.stockfishchess.org pserve[24666]:     return self.sync_update_task(worker_info, run_id, task_id, stats, spsa)
Jun 05 00:45:50 tests.stockfishchess.org pserve[24666]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 05 00:45:50 tests.stockfishchess.org pserve[24666]:   File "/home/fishtest/fishtest/server/fishtest/rundb.py", line 1358, in sync_update_task
Jun 05 00:45:50 tests.stockfishchess.org pserve[24666]:     self.stop_run(run_id)
Jun 05 00:45:50 tests.stockfishchess.org pserve[24666]:   File "/home/fishtest/fishtest/server/fishtest/rundb.py", line 1411, in stop_run
Jun 05 00:45:50 tests.stockfishchess.org pserve[24666]:     self.set_inactive_run(run)
Jun 05 00:45:50 tests.stockfishchess.org pserve[24666]:   File "/home/fishtest/fishtest/server/fishtest/rundb.py", line 143, in set_inactive_run
Jun 05 00:45:50 tests.stockfishchess.org pserve[24666]:     self.set_inactive_task(task_id, run)
Jun 05 00:45:50 tests.stockfishchess.org pserve[24666]:   File "/home/fishtest/fishtest/server/fishtest/rundb.py", line 161, in set_inactive_task
Jun 05 00:45:50 tests.stockfishchess.org pserve[24666]:     self.connections_counter[remote_addr] -= 1
Jun 05 00:45:50 tests.stockfishchess.org pserve[24666]:     ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
Jun 05 00:45:50 tests.stockfishchess.org pserve[24666]: KeyError: 'a.b.c.d'

where the IP address could correspond to the worker mentioned in There is already a worker running with name

vdbergh commented 3 months ago

I found another task["active"]=False ! In api.py. I forgot to look there.

However this is about stopped runs, which is not the case here.

vdbergh commented 3 months ago

Did this assertion trigger?

assert self.connections_counter[remote_addr] >= 0
vondele commented 3 months ago

not that I can find.

vdbergh commented 3 months ago

Hmm this seems impossible. I can understand why the run does not validate (task["active"]=False is set after the key error and hence is not executed). However I cannot understand how the key error can occur without triggering the Assertion first.

EDIT: I retract this.

vdbergh commented 3 months ago

Perhaps it is possible that the ip address can change during the execution of a task... ?

EDIT: I guess this is possible (load balancing perhaps). I think this means that we have to save the original ip_address in worker_info (unless we want to maintain another data structure).

vondele commented 3 months ago

it is possible, but probably rare, that the ip address changes... some ISP might change the IP every 24h or so.

vdbergh commented 3 months ago

Why don't we restrict the number of active tasks per user? Is there any downside to this? @ppigazzini @vondele

vondele commented 3 months ago

you mean instead of 'connections' from the same IP address? sure, actually think that makes more sense. We might need to raise the limit a bit for those users with a large number of workers (but different IP addresses).

vondele commented 3 months ago

thinking more about it, I guess this means we shouldn't be using the IP address at all?

vdbergh commented 3 months ago

thinking more about it, I guess this means we shouldn't be using the IP address at all?

Yes that's what I meant.

Meanwhile I fixed the current PR by disallowing ip changes. Since ip changes must be very rare this seems ok.

I propose to continue testing the PR in its current form to make sure the logic is correct. It would be trivial to change remote_addr to username.

vondele commented 3 months ago

I don't think you can 'disallow' IP changes, in that the user can do nothing about it, if his ISP has that policy.

vdbergh commented 3 months ago

Well it just means that the task fails. It can fail for any number of reasons.

EDIT: There will be a message in the server log. If it happens frequently then we can reconsider.

vondele commented 3 months ago

I guess we can log it and see how frequent it is. I guess a failed task is wasted compute, so we should try to minimize that.

Edit: edit crossed, I agree.

vdbergh commented 3 months ago

Currently I would like to make sure that the logic is correct.

vondele commented 3 months ago

image seems we can currently indeed have a negative number of cores assigned to a task. I guess that's just another side effect of the current issue.

vdbergh commented 3 months ago

Yes indeed. The exception occurs before task["alive"]=False but after the update of the aggregated data.

In the present PR I have exchanged the order, so that even if the connection logic has a bug and causes an exception, the aggregated data should be correct.

vdbergh commented 3 months ago

BTW the aggregated data is recomputed (and corrected if necessary) on server restart.

ppigazzini commented 3 months ago

pserve was running at 100%. What could cause this? Is it true that both waitress and pyramid run inside the pserve process (and not as separate python processes)?

pserve is a "loader" to set up a WSGI (in our case waitress) that loads the pyramid project. Few weeks ago, I ran PROD using gunicorn instead than waitress, here is the branch https://github.com/official-stockfish/fishtest/compare/master...ppigazzini:fishtest:gunicorn , nginxconfig needs a couple of tweaks for gunicorn, same for the fishtest systemd unit file.

After my test I think that gunicorn is a bad fit for fishtest. Because the run cache, I must setup 3 pserve instances with only one gunicorn worker of sync type. The semaphore for requests_task didn't work because gunicorn is single thread.

Testing gunicorn on PROD with the fleet we didn't view a lower CPU load wrt waitress

ppigazzini commented 3 months ago

PROD updated to https://github.com/official-stockfish/fishtest/pull/2050/commits/22d9e102701d066d5a2c2646094b86cf15045e4a

vdbergh commented 3 months ago

Alas https://tests.stockfishchess.org/api/get_run/66608972c340c8eed7757d02 now has negative cores/workers (and probably the other aggregated data is incorrect as well). The current PR checks and corrects the aggregated data for unfinished runs. But this run has finished.

I could make a temporary fix that also corrects specific unfinished runs as well.

vondele commented 3 months ago

The log now looks relatively clean.

sudo journalctl -u fishtest@6543 --since "30 minutes ago" | grep -v "Request_task: the server is currently" | grep -v "dead task: run: http" | grep -v "Failed_task: failur" | grep -v "Request_task: refresh queue"

-- Logs begin at Mon 2023-11-27 14:51:35 UTC, end at Wed 2024-06-05 22:11:06 UTC. --
Jun 05 21:42:49 tests.stockfishchess.org pserve[7723]: Validate_random_run: validated aggregated data in cache run 6660d7bb6489614cdad13d66...
Jun 05 21:45:57 tests.stockfishchess.org pserve[7723]: Validate_random_run: validated aggregated data in cache run 6660d8056489614cdad13d68...
Jun 05 21:48:57 tests.stockfishchess.org pserve[7723]: Validate_random_run: validated aggregated data in cache run 6660d8796489614cdad13d72...
Jun 05 21:50:16 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5326489614cdad13d53/192 is not active
Jun 05 21:50:16 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5326489614cdad13d53/79 is not active
Jun 05 21:50:19 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4467 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:50:19 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4467 is not active
Jun 05 21:50:25 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4516 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:50:25 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4113 has incompatible stats. Before 28. Now 28. SPSA_games 4.
Jun 05 21:50:30 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5326489614cdad13d53/52 is not active
Jun 05 21:50:31 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/187 is not active
Jun 05 21:50:40 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/162 is not active
Jun 05 21:50:54 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4276 has incompatible stats. Before 24. Now 24. SPSA_games 4.
Jun 05 21:51:00 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4573 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:51:05 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3429 has incompatible stats. Before 40. Now 40. SPSA_games 4.
Jun 05 21:51:05 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5326489614cdad13d53/174 is not active
Jun 05 21:51:06 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3429 is not active
Jun 05 21:51:07 tests.stockfishchess.org pserve[7723]: Update_task: task 66608709c340c8eed7757cf0/1068 is not active
Jun 05 21:51:14 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/45 is not active
Jun 05 21:51:14 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5326489614cdad13d53/195 is not active
Jun 05 21:51:15 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/192 is not active
Jun 05 21:51:15 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4340 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:51:16 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4340 is not active
Jun 05 21:51:19 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3984 has incompatible stats. Before 28. Now 28. SPSA_games 4.
Jun 05 21:51:19 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3984 is not active
Jun 05 21:51:21 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3379 is not active
Jun 05 21:51:22 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/140 is not active
Jun 05 21:51:24 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/202 is not active
Jun 05 21:51:25 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/79 is not active
Jun 05 21:51:26 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/153 is not active
Jun 05 21:51:30 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4549 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:51:30 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4276 is not active
Jun 05 21:51:30 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d41a6489614cdad13d2a/854 has incompatible stats. Before 84. Now 78. SPSA_games 0.
Jun 05 21:51:31 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5326489614cdad13d53/160 is not active
Jun 05 21:51:36 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4313 has incompatible stats. Before 24. Now 24. SPSA_games 4.
Jun 05 21:51:36 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3850 has incompatible stats. Before 28. Now 28. SPSA_games 4.
Jun 05 21:51:36 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/192 is not active
Jun 05 21:51:36 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4313 is not active
Jun 05 21:51:37 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3850 is not active
Jun 05 21:51:37 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/56 is not active
Jun 05 21:51:40 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3379 is not active
Jun 05 21:51:40 tests.stockfishchess.org pserve[7723]: Request_spsa: task 66605b49c340c8eed77569f7/4540 is not active
Jun 05 21:51:40 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/14 is not active
Jun 05 21:51:40 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5326489614cdad13d53/4 is not active
Jun 05 21:51:43 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/140 is not active
Jun 05 21:51:43 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/202 is not active
Jun 05 21:51:46 tests.stockfishchess.org pserve[7723]: Request_spsa: task 66605b49c340c8eed77569f7/4334 is not active
Jun 05 21:51:47 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5326489614cdad13d53/145 is not active
Jun 05 21:51:54 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3800 has incompatible stats. Before 32. Now 32. SPSA_games 4.
Jun 05 21:51:54 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3800 is not active
Jun 05 21:51:58 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3440 has incompatible stats. Before 40. Now 40. SPSA_games 4.
Jun 05 21:51:58 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3624 has incompatible stats. Before 40. Now 40. SPSA_games 4.
Jun 05 21:51:58 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/24 is not active
Jun 05 21:51:58 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3440 is not active
Jun 05 21:51:59 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3624 is not active
Jun 05 21:52:01 tests.stockfishchess.org pserve[7723]: Request_spsa: task 66605b49c340c8eed77569f7/4500 is not active
Jun 05 21:52:01 tests.stockfishchess.org pserve[7723]: Validate_random_run: validated aggregated data in cache run 6660dca56489614cdad13dad...
Jun 05 21:52:04 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/2 is not active
Jun 05 21:52:05 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/133 is not active
Jun 05 21:52:09 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/30 is not active
Jun 05 21:52:14 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5326489614cdad13d53/159 is not active
Jun 05 21:52:18 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4365 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:52:18 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4365 is not active
Jun 05 21:52:19 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/172 is not active
Jun 05 21:52:24 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3447 has incompatible stats. Before 40. Now 40. SPSA_games 4.
Jun 05 21:52:25 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3447 is not active
Jun 05 21:52:27 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4669 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:52:27 tests.stockfishchess.org pserve[7723]: Request_spsa: task 66605b49c340c8eed77569f7/4706 is not active
Jun 05 21:52:28 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3996 has incompatible stats. Before 28. Now 28. SPSA_games 4.
Jun 05 21:52:28 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4669 is not active
Jun 05 21:52:28 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3996 is not active
Jun 05 21:52:30 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/235 is not active
Jun 05 21:52:34 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3617 is not active
Jun 05 21:52:37 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4573 is not active
Jun 05 21:52:43 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4588 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:52:43 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/16 is not active
Jun 05 21:52:44 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4588 is not active
Jun 05 21:52:46 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/15 is not active
Jun 05 21:52:46 tests.stockfishchess.org pserve[7723]: Request_spsa: task 66605b49c340c8eed77569f7/4219 is not active
Jun 05 21:52:46 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/192 is not active
Jun 05 21:52:46 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4303 has incompatible stats. Before 24. Now 24. SPSA_games 4.
Jun 05 21:52:47 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4303 is not active
Jun 05 21:52:50 tests.stockfishchess.org pserve[7723]: Update_task: task 66608709c340c8eed7757cf0/1088 is not active
Jun 05 21:52:51 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/9 is not active
Jun 05 21:52:51 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4565 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:52:55 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4265 has incompatible stats. Before 28. Now 28. SPSA_games 4.
Jun 05 21:52:55 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4175 has incompatible stats. Before 24. Now 24. SPSA_games 4.
Jun 05 21:52:55 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/32 is not active
Jun 05 21:52:55 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4265 is not active
Jun 05 21:52:55 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4175 is not active
Jun 05 21:52:57 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4577 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:52:57 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4577 is not active
Jun 05 21:52:59 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/28 is not active
Jun 05 21:53:00 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/106 is not active
Jun 05 21:53:02 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/226 is not active
Jun 05 21:53:03 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/167 is not active
Jun 05 21:53:04 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4417 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:53:05 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4417 is not active
Jun 05 21:53:05 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5326489614cdad13d53/145 is not active
Jun 05 21:53:08 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/95 is not active
Jun 05 21:53:11 tests.stockfishchess.org pserve[7723]: Request_spsa: task 66605b49c340c8eed77569f7/3823 is not active
Jun 05 21:53:14 tests.stockfishchess.org pserve[7723]: Request_spsa: task 66605b49c340c8eed77569f7/4039 is not active
Jun 05 21:53:17 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3864 has incompatible stats. Before 28. Now 28. SPSA_games 4.
Jun 05 21:53:17 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/171 is not active
Jun 05 21:53:19 tests.stockfishchess.org pserve[7723]: Failed_task: task 6660cad8c340c8eed77581aa/2839 is not active
Jun 05 21:53:19 tests.stockfishchess.org pserve[7723]: Request_spsa: task 66605b49c340c8eed77569f7/3640 is not active
Jun 05 21:53:21 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/176 is not active
Jun 05 21:53:22 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4237 has incompatible stats. Before 28. Now 28. SPSA_games 4.
Jun 05 21:53:22 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/72 is not active
Jun 05 21:53:22 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/173 is not active
Jun 05 21:53:22 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4237 is not active
Jun 05 21:53:26 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4347 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:53:27 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4347 is not active
Jun 05 21:53:30 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/189 is not active
Jun 05 21:53:30 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/122 is not active
Jun 05 21:53:30 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3541 is not active
Jun 05 21:53:30 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/170 is not active
Jun 05 21:53:30 tests.stockfishchess.org pserve[7723]: Update_task: task 66608709c340c8eed7757cf0/1361 has incompatible stats. Before 224. Now 224. SPSA_games 28.
Jun 05 21:53:31 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4565 is not active
Jun 05 21:53:34 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/209 is not active
Jun 05 21:53:40 tests.stockfishchess.org pserve[7723]: Request_spsa: task 66605b49c340c8eed77569f7/4050 is not active
Jun 05 21:53:40 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/139 is not active
Jun 05 21:53:41 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/133 is not active
Jun 05 21:53:42 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4544 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:53:44 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4673 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:53:46 tests.stockfishchess.org pserve[7723]: Update_task: task 6660c982c340c8eed775818a/314 is not active
Jun 05 21:53:48 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/99 is not active
Jun 05 21:53:49 tests.stockfishchess.org pserve[7723]: Request_spsa: task 66605b49c340c8eed77569f7/3630 is not active
Jun 05 21:53:49 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4066 has incompatible stats. Before 28. Now 28. SPSA_games 4.
Jun 05 21:53:50 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4066 is not active
Jun 05 21:53:52 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4598 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:53:53 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4598 is not active
Jun 05 21:53:53 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/88 is not active
Jun 05 21:53:53 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/2 is not active
Jun 05 21:53:55 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3985 has incompatible stats. Before 32. Now 32. SPSA_games 4.
Jun 05 21:53:55 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3985 is not active
Jun 05 21:53:56 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/247 is not active
Jun 05 21:53:58 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/76 is not active
Jun 05 21:53:58 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d41a6489614cdad13d2a/854 is not active
Jun 05 21:53:58 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/171 is not active
Jun 05 21:53:59 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3982 has incompatible stats. Before 28. Now 28. SPSA_games 4.
Jun 05 21:54:00 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3982 is not active
Jun 05 21:54:00 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5326489614cdad13d53/145 is not active
Jun 05 21:54:00 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4543 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:54:00 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/147 is not active
Jun 05 21:54:01 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4170 has incompatible stats. Before 24. Now 24. SPSA_games 4.
Jun 05 21:54:01 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4543 is not active
Jun 05 21:54:01 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4170 is not active
Jun 05 21:54:04 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/22 is not active
Jun 05 21:54:04 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4486 has incompatible stats. Before 16. Now 16. SPSA_games 4.
Jun 05 21:54:04 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/46 is not active
Jun 05 21:54:05 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4486 is not active
Jun 05 21:54:05 tests.stockfishchess.org pserve[7723]: Request_spsa: task 66605b49c340c8eed77569f7/4673 is not active
Jun 05 21:54:08 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/211 is not active
Jun 05 21:54:10 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3895 has incompatible stats. Before 28. Now 28. SPSA_games 4.
Jun 05 21:54:10 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5326489614cdad13d53/203 is not active
Jun 05 21:54:10 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3895 is not active
Jun 05 21:54:10 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5326489614cdad13d53/181 is not active
Jun 05 21:54:11 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5626489614cdad13d55/176 is not active
Jun 05 21:54:12 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4277 has incompatible stats. Before 24. Now 24. SPSA_games 4.
Jun 05 21:54:13 tests.stockfishchess.org pserve[7723]: Request_task: There is already a worker running with name "ChessDBCN-8cores-326bb783-2edd" which 89 seconds ago sent an update for task 66605b49c340c8eed77569f7/4555 (my name is "ChessDBCN-8cores-326bb783-d7b2")
Jun 05 21:54:14 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/139 is not active
Jun 05 21:54:16 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/212 is not active
Jun 05 21:54:18 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/141 is not active
Jun 05 21:54:19 tests.stockfishchess.org pserve[7723]: Request_spsa: task 66605b49c340c8eed77569f7/4578 is not active
Jun 05 21:54:20 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/95 is not active
Jun 05 21:54:23 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/74 is not active
Jun 05 21:54:23 tests.stockfishchess.org pserve[7723]: Update_task: task 66608709c340c8eed7757cf0/1178 is not active
Jun 05 21:54:23 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4098 has incompatible stats. Before 28. Now 28. SPSA_games 4.
Jun 05 21:54:23 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4098 is not active
Jun 05 21:54:26 tests.stockfishchess.org pserve[7723]: Request_spsa: task 66605b49c340c8eed77569f7/4439 is not active
Jun 05 21:54:31 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/99 is not active
Jun 05 21:54:31 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3864 is not active
Jun 05 21:54:41 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/76 is not active
Jun 05 21:54:41 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4277 is not active
Jun 05 21:54:46 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/36 is not active
Jun 05 21:54:46 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3933 has incompatible stats. Before 32. Now 32. SPSA_games 4.
Jun 05 21:54:46 tests.stockfishchess.org pserve[7723]: Update_task: task 6660c982c340c8eed775818a/330 is not active
Jun 05 21:54:47 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3933 is not active
Jun 05 21:54:50 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/155 is not active
Jun 05 21:54:52 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3873 has incompatible stats. Before 28. Now 28. SPSA_games 4.
Jun 05 21:54:53 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4009 has incompatible stats. Before 28. Now 28. SPSA_games 4.
Jun 05 21:54:53 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d6a06489614cdad13d60/58 is not active
Jun 05 21:54:53 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/4009 is not active
Jun 05 21:54:55 tests.stockfishchess.org pserve[7723]: Validate_random_run: validated aggregated data in cache run 66605b49c340c8eed77569f7...
Jun 05 21:54:59 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/12 is not active
Jun 05 21:55:00 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/196 is not active
Jun 05 21:55:05 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3873 is not active
Jun 05 21:55:10 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d5326489614cdad13d53/181 is not active
Jun 05 21:55:16 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/95 is not active
Jun 05 21:55:29 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3864 is not active
Jun 05 21:55:47 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3686 has incompatible stats. Before 32. Now 32. SPSA_games 4.
Jun 05 21:55:47 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3686 is not active
Jun 05 21:57:47 tests.stockfishchess.org pserve[7723]: Validate_random_run: validated aggregated data in cache run 6660575ec340c8eed77569e9...
Jun 05 21:57:55 tests.stockfishchess.org pserve[7723]: Update_task: task 6660d65b6489614cdad13d5d/213 is not active
Jun 05 21:58:05 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3708 has incompatible stats. Before 32. Now 32. SPSA_games 4.
Jun 05 21:58:05 tests.stockfishchess.org pserve[7723]: Failed_task: task 66605b49c340c8eed77569f7/3708 is not active
Jun 05 21:58:14 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3553 is not active
Jun 05 22:00:49 tests.stockfishchess.org pserve[7723]: Validate_random_run: validated aggregated data in cache run 6660d5326489614cdad13d53...
Jun 05 22:02:06 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/3648 is not active
Jun 05 22:03:44 tests.stockfishchess.org pserve[7723]: Validate_random_run: validated aggregated data in cache run 665d665cfd45fb0f907c7778...
Jun 05 22:06:46 tests.stockfishchess.org pserve[7723]: Validate_random_run: validated aggregated data in cache run 6660dccc6489614cdad13db0...
Jun 05 22:07:56 tests.stockfishchess.org pserve[7723]: Update_task: task 66605b49c340c8eed77569f7/4544 is not active
Jun 05 22:09:47 tests.stockfishchess.org pserve[7723]: Validate_random_run: validated aggregated data in cache run 665d665cfd45fb0f907c7778...
vdbergh commented 3 months ago

@ppigazzini

It occurred to me that the naked exception can still do damage if it occurs while finishing a run.

I have now put it in a try except block with a server message.

vdbergh commented 3 months ago

@vondele Thanks. Most of these are pretty normal except the SPSA ones. There are some SPSA bugs in the worker which need to be understood.

ppigazzini commented 3 months ago

PROD updated with https://github.com/official-stockfish/fishtest/commit/549313e88301712f226d3fca9c8aff07d947e5a3

vdbergh commented 3 months ago

Any occurrences of print(f"Error while deleting connection: {str(e)}") in the logs?

vondele commented 3 months ago

No, I don't think so.

vondele commented 3 months ago
$ sudo journalctl -u fishtest@6543 --since "10 hours ago"  | grep -v "Request_task: the server is currently" | grep -v "dead task: run: http" | grep -v "Failed_task: failur" | grep -v "Request_task: refresh queue"  | grep -v 'Validate_random_run: validated aggregated data in cache'

-- Logs begin at Mon 2023-11-27 14:51:35 UTC, end at Thu 2024-06-06 09:52:39 UTC. --
Jun 06 00:13:51 tests.stockfishchess.org pserve[8828]: Run 6660e49c6489614cdad14e29 was not auto-purged. Message: No bad workers.
Jun 06 00:19:15 tests.stockfishchess.org pserve[8828]: Run 6660e4b26489614cdad1526d was not auto-purged. Message: No bad workers.
Jun 06 02:46:12 tests.stockfishchess.org pserve[8828]: /api/beat: Invalid remote_addr for task 665d665cfd45fb0f907c7778/3161. From task: a.b.c.d. From request: e.f.g.h
Jun 06 02:48:12 tests.stockfishchess.org pserve[8828]: /api/beat: Invalid remote_addr for task 665d665cfd45fb0f907c7778/3161. From task: a.b.c.d. From request: e.f.g.h.
Jun 06 02:48:54 tests.stockfishchess.org pserve[8828]: Update_task: task 6660ea1b6489614cdad1529c/1205 is not active
Jun 06 02:49:31 tests.stockfishchess.org pserve[8828]: /api/update_task: Invalid remote_addr for task 665d665cfd45fb0f907c7778/3161. From task: a.b.c.d. From request: e.f.g.h.
Jun 06 02:49:32 tests.stockfishchess.org pserve[8828]: /api/failed_task: Invalid remote_addr for task 665d665cfd45fb0f907c7778/3161. From task: a.b.c.d. From request: e.f.g.h.
Jun 06 03:08:02 tests.stockfishchess.org pserve[8828]: Run 6661267391e372763104b411 was not auto-purged. Message: No bad workers.
Jun 06 03:25:53 tests.stockfishchess.org pserve[8828]: Run 6661283c91e372763104b419 was not auto-purged. Message: No bad workers.
Jun 06 05:00:29 tests.stockfishchess.org pserve[8828]: Request_task: There is already a worker running with name "technologov-32cores-amd3990x-8461" which 108 seconds ago sent an update for task 6660ebb86489614cdad152a6/899 (my name is "technologov-32cores-amd3990x-2da4")
vdbergh commented 3 months ago
Jun 06 02:46:12 tests.stockfishchess.org pserve[8828]: /api/beat: Invalid remote_addr for task 665d665cfd45fb0f907c7778/3161. From task: a.b.c.d. From request: e.f.g.h
Jun 06 02:48:12 tests.stockfishchess.org pserve[8828]: /api/beat: Invalid remote_addr for task 665d665cfd45fb0f907c7778/3161. From task: a.b.c.d. From request: e.f.g.h.

So ip address changes during the execution of a task seem to be happening.

vondele commented 3 months ago

yes, the four instances of the warning are from the same ip 'switch' and make sense beat, beat, update_task, failed_task.

looking at the IPs the switch makes some sense as well (i.e. geo-located to the same region).