official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests

Split `flush_buffers()` #2048

Closed vdbergh closed 1 month ago

vdbergh commented 1 month ago

This is a PR on top of #2047.

It splits the convoluted `flush_buffers()` into three scheduled tasks.

This PR should also fix the deadlock reported in #2041.
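
For orientation, a minimal sketch of the idea (hypothetical code, not the actual fishtest scheduler in util.py): each job that the old flush_buffers() bundled together becomes its own periodically scheduled task. The helper, the job names and the periods below are assumptions; the thread later mentions roughly 1s for flushing and 60s for scavenging.

import threading

def schedule(period, func, stop_event):
    # Run func every `period` seconds on a daemon thread until stop_event is set.
    def loop():
        while not stop_event.wait(period):
            func()
    threading.Thread(target=loop, daemon=True).start()

def flush_oldest_run():
    pass  # write the oldest dirty cached run back to the DB

def clean_cache():
    pass  # evict cache entries that have not been used for a while

def scavenge_dead_tasks():
    pass  # deactivate tasks whose workers stopped reporting

stop = threading.Event()
schedule(1, flush_oldest_run, stop)      # cheap, runs often (assumed period)
schedule(60, clean_cache, stop)          # assumed period
schedule(60, scavenge_dead_tasks, stop)  # assumed period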

ppigazzini commented 1 month ago

Rebased on master after merging #2047 to fix the conflict.

ppigazzini commented 1 month ago

Looks good on DEV, testing on PROD now.

vdbergh commented 1 month ago

It seems dead tasks are no longer scavenged.

Unfortunately I will be afk till tonight.

ppigazzini commented 1 month ago

The last dead tasks in the log are from 3 hours ago:

fishtest@tests:~$ sudo journalctl -u fishtest@6543 --since "3 hours ago" | grep dead
Jun 01 08:36:23 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a8e00062b2c3cf814fe8d task_id: 789 worker: AndreasKrug-1cores-ea30eef7-1aaf
Jun 01 08:36:35 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a7eea062b2c3cf814fe40 task_id: 1629 worker: technologov-28cores-r314-7a2e
Jun 01 08:37:02 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9821062b2c3cf814fed8 task_id: 1415 worker: technologov-28cores-r306-e3ea
Jun 01 08:37:04 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9f87062b2c3cf814ff0e task_id: 1313 worker: technologov-28cores-r309-5bed
Jun 01 08:37:27 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665ada260223e235f05b7ce3 task_id: 18 worker: technologov-28cores-r308-72c0
Jun 01 08:37:36 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa743062b2c3cf814ff47 task_id: 2126 worker: technologov-28cores-r307-02fe
Jun 01 08:37:37 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9f87062b2c3cf814ff0e task_id: 1091 worker: technologov-28cores-r302-0374
Jun 01 08:37:37 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9f87062b2c3cf814ff0e task_id: 1553 worker: technologov-28cores-r303-41a1
Jun 01 08:37:38 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa4e1062b2c3cf814ff2c task_id: 1147 worker: technologov-28cores-r301-8b0f
Jun 01 08:37:54 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa4e1062b2c3cf814ff2c task_id: 1502 worker: technologov-28cores-r304-5a01
Jun 01 08:37:55 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665accd50223e235f05b7ca3 task_id: 487 worker: technologov-28cores-r300-1cab
Jun 01 08:38:29 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a7eea062b2c3cf814fe40 task_id: 901 worker: technologov-28cores-r305-855d
Jun 01 08:38:29 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a7eea062b2c3cf814fe40 task_id: 2027 worker: technologov-28cores-r310-cc32
Jun 01 08:39:01 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9ec2062b2c3cf814ff02 task_id: 600 worker: tolkki963-2cores-8b0a6e2c-0a2a
Jun 01 08:39:06 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a8e00062b2c3cf814fe8d task_id: 921 worker: tolkki963-2cores-5697b53c-fe0d
Jun 01 08:40:44 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a8e00062b2c3cf814fe8d task_id: 824 worker: tolkki963-2cores-dd850aff-7edf
fishtest@tests:~$ sudo journalctl -u fishtest@6543 --since "2 hours ago" | grep dead

ppigazzini commented 1 month ago

It seems that the scavenging of dead tasks was working with #2045 and #2046 but stopped with #2047.

ppigazzini commented 1 month ago

I restarted fishtest and this triggered a purge of dead tasks: https://tests.stockfishchess.org/actions?max_actions=1&action=dead_task&user=&text=&before=1717240903.881336&run_id=

vdbergh commented 1 month ago

Are there any exceptions in the server log?

ppigazzini commented 1 month ago

No

ppigazzini commented 1 month ago

I'm reloading the DB on DEV to have many dead tasks to purge :)

ppigazzini commented 1 month ago

The dead task purge on DEV looks fine with master and this PR after reloading the DB:

Jun 01 13:48:55 dfts-0 systemd[1]: Stopping Fishtest Server port 6543...
Jun 01 13:48:55 dfts-0 pserve[5465]: flush
Jun 01 13:48:55 dfts-0 pserve[5465]: done
Jun 01 13:49:00 dfts-0 systemd[1]: Stopped Fishtest Server port 6543.
Jun 01 13:49:34 dfts-0 systemd[1]: Started Fishtest Server port 6543.
Jun 01 13:49:37 dfts-0 pserve[6303]: Starting server in PID 6303.
Jun 01 13:50:07 dfts-0 pserve[6303]: Request_task: no useful cached runs left
Jun 01 13:50:07 dfts-0 pserve[6303]: Request_task: refresh queue
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 264 worker: technologov-28cores-r301-8b0f
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 334 worker: vdv-7cores-473c6083-3c48*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 335 worker: vdv-7cores-80cd7727-9c82*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 336 worker: vdv-7cores-fa3ade53-c640*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 337 worker: vdv-7cores-09497247-a19f*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 338 worker: vdv-7cores-99126dee-bd34*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 339 worker: vdv-7cores-3c7a73a0-b78b*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 340 worker: vdv-7cores-1c7e40a2-852e*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 341 worker: vdv-7cores-d8fc0606-d410*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 342 worker: vdv-7cores-c3bfdbb7-45fb*

ppigazzini commented 1 month ago

The dead task purge seems fine on PROD as well; there are new entries:

fishtest@tests:~$ sudo journalctl -u fishtest@6543 --since "2 hours ago" | grep dead
Jun 01 11:21:43 tests.stockfishchess.org pserve[17770]: dead task: run: https://tests.stockfishchess.org/tests/view/665af82c69d88d67b62fd9a7 task_id: 159 worker: technologov-28cores-r310-b4cc
Jun 01 11:23:49 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665afdbf45962393aa9c1c6b task_id: 6 worker: vdbergh-1cores-22f29ecf-9f53
Jun 01 11:36:49 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ad97f0223e235f05b7ccc task_id: 1113 worker: technologov-28cores-r306-7901
Jun 01 11:51:51 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ae6320223e235f05b7d14 task_id: 469 worker: technologov-28cores-r303-3b39
Jun 01 11:51:51 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ae6320223e235f05b7d14 task_id: 1004 worker: technologov-28cores-r300-fba5
Jun 01 11:51:51 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ae6320223e235f05b7d14 task_id: 1024 worker: technologov-28cores-r307-9032
Jun 01 11:51:51 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ae6320223e235f05b7d14 task_id: 1034 worker: technologov-28cores-r309-aaac

vdbergh commented 1 month ago

Are the validated random run messages still in the logs? Just checking that the scheduler had not stopped.

ppigazzini commented 1 month ago

fishtest@tests:~$ sudo journalctl -u fishtest@6543 --since "2 hours ago" | grep validate
Jun 01 09:58:05 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665a8e00062b2c3cf814fe8d...
Jun 01 10:01:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665aa766062b2c3cf814ff49...
Jun 01 10:04:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665aeee00223e235f05b7d21...
Jun 01 10:07:05 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665a9ec2062b2c3cf814ff02...
Jun 01 10:10:10 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:13:11 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665a7eea062b2c3cf814fe40...
Jun 01 10:16:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665aa766062b2c3cf814ff49...
Jun 01 10:19:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ae4010223e235f05b7d0b...
Jun 01 10:21:55 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:25:01 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:28:10 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ae6320223e235f05b7d14...
Jun 01 10:31:06 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ada260223e235f05b7ce3...
Jun 01 10:34:09 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ae4010223e235f05b7d0b...
Jun 01 10:37:01 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:40:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:43:11 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:46:20 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ae7660223e235f05b7d17...
Jun 01 10:49:15 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ab2b2062b2c3cf814ff8a...
Jun 01 10:53:58 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665a9ec2062b2c3cf814ff02...
Jun 01 10:57:03 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665af82c69d88d67b62fd9a7...
Jun 01 11:00:11 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665ae2d90223e235f05b7d04...
Jun 01 11:03:11 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665ae6320223e235f05b7d14...
Jun 01 11:06:11 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665afdbf45962393aa9c1c6b...
Jun 01 11:09:15 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665ab2b2062b2c3cf814ff8a...
Jun 01 11:12:13 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665a9821062b2c3cf814fed8...
Jun 01 11:15:10 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665a64b0062b2c3cf814fdab...
Jun 01 11:18:17 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665af82c69d88d67b62fd9a7...
Jun 01 11:21:26 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665a9ec2062b2c3cf814ff02...
Jun 01 11:24:51 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665b04d38da109e362924e2a...
Jun 01 11:27:44 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ab2b2062b2c3cf814ff8a...
Jun 01 11:30:36 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ae2d90223e235f05b7d04...
Jun 01 11:33:41 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665b00cd45962393aa9c1c75...
Jun 01 11:36:43 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ae2d90223e235f05b7d04...
Jun 01 11:39:44 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ad97f0223e235f05b7ccc...
Jun 01 11:42:42 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ae1a30223e235f05b7cfd...
Jun 01 11:45:39 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665a8e00062b2c3cf814fe8d...
Jun 01 11:48:44 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665a8e00062b2c3cf814fe8d...
Jun 01 11:51:41 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665af82c69d88d67b62fd9a7...
Jun 01 11:54:39 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665b04f78da109e362924e2d...
Jun 01 11:57:31 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665a8e00062b2c3cf814fe8d...

vdbergh commented 1 month ago

Hmm... seems normal...

vdbergh commented 1 month ago

Dead task collection seems to be working now...

ppigazzini commented 1 month ago

I'm waiting for the DB backup to finish before stopping fishtest for 10 minutes to tidy up the runs collection; that should create many dead tasks.

vondele commented 1 month ago

I've done some sampling under the current load, and I get this profile:

===============================================================
Collecting 600 samples for pserve 6543 spaced 0.5 seconds apart
Total time needed 300 seconds 
Found pid 22791 for pserve 6543
Showing 10 most frequent samples traces: 
217 
     File "util.py", line 749, in __next_schedule self.__event.wait(delay) 
16 
     File "api.py", line 237, in update_task self.validate_request() 
          File "api.py", line 129, in validate_request run = self.request.rundb.get_run(run_id) 
               File "rundb.py", line 417, in get_run with self.run_cache_lock: 
13 
     File "util.py", line 751, in __next_schedule next_task.do_work() 
          File "util.py", line 634, in do_work self.worker(*self.args, *self.kwargs) 
               File "rundb.py", line 497, in flush_buffers self.runs.replace_one({"_id": oldest_run["_id"]}, oldest_run) 
12 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 802, in request_task with self.request_task_lock: 
11 
     File "api.py", line 308, in beat self.validate_request() 
          File "api.py", line 129, in validate_request run = self.request.rundb.get_run(run_id) 
               File "rundb.py", line 417, in get_run with self.run_cache_lock: 
7 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 803, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 916, in sync_request_task task_name = worker_name(task["worker_info"], short=True) 
                    File "util.py", line 47, in worker_name name = "{}-{}cores".format(username, cores) 
6 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 803, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 905, in sync_request_task (run, task_id, task) 
5 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 803, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 916, in sync_request_task task_name = worker_name(task["worker_info"], short=True) 
                    File "util.py", line 41, in worker_name def worker_name(worker_info, short=False): 
4 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 803, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 947, in sync_request_task if connections >= connections_limit: 
4 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 803, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 916, in sync_request_task task_name = worker_name(task["worker_info"], short=True) 
                    File "util.py", line 44, in worker_name cores = str(worker_info["concurrency"])

One of the questions I have is how long the `flush_buffers` call actually takes right now, to understand whether it is a significant part of the server load and whether we should pick a schedule different from 1/s. Would it be possible to instrument the server to print the time taken by exactly that function?

ppigazzini commented 1 month ago

Done:

diff --git a/server/fishtest/rundb.py b/server/fishtest/rundb.py
index 2e9814b2..007f4311 100644
--- a/server/fishtest/rundb.py
+++ b/server/fishtest/rundb.py
@@ -481,6 +481,7 @@ class RunDb:
     # For documentation of the cache format see "cache_schema" in schemas.py.

     def flush_buffers(self):
+        now = time.time()
         oldest_entry = None
         old = float("inf")
         with self.run_cache_lock:
@@ -495,6 +496,7 @@ class RunDb:
                 oldest_entry["last_sync_time"] = time.time()
                 with self.run_cache_write_lock:
                     self.runs.replace_one({"_id": oldest_run["_id"]}, oldest_run)
+        print(f"flush_buffers: {time.time() - now}", flush=True)

     def clean_cache(self):
         now = time.time()
Jun 01 19:22:56 tests.stockfishchess.org systemd[1]: Started Fishtest Server port 6543.
Jun 01 19:22:59 tests.stockfishchess.org pserve[25750]: Starting server in PID 25750.
Jun 01 19:22:59 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.1885673999786377
Jun 01 19:23:00 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.01155710220336914
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.02557539939880371
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: no useful cached runs left
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: refresh queue
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:02 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.031136274337768555
Jun 01 19:23:03 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.020483016967773438
Jun 01 19:23:03 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:04 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.17199039459228516
Jun 01 19:23:05 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.016495466232299805
Jun 01 19:23:05 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:05 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:06 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.013139009475708008
Jun 01 19:23:07 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.007906436920166016
Jun 01 19:23:08 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.020139694213867188
Jun 01 19:23:09 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:09 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.0214385986328125
Jun 01 19:23:10 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.06426382064819336

vondele commented 1 month ago

OK, so probably 0.02s in the regular case, and 0.17s when waiting on a lock from request_task. So running it every second seems reasonable.

ppigazzini commented 1 month ago

I added the missing s, flush_buffers

vondele commented 1 month ago

Little histogram:

[two histogram images]

It actually has a bit of a tail.

Actual tail:

$ sudo journalctl -u fishtest@6543 --since "30 minutes ago" | grep flush_b | awk '{print $NF}'  | sort -n | tail
0.3540463447570801
0.3559889793395996
0.3574235439300537
0.376375675201416
0.38812804222106934
0.3957796096801758
0.4064204692840576
0.44736409187316895
0.5016348361968994
0.6342136859893799

I would be a bit worried if the tail exceeded 1s; in that case we would probably have a new flush_buffers scheduled before the old one finishes. This is actually where I think it would make sense to schedule the next call 1s after the previous one finishes, rather than every 1s. Alternatively, we could extend the window a little, to e.g. 2 or 5s.

vdbergh commented 1 month ago

I would be a bit worried if the tail would exceed 1s, in that case we have probably a new flush_buffers scheduled before the old one finishes.

The new flush_buffers would actually be executed right after the previous one, not in parallel with it.

It would be simple though to optionally not count the execution time of the task during scheduling. I was looking for a good name for that option.

vdbergh commented 1 month ago

@ppigazzini

I have been afk for the better part of the day.

Are you saying that in fact there is no problem with the collection of dead tasks in the PR?

ppigazzini commented 1 month ago

From the logs and the instrumented code, the scavenging of dead tasks seems fine.

vdbergh commented 1 month ago

I created this dead task with a worker https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646?show_task=902 (at the time of posting, last update 10min ago).

The dead task appears not to be collected.

vdbergh commented 1 month ago

I think the PR has a race condition (which you also pointed out): setting is_changed=True and then making the change outside the lock.

I don't see how to trigger this, but who knows...

The correct approach is to reacquire the lock and then set is_changed to True.
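
For illustration only (hypothetical names, not the actual rundb.py code), the difference between the racy pattern and the corrected one looks roughly like this:

import threading

run_cache = {}                      # run_id -> {"run": {...}, "is_changed": bool}
run_cache_lock = threading.Lock()

# Racy: the flag is set under the lock, but the actual change happens outside it,
# so a flush can observe is_changed == True before the run is actually modified.
def buffer_racy(run_id, mutate):
    with run_cache_lock:
        run_cache[run_id]["is_changed"] = True
    mutate(run_cache[run_id]["run"])

# Corrected: (re)acquire the lock, make the change, then set is_changed,
# all while holding the lock.
def buffer_safe(run_id, mutate):
    with run_cache_lock:
        entry = run_cache[run_id]
        mutate(entry["run"])
        entry["is_changed"] = True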

ppigazzini commented 1 month ago

I analyzed a snippet of the log, and the number of dead tasks deleted was equal to the internal count (taken before the release of the lock). But it seems the code can sometimes be racy.

vdbergh commented 1 month ago

I will update the PR, removing the race condition.

vdbergh commented 1 month ago

My dead task was finally collected https://tests.stockfishchess.org/actions?max_actions=1&action=dead_task&user=&text=vdbergh&before=1717274816.783141&run_id=

It took much longer than 6min though. Perhaps this is an artifact of the high server load?

ppigazzini commented 1 month ago

It was scavenged in this block:

Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: scavenge_dead_tasks: start
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: scavenge_dead_tasks: get lock
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: run_cache count: 46
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: scavenge_dead_tasks: release lock
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead_tasks_count=524
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 252 worker: vdv-7cores-e6adc6b1-5163*
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 254 worker: vdv-7cores-5e50707f-c4d2*
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 257 worker: vdv-7cores-4b019af9-c187*
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 259 worker: vdv-7cores-1456a905-8a5a*
...
Jun 01 20:46:56 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 902 worker: vdbergh-1cores-22f29ecf-acb7
vdbergh commented 1 month ago

I think I understand what the problem is: I was only scavenging clean cache entries. However, under heavy load a cache entry will almost never be clean, so it will not be seen by the scavenging task.

Fixed in the last push (I also removed the race condition).
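
Roughly, the fix amounts to something like the sketch below (hypothetical field names and timeout, not the actual commit): the scan no longer skips cache entries whose is_changed flag is set.

import time

DEAD_AFTER = 360  # assumed timeout in seconds; the thread suggests dead tasks appear after roughly 6 min

def scavenge_dead_tasks(run_cache, run_cache_lock):
    # Sketch: inspect every cached run, clean or dirty, for tasks that stopped updating.
    now = time.time()
    with run_cache_lock:
        for entry in run_cache.values():
            run = entry["run"]
            for task_id, task in enumerate(run["tasks"]):
                if task.get("active") and now - task.get("last_updated", now) > DEAD_AFTER:
                    task["active"] = False
                    entry["is_changed"] = True  # picked up by the next flush
                    print(f"dead task: run: {run['_id']} task_id: {task_id}")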

ppigazzini commented 1 month ago

PROD updated with https://github.com/official-stockfish/fishtest/pull/2048/commits/64874618d8eab314d51341c3b091d627e1e58401

vdbergh commented 1 month ago

Problem fixed it seems :)

https://tests.stockfishchess.org/actions?max_actions=1&action=dead_task&user=&text=&before=1717277250.57244&run_id=

vdbergh commented 1 month ago

It would be simple though to optionally not count the execution time of the task during scheduling. I was looking for a good name for that option.

I think I will call it stretch (i.e. stretch the period with the duration of the task). It would be a useful option for tasks that are scheduled with a short period.
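
A rough sketch of what such a stretch option could look like for a periodic task (hypothetical scheduler code, not the actual util.py): with stretch the next run starts a full period after the previous one finishes, so a slow execution pushes the following one back instead of being chained immediately.

import threading
import time

def periodic(period, func, stretch=False, stop=None):
    # stretch=False: aim for a fixed interval between *starts* of func.
    # stretch=True:  wait a full period after func *finishes* before the next call,
    #                i.e. the period is stretched by the duration of the task.
    stop = stop or threading.Event()

    def loop():
        next_time = time.time() + period
        while not stop.is_set():
            if stretch:
                if stop.wait(period):
                    break
                func()
            else:
                if stop.wait(max(0.0, next_time - time.time())):
                    break
                func()
                next_time += period

    threading.Thread(target=loop, daemon=True).start()
    return stop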

vondele commented 1 month ago

It would be simple though to optionally not count the execution time of the task during scheduling. I was looking for a good name for that option.

I think I will call it stretch (i.e. stretch the period with the duration of the task). It would be a useful option for tasks that are scheduled with a short period.

If it is a bool, 'guaranteeWait' or 'enforceGap' or similar.

vdbergh commented 1 month ago

@vondele Could you perhaps also calculate the average duration of flush_buffers()?

vondele commented 1 month ago

# total number of data points = 1799
# minimum of the data = 0.003052
# maximum of the data = 0.634214
# average of the data = 0.063830
# Binned median  = 0.045000
# Binned average = 0.063883
# Binned mode    = 0.035000
# binned variance   = 0.004938
# binned sigma      = 0.070271
# binned MAD        = 0.042839
# binned skew   = 2.963915
# binned kurt   = 10.158041
# general moments = 0.063830 0.009021 0.002240 0.000723 0.000271 0.000115 0.000054 0.000027 0.000015 0.000008 
# central moments = 0.000000 0.004947 0.001032 0.000322 0.000113 0.000045 0.000020 0.000010 0.000005 0.000003 
# invariant moments = 0.063830 0.004947 0.001032 0.000248 0.000062 0.000014 0.000002 -0.000001 -0.000001 -0.000001 
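
For reference, similar summary statistics can be pulled from the journal with a few lines of Python (a sketch, not the tool used to produce the numbers above; it assumes log lines of the form "flush_buffers: 0.0123" as shown earlier):

import re
import statistics
import subprocess

# May need sufficient privileges to read the journal (sudo).
out = subprocess.run(
    ["journalctl", "-u", "fishtest@6543", "--since", "30 minutes ago"],
    capture_output=True, text=True, check=True,
).stdout

times = [float(m.group(1)) for m in re.finditer(r"flush_buffers: ([0-9.]+)", out)]
if times:
    print(f"n={len(times)}  min={min(times):.6f}  max={max(times):.6f}  "
          f"avg={statistics.mean(times):.6f}  median={statistics.median(times):.6f}")
else:
    print("no flush_buffers timings found")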

ppigazzini commented 1 month ago

Some data with 10k cores, based on 1969 measurements. It appears that the tail is not dependent on waiting for the lock.

diff --git a/server/fishtest/rundb.py b/server/fishtest/rundb.py
index 3097575b..e22dff5f 100644
--- a/server/fishtest/rundb.py
+++ b/server/fishtest/rundb.py
@@ -481,9 +481,11 @@ class RunDb:
     # For documentation of the cache format see "cache_schema" in schemas.py.

     def flush_buffers(self):
+        t_start = time.time()
         oldest_entry = None
         old = float("inf")
         with self.run_cache_lock:
+            t_lock = time.time()
             for cache_entry in self.run_cache.values():
                 run = cache_entry["run"]
                 if cache_entry["is_changed"] and cache_entry["last_sync_time"] < old:
@@ -495,6 +497,10 @@ class RunDb:
                 oldest_entry["last_sync_time"] = time.time()
                 with self.run_cache_write_lock:
                     self.runs.replace_one({"_id": oldest_run["_id"]}, oldest_run)
+        t_stop = time.time()
+        print(f"flush_buffers: total {t_stop - t_start}", flush=True)
+        print(f"flush_buffers: wait {t_lock - t_start}", flush=True)
+        print(f"flush_buffers: with {t_stop - t_lock}", flush=True)

          total         wait          with
min       0.001279116   2.38419E-06   0.001270294
max       0.282125473   0.011865139   0.282114506
avg       0.018602283   1.91109E-05   0.018583172
stdev.p   0.034925572   0.000275271   0.034926888
median    0.007624388   1.04904E-05   0.00760293

[three images]

vdbergh commented 1 month ago

Thanks for the stats. From the average duration, it appears that flush_buffers() does not load the system very much.

Also I guess that almost all time is spent writing to the db, during which the GIL is released.

vondele commented 1 month ago

Trying with some heavier load on the system right now. I think this has already improved the server, as the loops over the tasks in the scavenge function are now executed only every 60s.

ppigazzini commented 1 month ago

Code instrumented to view the waiting time for the lock under high load.

vondele commented 1 month ago

current stats for local workers:

request_spsa 1.25719
request_task 136.976
request_version 0.208234
update_task 3.55331
upload_pgn 32.1143

ppigazzini commented 1 month ago

Last 1800 timings taken under high load (75k cores):

          total         wait
min       0.00143981    2.14577E-06
max       0.707242489   0.144560814
avg       0.06804863    0.000234107
stdev.p   0.066657078   0.004019203
median    0.048844218   7.15256E-06

[two images]

ppigazzini commented 1 month ago

Last 3600 timings under high load (75k cores); the max total time passed the 1-second threshold:

          total         wait
min       0.00143981    2.14577E-06
max       1.01291585    0.325762033
avg       0.07342107    0.00028337
stdev.p   0.079427364   0.006401669
median    0.049941778   7.15256E-06

[two images]