official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests
281 stars 129 forks source link

High CPU load after fleet visit #1643

Open vondele opened 1 year ago

vondele commented 1 year ago

image The slowness appears to be due to update_task being 10x slower than before the fleet visit. Profiling the prod server we see the following profile:

--
Apr 28 20:23:08 tests pserve[30368]:    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
Apr 28 20:23:08 tests pserve[30368]:         1    0.000    0.000    0.043    0.043 /home/fishtest/fishtest/server/env/lib/python3.11/site-packages/pyramid/viewderivers.py:426(rendered_view)
Apr 28 20:23:08 tests pserve[30368]:         1    0.000    0.000    0.042    0.042 /home/fishtest/fishtest/server/env/lib/python3.11/site-packages/pyramid/viewderivers.py:107(_class_requestonly_view)
Apr 28 20:23:08 tests pserve[30368]:         1    0.000    0.000    0.042    0.042 /home/fishtest/fishtest/server/fishtest/api.py:374(update_task)
Apr 28 20:23:08 tests pserve[30368]:         1    0.000    0.000    0.042    0.042 /home/fishtest/fishtest/server/fishtest/rundb.py:1013(update_task)
Apr 28 20:23:08 tests pserve[30368]:         1    0.000    0.000    0.042    0.042 /home/fishtest/fishtest/server/fishtest/rundb.py:1018(sync_update_task)
Apr 28 20:23:08 tests pserve[30368]:         1    0.027    0.027    0.041    0.041 /home/fishtest/fishtest/server/fishtest/rundb.py:560(get_results)
Apr 28 20:23:08 tests pserve[30368]:      5760    0.009    0.000    0.009    0.000 /home/fishtest/fishtest/server/fishtest/rundb.py:579(<listcomp>)
Apr 28 20:23:08 tests pserve[30368]:     11546    0.003    0.000    0.003    0.000 {method 'get' of 'dict' objects}
Apr 28 20:23:08 tests pserve[30368]:      5760    0.001    0.000    0.001    0.000 {method 'keys' of 'dict' objects}
Apr 28 20:23:08 tests pserve[30368]:         1    0.000    0.000    0.001    0.001 /home/fishtest/fishtest/server/fishtest/stats/stat_util.py:195(update_SPRT)
Apr 28 20:23:08 tests pserve[30368]:         1    0.000    0.000    0.001    0.001 /home/fishtest/fishtest/server/fishtest/stats/LLRcalc.py:265(LLR_normalized)
Apr 28 20:23:08 tests pserve[30368]:         1    0.000    0.000    0.001    0.001 /home/fishtest/fishtest/server/fishtest/stats/LLRcalc.py:141(LLR)
Apr 28 20:23:08 tests pserve[30368]:         1    0.000    0.000    0.001    0.001 /home/fishtest/fishtest/server/fishtest/stats/LLRcalc.py:128(LLRjumps)
Apr 28 20:23:08 tests pserve[30368]:         1    0.000    0.000    0.001    0.001 /home/fishtest/fishtest/server/fishtest/stats/LLRcalc.py:132(<listcomp>)
Apr 28 20:23:08 tests pserve[30368]:         2    0.000    0.000    0.001    0.000 /home/fishtest/fishtest/server/fishtest/stats/LLRcalc.py:72(MLE_t_value)
Apr 28 20:23:08 tests pserve[30368]:         1    0.000    0.000    0.000    0.000 /home/fishtest/fishtest/server/fishtest/api.py:155(validate_request)
Apr 28 20:23:08 tests pserve[30368]:         1    0.000    0.000    0.000    0.000 /home/fishtest/fishtest/server/env/lib/python3.11/site-packages/pyramid/renderers.py:433(render_view)
Apr 28 20:23:08 tests pserve[30368]:         7    0.000    0.000    0.000    0.000 /home/fishtest/fishtest/server/fishtest/stats/LLRcalc.py:24(secular)
Apr 28 20:23:08 tests pserve[30368]:         1    0.000    0.000    0.000    0.000 /home/fishtest/fishtest/server/env/lib/python3.11/site-packages/pyramid/renderers.py:465(render_to_response)
Apr 28 20:23:08 tests pserve[30368]:         7    0.000    0.000    0.000    0.000 /home/fishtest/fishtest/server/env/lib/python3.11/site-packages/scipy/optimize/_zeros_py.py:658(brentq)

I.e. get_results is very expensive.

vondele commented 1 year ago

the cause is the iteration over all tasks in get_results. Probably update_task should result in some update_results call, i.e. must be able to execute without a full loop over all tasks.

Basically, the current scheme leads to a load on the server that is quadratic in the number of workers. The rate of update_task increasing linearly with the number of workers, and the task lists grow with the number of workers, the resulting load is quadratic.

Additionally, many failed workers (and longer tasks lists), make this problem even worse.

vdbergh commented 1 year ago

It should be easy to do it incrementally. Of course get_results is absolutely vital for the primary function of Fishtest so I would keep some sanity checks for a while (like computing the results from the task list when the run finishes and comparing them with the incremental results).

noobpwnftw commented 1 year ago

If I read this right, here the oldest run in run_cache gets scavenged and flushed every second if they receive any update? https://github.com/glinscott/fishtest/blob/6be811d340d87e0430d63af73a8daf50ca490b1d/server/fishtest/rundb.py#L403

vdbergh commented 1 year ago

I think so (Tom told me this a long time ago, I haven't read the code).

vdbergh commented 1 year ago

The idea is that the main page should see reasonably up date data.

noobpwnftw commented 1 year ago

Yeah, but it probably doesn't need scavenge that often?

noobpwnftw commented 1 year ago

https://tests.stockfishchess.org/tests/view/644bfb62f609e0c9f1579ec0

This page takes too long to load.

ppigazzini commented 1 year ago

https://tests.stockfishchess.org/tests/view/644bfb62f609e0c9f1579ec0

This page takes too long to load.

8243 tasks, 50% with 0 games played.

vdbergh commented 1 year ago

Is it caused by mongodb or by the mako rendering?

tomtor commented 1 year ago

Yes, that is true :)

It is busy on the stockfish mailing list these days.

If something is not clear, please ask and I'll try to answer.

Op za 29 apr. 2023 16:31 schreef vdbergh @.***>:

I think so (Tom told me this a long time ago, I haven't read the code).

— Reply to this email directly, view it on GitHub https://github.com/glinscott/fishtest/issues/1643#issuecomment-1528801799, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAINTBENLI4PDLG3P75OOFTXDUQ4NANCNFSM6AAAAAAXPVT5GQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

ppigazzini commented 1 year ago

Is it caused by mongodb or by the mako rendering?

Is should be the rendering. The stats page gets the whole run as well but it loads fast. https://tests.stockfishchess.org/tests/stats/644bfb62f609e0c9f1579ec0

https://github.com/glinscott/fishtest/blob/af3a98f8d1730d2cf055d35a7a83b59e2e44204b/server/fishtest/views.py#L1154-L1158

noobpwnftw commented 1 year ago

Production server hang with flood of /tests/view/644bfb62f609e0c9f1579ec0, as a workaround until experts fix it,

diff --git a/server/fishtest/views.py b/server/fishtest/views.py
index 8b79f86..6c4db73 100644
--- a/server/fishtest/views.py
+++ b/server/fishtest/views.py
@@ -1290,6 +1290,9 @@ def tests_view(request):
     chi2 = get_chi2(run["tasks"])
     update_residuals(run["tasks"], cached_chi2=chi2)

+    run = copy.copy(run)
+    run["tasks"] = []
+
     try:
         show_task = int(request.params.get("show_task", -1))
     except:
vdbergh commented 1 year ago

I guess the solution should be to use pagination in the task list (e.g. using the example of the event log). It would be a bit of an extra hassle to make it compatibility with show_task.

vdbergh commented 1 year ago

Not related to this issue but there are some strange things in tests/view

https://github.com/glinscott/fishtest/blob/af3a98f8d1730d2cf055d35a7a83b59e2e44204b/server/fishtest/views.py#L1282-L1287

Why is last_updated changed when you view the test?

peregrineshahin commented 1 year ago

Not related to this issue but there are some strange things in tests/view

https://github.com/glinscott/fishtest/blob/af3a98f8d1730d2cf055d35a7a83b59e2e44204b/server/fishtest/views.py#L1282-L1287

Why is last_updated changed when you view the test?

That I don't understand, nor do I understand why it changes when "beat" although there were no games update and the lastupdated is used in machines calculations.

peregrineshahin commented 1 year ago

Production server hang with flood of /tests/view/644bfb62f609e0c9f1579ec0, as a workaround until experts fix it,

diff --git a/server/fishtest/views.py b/server/fishtest/views.py
index 8b79f86..6c4db73 100644
--- a/server/fishtest/views.py
+++ b/server/fishtest/views.py
@@ -1290,6 +1290,9 @@ def tests_view(request):
     chi2 = get_chi2(run["tasks"])
     update_residuals(run["tasks"], cached_chi2=chi2)

+    run = copy.copy(run)
+    run["tasks"] = []
+
     try:
         show_task = int(request.params.get("show_task", -1))
     except:

your added lines assume the problem is in the rendering, are we really sure about that?

ppigazzini commented 1 year ago

Thank you noob! Not reviewed yet, still coffeeless. PROD updated on the fly with:

diff --git a/server/fishtest/views.py b/server/fishtest/views.py
index 8b79f86..3e6796f 100644
--- a/server/fishtest/views.py
+++ b/server/fishtest/views.py
@@ -1290,6 +1290,10 @@ def tests_view(request):
     chi2 = get_chi2(run["tasks"])
     update_residuals(run["tasks"], cached_chi2=chi2)

+    if len(run["tasks"]) > 1000:
+        run = copy.deepcopy(run)
+        run["tasks"] = []
+
     try:
         show_task = int(request.params.get("show_task", -1))
     except:

EDIT_000: Switched to:

diff --git a/server/fishtest/views.py b/server/fishtest/views.py
index 8b79f86..3e6796f 100644
--- a/server/fishtest/views.py
+++ b/server/fishtest/views.py
@@ -1290,6 +1290,10 @@ def tests_view(request):
     chi2 = get_chi2(run["tasks"])
     update_residuals(run["tasks"], cached_chi2=chi2)

+    run = copy.deepcopy(run)
+    run["tasks"] = []
+
     try:
         show_task = int(request.params.get("show_task", -1))
     except:

EDIT_001

Switched back to copy.copy(), PROD hangs with copy.deepcopy()

diff --git a/server/fishtest/views.py b/server/fishtest/views.py
index 8b79f86..3e6796f 100644
--- a/server/fishtest/views.py
+++ b/server/fishtest/views.py
@@ -1290,6 +1290,10 @@ def tests_view(request):
     chi2 = get_chi2(run["tasks"])
     update_residuals(run["tasks"], cached_chi2=chi2)

+    run = copy.copy(run)
+    run["tasks"] = []
+
     try:
         show_task = int(request.params.get("show_task", -1))
     except:
noobpwnftw commented 1 year ago

I have tested with no external requests(turning nginx off), and waited for long enough to conclude that it will not get out of the hang by itself.

vondele commented 1 year ago

It should be easy to do it incrementally. Of course get_results is absolutely vital for the primary function of Fishtest so I would keep some sanity checks for a while (like computing the results from the task list when the run finishes and comparing them with the incremental results).

@vdbergh is that something you could look into? I had a first look, but didn't get to the point of writing code for it.

I agree that having a sanity check would be good. Either at the end, or even whenever the run is written to the db.

vdbergh commented 1 year ago

I will try, if it is not too urgent. This weekend I will be a bit busy.

xoto10 commented 1 year ago

I hadn't seen the comment from @vdbergh and decided to practice my python programming by having a go at this. My attempt can be seen at: https://github.com/xoto10/fishtest/commit/9987a27f8b40ce49ba036ba2afd40430e6852b10

vdbergh commented 1 year ago

There is little point for two people to work on this. So I closed my PR. Please make a PR.

vdbergh commented 1 year ago

@xoto10 When the run finishes I would verify that the result of the incremental updates is the same as the result of summing the tasks. Otherwise there is no way to know that the code behaves correctly when heavily stressed.