vondele opened this issue 1 year ago
The cause is the iteration over all tasks in get_results. Probably update_task should trigger some update_results call, i.e. it must be able to execute without a full loop over all tasks.
Basically, the current scheme puts a load on the server that is quadratic in the number of workers: the rate of update_task calls increases linearly with the number of workers, and the task lists also grow with the number of workers, so the resulting load is quadratic.
Additionally, many failed workers (and hence longer task lists) make this problem even worse.
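To make the scaling argument concrete, here is a back-of-the-envelope model (the numbers are illustrative, not measured on Fishtest):

```python
# Illustrative model of the load described above: each active worker calls
# update_task roughly once per time unit, and each such call currently scans
# the whole task list via get_results. Numbers are made up for demonstration.
def per_second_work(num_workers, tasks_per_worker=1):
    updates_per_unit = num_workers                              # grows linearly
    tasks_scanned_per_update = num_workers * tasks_per_worker   # list grows linearly too
    return updates_per_unit * tasks_scanned_per_update          # product is quadratic

print(per_second_work(100))    # 10_000 task visits per time unit
print(per_second_work(1000))   # 1_000_000: 10x the workers, 100x the load
```

With an incremental update_results, the per-call cost would be O(1) and the total load would drop back to linear in the number of workers.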
It should be easy to do it incrementally. Of course get_results is absolutely vital for the primary function of Fishtest, so I would keep some sanity checks for a while (like computing the results from the task list when the run finishes and comparing them with the incremental results).
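A minimal sketch of the incremental idea, assuming a results dict with win/loss/draw counters (apply_task_update and the field names are hypothetical; the real rundb.py schema differs):

```python
# Hypothetical sketch: keep a running total in run["results"] and apply only
# the delta on each task update, so get_results no longer loops over tasks.
def apply_task_update(run, task, new_stats):
    old = task.get("stats", {"wins": 0, "losses": 0, "draws": 0})
    results = run.setdefault("results", {"wins": 0, "losses": 0, "draws": 0})
    for key in results:
        results[key] += new_stats[key] - old[key]   # incremental delta, O(1)
    task["stats"] = new_stats

def get_results(run):
    # O(1): no loop over run["tasks"] any more
    return run["results"]
```

The sanity check mentioned above would then compare this running total against a full re-sum of the task list when the run finishes.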
If I read this right, the oldest run in run_cache gets scavenged and flushed every second if it received any update?
https://github.com/glinscott/fishtest/blob/6be811d340d87e0430d63af73a8daf50ca490b1d/server/fishtest/rundb.py#L403
I think so (Tom told me this a long time ago; I haven't read the code).
The idea is that the main page should see reasonably up-to-date data.
Yeah, but it probably doesn't need to scavenge that often?
https://tests.stockfishchess.org/tests/view/644bfb62f609e0c9f1579ec0
This page takes too long to load.
8243 tasks, 50% with 0 games played.
Is it caused by mongodb or by the mako rendering?
Yes, that is true :)
It is busy on the stockfish mailing list these days.
If something is not clear, please ask and I'll try to answer.
Is it caused by mongodb or by the mako rendering?
It should be the rendering. The stats page gets the whole run as well, but it loads fast.
https://tests.stockfishchess.org/tests/stats/644bfb62f609e0c9f1579ec0
Production server hung with a flood of /tests/view/644bfb62f609e0c9f1579ec0 requests. As a workaround until the experts fix it:
diff --git a/server/fishtest/views.py b/server/fishtest/views.py
index 8b79f86..6c4db73 100644
--- a/server/fishtest/views.py
+++ b/server/fishtest/views.py
@@ -1290,6 +1290,9 @@ def tests_view(request):
chi2 = get_chi2(run["tasks"])
update_residuals(run["tasks"], cached_chi2=chi2)
+ run = copy.copy(run)
+ run["tasks"] = []
+
try:
show_task = int(request.params.get("show_task", -1))
except:
I guess the solution should be to use pagination in the task list (e.g. following the example of the event log). It would be a bit of extra hassle to make it compatible with show_task.
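A rough sketch of what the view-side pagination could look like (paginate_tasks and its parameters are hypothetical; only show_task comes from the existing code):

```python
# Hypothetical helper: render only one page of tasks instead of all of them.
# If show_task is set (deep link to a task), jump to the page containing it
# so existing show_task links keep working.
def paginate_tasks(tasks, page=0, page_size=100, show_task=-1):
    if 0 <= show_task < len(tasks):
        page = show_task // page_size
    start = page * page_size
    return page, tasks[start:start + page_size]
```

The mako template would then iterate over the returned slice only, which bounds the rendering cost per request regardless of how many tasks the run has.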
Not related to this issue, but there are some strange things in tests/view. Why is last_updated changed when you view the test?
That I don't understand either, nor do I understand why it changes on a worker "beat" even though there was no games update, while last_updated is used in the machines calculations.
Your added lines assume the problem is in the rendering; are we really sure about that?
Thank you noob! Not reviewed yet, still coffeeless. PROD updated on the fly with:
diff --git a/server/fishtest/views.py b/server/fishtest/views.py
index 8b79f86..3e6796f 100644
--- a/server/fishtest/views.py
+++ b/server/fishtest/views.py
@@ -1290,6 +1290,10 @@ def tests_view(request):
chi2 = get_chi2(run["tasks"])
update_residuals(run["tasks"], cached_chi2=chi2)
+ if len(run["tasks"]) > 1000:
+ run = copy.deepcopy(run)
+ run["tasks"] = []
+
try:
show_task = int(request.params.get("show_task", -1))
except:
EDIT_000: Switched to:
diff --git a/server/fishtest/views.py b/server/fishtest/views.py
index 8b79f86..3e6796f 100644
--- a/server/fishtest/views.py
+++ b/server/fishtest/views.py
@@ -1290,6 +1290,10 @@ def tests_view(request):
chi2 = get_chi2(run["tasks"])
update_residuals(run["tasks"], cached_chi2=chi2)
+ run = copy.deepcopy(run)
+ run["tasks"] = []
+
try:
show_task = int(request.params.get("show_task", -1))
except:
EDIT_001: Switched back to copy.copy(); PROD hangs with copy.deepcopy():
diff --git a/server/fishtest/views.py b/server/fishtest/views.py
index 8b79f86..3e6796f 100644
--- a/server/fishtest/views.py
+++ b/server/fishtest/views.py
@@ -1290,6 +1290,10 @@ def tests_view(request):
chi2 = get_chi2(run["tasks"])
update_residuals(run["tasks"], cached_chi2=chi2)
+ run = copy.copy(run)
+ run["tasks"] = []
+
try:
show_task = int(request.params.get("show_task", -1))
except:
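For context on why copy.copy() works here while copy.deepcopy() hangs production: a shallow copy duplicates only the top-level dict, so rebinding run["tasks"] on the copy leaves the cached run untouched, while deepcopy clones every nested task dict (thousands of them) on every page view. A minimal demonstration (the run contents are made up):

```python
import copy

# A stand-in for a cached run; the fields are illustrative only.
cached_run = {
    "args": {"tc": "10+0.1"},
    "tasks": [{"num_games": n} for n in (250, 100, 0)],
}

view_run = copy.copy(cached_run)   # shallow: cost is the top-level dict only
view_run["tasks"] = []             # rebinds the key on the copy, not the cache

assert cached_run["tasks"]         # the cached run still has its tasks
assert view_run["tasks"] == []     # the copy handed to the template does not
# copy.deepcopy would instead walk and clone every task dict per request.
```

The one caveat with a shallow copy is that mutating (rather than rebinding) a shared nested value would leak into the cache, which the patch above carefully avoids.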
I have tested with no external requests (turning nginx off), and waited long enough to conclude that it will not get out of the hang by itself.
It should be easy to do it incrementally. Of course get_results is absolutely vital for the primary function of Fishtest, so I would keep some sanity checks for a while (like computing the results from the task list when the run finishes and comparing them with the incremental results).
@vdbergh is that something you could look into? I had a first look, but didn't get to the point of writing code for it.
I agree that having a sanity check would be good. Either at the end, or even whenever the run is written to the db.
I will try, if it is not too urgent. This weekend I will be a bit busy.
I hadn't seen the comment from @vdbergh and decided to practice my python programming by having a go at this. My attempt can be seen at: https://github.com/xoto10/fishtest/commit/9987a27f8b40ce49ba036ba2afd40430e6852b10
There is little point in two people working on this, so I closed my PR. Please make a PR.
@xoto10 When the run finishes I would verify that the result of the incremental updates is the same as the result of summing the tasks. Otherwise there is no way to know that the code behaves correctly when heavily stressed.
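The verification suggested here could be as simple as re-summing the task list the old way and comparing it against the incremental total (all names below are hypothetical, mirroring the thread's description rather than the real schema):

```python
# Hypothetical sanity check to run when a run finishes (or on each db write):
# recompute the results from scratch and compare with the incremental total.
def results_from_tasks(run):
    totals = {"wins": 0, "losses": 0, "draws": 0}
    for task in run["tasks"]:
        stats = task.get("stats", {})
        for key in totals:
            totals[key] += stats.get(key, 0)
    return totals

def check_results(run):
    recomputed = results_from_tasks(run)
    if run.get("results", recomputed) != recomputed:
        # A mismatch means the incremental bookkeeping lost an update,
        # e.g. under heavy concurrent stress; log it loudly or raise.
        raise ValueError(f"results mismatch: {run['results']} != {recomputed}")
    return True
```

Running this on every finished run would give exactly the stress-test signal asked for here, at the cost of one final full pass over the tasks per run.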
The slowness appears to be due to update_task being 10x slower than before the fleet visit. Profiling the prod server, we see the following profile:
I.e. get_results is very expensive.