Open peregrineshahin opened 4 months ago
Well there is an intrinsic complexity here: a list of all active tasks needs to be made.
On the primary instance the db access could be eliminated by using the cache and the set of unfinished runs (which is dynamically kept up to date), but I believe get_machines runs on a secondary instance.
What we could do (in addition perhaps) is to create a scheduled task to regenerate the machines list every minute and then serve this list.
there is an intrinsic complexity here: a list of all active tasks needs to be made.
How is that? The goal is to get the active machines; the "added" complexity of looping through the tasks is an implementation detail..
Actually you are right. The main instance has a data structure `wtt_map`, a dictionary that maps short worker names to tasks:

```python
wtt_map_schema = {
    short_worker_name: (runs_schema, task_id),
}
```

So that could be used to get the machines list quickly on the main instance.
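A minimal sketch of that idea, assuming the `wtt_map` layout above (the record fields and the `"active"` flag on a task are illustrative, not the actual Fishtest code):

```python
# Hypothetical sketch: build the machines list from wtt_map on the
# primary instance, with no db access. Field names are assumptions.

def get_machines(wtt_map):
    """Return one record per active worker, derived from wtt_map."""
    machines = []
    for short_worker_name, (run, task_id) in wtt_map.items():
        task = run["tasks"][task_id]
        if not task.get("active"):
            continue  # wtt_map may still hold some inactive tasks
        machines.append(
            {
                "worker": short_worker_name,
                "run_id": str(run["_id"]),
                "task_id": task_id,
            }
        )
    return machines
```

The inactive-task check matters because, as noted below, the `wtt_map` is only periodically purged of inactive tasks.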
Either the machine list api should be moved to a primary instance, or else an internal api should be created where a secondary instance asks the primary instance for the list (this may involve serialization/deserialization of json). An example of such an api is in #1993.
And as I said, an additional enhancement could be to recreate the list every minute in a scheduled task (rather than creating it on demand).
Note that the `wtt_map` itself may contain some inactive tasks (avoiding this would clutter the code too much IMHO). These inactive tasks are evicted by a scheduled task that runs every minute.
I think we moved this API to the secondary instance because of performance issues. While I expected we would have such a map in the flow to provide this information, I'm not sure about the 1-minute reschedule. Do you mean that the numbers shown in the home page cards currently only refresh every minute? That sounds like too much, TBH. I'm generally not a fan of rescheduling APIs, and the nature of book-keeping approaches doesn't require it, so this caught me by surprise.
Well I do not know where the bottleneck is. It could also be in mako.
Currently the table is computed on demand but there is a delay (of the order of 30-60s) before the db is fully updated.
I like to schedule expensive computations at regular intervals rather than on demand (unless accuracy is critical). That way one gets predictable performance independently of load.
Note that the new scheduler in Fishtest makes creating such tasks very easy.
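The pattern being described can be sketched as follows. Fishtest's actual scheduler API is not shown here; this is a minimal stand-in using `threading.Timer`, with the class name and interface being my own invention:

```python
import threading

# Minimal stand-in for a periodic scheduled task. The expensive
# computation runs every `period` seconds in the background; page
# requests just serve the last cached result, so request latency is
# predictable regardless of load.

class PeriodicCache:
    def __init__(self, compute, period=60.0):
        self.compute = compute
        self.period = period
        self.value = compute()  # initial fill so reads never block
        self._schedule()

    def _schedule(self):
        t = threading.Timer(self.period, self._tick)
        t.daemon = True  # do not keep the process alive
        t.start()

    def _tick(self):
        # Recompute, then swap the reference (atomic in CPython).
        self.value = self.compute()
        self._schedule()

    def get(self):
        return self.value
```

A request handler would then call `get()` instead of rebuilding the machines list on every page view.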
Well, my point is that if we look at book-keeping as a technique, it is neither on demand nor scheduled; it is neither of the two. So my question was: suppose each time a worker becomes active you add it to the dict, and when a worker becomes inactive you remove it. I'm not sure I understand why such a map would require scheduling.
I do not know how much time it takes to generate the html from the data structure. If it takes 1s and 60 people want to view the machines list, then there is no time to do anything else...
EDIT: Simplified of course...
Sorry, but I'm not following. My question is related to the `wtt_map`: why do we regenerate it every minute? I don't understand conceptually why such a map isn't always kept up to date, just like how run results get incrementally updated. Is it because we don't spot workers as they leave?
The wtt_map is always up to date (except that it contains some inactive tasks which are periodically cleaned). But you still need to generate the html from it. That's what I am talking about (during a visit of Vondele's fleet this will typically be a table with 10000 rows). Of course you could fiddle with paging but then sorting by username becomes unpleasant (I think).
PS. The reason why there are some inactive tasks is that I did not want to clutter the `set_inactive_task()` function with code to update the `wtt_map`. Besides such code being messy, one would also need to be careful to avoid deadlock.
Perhaps I should clarify that the `wtt_map` was not created to support the machine list, but to very efficiently detect duplicate workers (even when Vondele's fleet is visiting).
For the machine list perhaps another data structure would be more appropriate: a dictionary with keys the run_ids of unfinished runs and with values the sets of corresponding active tasks, i.e. with schema

```python
{
    run_id: {task_id},
}
```

This would be very easy to keep up to date.
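Keeping such a structure up to date incrementally could look like this (the hook names `task_activated`/`task_deactivated` are hypothetical; in practice the updates would be triggered wherever a task changes state):

```python
# Sketch of the proposed structure: {run_id: {task_id}}, maintained
# incrementally as tasks change state. No scheduled rebuild needed.

active_tasks = {}  # run_id -> set of active task_ids

def task_activated(run_id, task_id):
    active_tasks.setdefault(run_id, set()).add(task_id)

def task_deactivated(run_id, task_id):
    tasks = active_tasks.get(run_id)
    if tasks is not None:
        tasks.discard(task_id)
        if not tasks:
            del active_tasks[run_id]  # drop runs with no active tasks
```

The machines page would then only need to walk the unfinished runs listed in this dict instead of scanning all tasks.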
I'm not sure where to look, perhaps it would be easier for you to code the new function please.
For the machine list perhaps another data structure would be more appropriate: a dictionary with keys the run_id's of unfinished runs and with values the sets of corresponding active tasks. I.e. with schema
{ run_id: {task_id} } This would be very easy to keep up to date.
The thing is that I am not 100% sure that generating the data is the bottleneck. I wonder if it is not the generation of a table with >15000 rows from that data (a table that also needs to be transferred to the client).
If the latter were the case, then moving the machines api to the primary instance would be bad, since there would be less time to handle the worker apis.
I guess we could try. But it would require a fleet visit to see the effect.
Good thing the fleet is visiting right this very moment :) (the server is even holding fairly steady at 100+k cores for two hours now!)
Well it takes more than 15min to implement this...
Good thing the fleet is visiting right this very moment :) (the server is even holding fairly steady at 100+k cores for two hours now!)
Yes, the server seems to be doing fine. Local `api.log` times remain stable (a few ms/api call, except `pgn_upload`, which takes ~20ms).
Server is indeed doing fine under this load now. Current api locally:
```
# logging from [06/Jul/2024:09:52:14 +0000] to [06/Jul/2024:09:53:10 +0000] ########
# duration (seconds) : 56
# calls in total : 10000
# calls per second : 178
# calls not reaching the backend: 1
# calls handled by the backend : 9999
# route calls total average minimum maximum maxcount
/api/beat 6810 744.814 0.109 0.001 30.001 17
/api/update_task 3047 201.978 0.066 0.001 30.001 4
/tests 12 7.307 0.609 0.015 0.997 0
/tests/tasks/ 12 6.094 0.508 0.292 1.007 0
/tests/run 2 3.938 1.969 1.180 2.758 0
/tests/view/ 14 0.846 0.060 0.010 0.159 0
/api/request_spsa 50 0.820 0.016 0.003 0.114 0
/api/get_elo/ 7 0.478 0.068 0.017 0.184 0
/api/actions 11 0.199 0.018 0.003 0.037 0
/api/nn/ 28 0.158 0.006 0.004 0.007 0
/tests/finished 1 0.092 0.092 0.092 0.092 0
/tests/approve 1 0.054 0.054 0.054 0.054 0
/api/active_runs 2 0.025 0.013 0.006 0.019 0
/api/request_version 1 0.005 0.005 0.005 0.005 0
/api/request_task 1 0.003 0.003 0.003 0.003 0
```
a bit earlier the 3 pgn_upload pserves were 100% loaded, but not hanging (eventually we should just reduce the amount of STC pgns we upload, IMO, e.g. only for the first N tasks or so). We still have these small bursts of api calls that take 30s or so, even though the average timing is very low. The primary pserve is around 50% (30-70%) loaded.
Hmm, the worker seems to log only successful api calls (which take a few ms), not those that time out (~30s, I guess).
Note that these are rare (17 out of 6810) for the table above.
It would be nice to understand this though.
I wonder if the dead-task scavenging has something to do with it. If this scheduled task (which runs every minute) suddenly has to handle 1000s of dead tasks (which involves acquiring `active_run_lock`s), then perhaps this may temporarily clog things up.
As a safety measure I am thinking of only scavenging a limited number of dead tasks per turn, e.g. 500. Then clearing 17000 dead tasks would take 34 minutes. This is not entirely free, because it means runs may take longer to finish, but it seems like a worthwhile trade-off.
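The cap could be sketched like this (function and parameter names are illustrative, not the actual Fishtest code):

```python
# Sketch of capped scavenging: handle at most `limit` dead tasks per run
# of the scheduled job, so a sudden pile of thousands of dead tasks is
# spread over several minutes instead of clogging a single iteration.

def scavenge_dead_tasks(dead_tasks, set_inactive, limit=500):
    """Process up to `limit` dead tasks; return the rest for next turn."""
    batch, remainder = dead_tasks[:limit], dead_tasks[limit:]
    for task in batch:
        set_inactive(task)  # lock acquisition happens in here
    return remainder
```

With 17000 dead tasks and `limit=500`, the backlog drains in 34 one-minute turns, matching the estimate above.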
See #2109
pagination needed in both cases
The thing is that I am not 100% sure that generating the data is the bottleneck. I wonder if it is not the generation of a table with >15000 rows from that data (a table that also needs to be transferred to the client).
If the latter were the case, then moving the machines api to the primary instance would be bad, since there would be less time to handle the worker apis.
pagination needed in both cases
Yes but one should still be able to sort the whole table (not only the current page).
Actually vondele suggested this one which I forgot about:
or a smarter way to show the table. E.g. summary first per user, only after a click expand that user? (effectively with 10k workers, just scrolling or clicking through pages is not very practical)
Sounds good. Sorting by username is indeed the only thing I do.
I actually do sort per core count, compiler version, python version etc... So maybe the grouping should be 'per sorted item' and on a click expand that? In this case we have all functionality.
I actually do sort per core count, compiler version, python version etc... So maybe the grouping should be 'per sorted item' and on a click expand that? In this case we have all functionality.
Yeah doable.. will work on it
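The "group per sorted item" idea can be sketched as follows: sort the machine records by the selected column, then emit one summary row per group, which the UI would expand on click. The field name `user` and the function name are hypothetical:

```python
from itertools import groupby

# Sketch of "group per sorted item": one collapsed summary row per
# distinct value of the sort key, instead of 10k individual rows.

def summarize(machines, key):
    machines = sorted(machines, key=lambda m: m[key])
    return [
        {"value": value, "workers": len(list(group))}
        for value, group in groupby(machines, key=lambda m: m[key])
    ]
```

The same function works for any sortable column (username, core count, compiler version, python version, ...), which is what makes the grouping approach cover all the existing sort functionality.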
https://tests.stockfishchess.org/tests/machines This route should use the latest workers book-keeping optimizations, if they exist. It's also too slow under fleet load, even ignoring the 10s cached http response.