official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests

get_machines is too slow and does not use the book keeping data #2101

Open peregrineshahin opened 4 months ago

peregrineshahin commented 4 months ago

https://tests.stockfishchess.org/tests/machines — this route should use the latest worker book-keeping optimizations, if they exist. It is also too slow under fleet load, ignoring the 10s cached HTTP response.

    def get_machines(self):
        # Fetch all unfinished runs, projecting only the fields we need.
        active_runs = self.runs.find({"finished": False}, {"tasks": 1, "args": 1})
        # Lazily build one record per active task: the worker_info dict
        # merged with the run, the task id and the last update time.
        machines = (
            task["worker_info"]
            | {
                "last_updated": task.get("last_updated") or None,
                "run": run,
                "task_id": task_id,
            }
            # Active tasks tend to sit at the end of the task list, so
            # the reversed scan lets runs with no active tasks bail out early.
            for run in active_runs
            if any(task["active"] for task in reversed(run["tasks"]))
            for task_id, task in enumerate(run["tasks"])
            if task["active"]
        )
        return machines
vdbergh commented 4 months ago

Well there is an intrinsic complexity here: a list of all active tasks needs to be made.

On the primary instance the db access could be eliminated by using the cache and the set of unfinished runs (which is dynamically kept up to date), but I believe get_machines runs on a secondary instance.

What we could do (in addition perhaps) is to create a scheduled task to regenerate the machines list every minute and then serve this list.
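A minimal sketch of that idea, assuming a hypothetical `MachinesCache` wrapper (Fishtest's actual scheduler API may differ): rebuild the list on a fixed interval in a background thread and serve the cached copy on demand.

```python
import threading
import time


class MachinesCache:
    """Regenerate an expensive list periodically and serve the cached copy."""

    def __init__(self, compute, interval=60.0):
        self._compute = compute        # callable that builds the machines list
        self._interval = interval
        self._lock = threading.Lock()
        self._machines = compute()     # initial fill so readers never see None
        thread = threading.Thread(target=self._loop, daemon=True)
        thread.start()

    def _loop(self):
        while True:
            time.sleep(self._interval)
            machines = self._compute()  # build outside the lock
            with self._lock:
                self._machines = machines

    def get(self):
        with self._lock:
            return self._machines
```

With this, requests just read the cached list instead of scanning the db, at the price of the list being up to one interval stale.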

peregrineshahin commented 4 months ago

there is an intrinsic complexity here: a list of all active tasks needs to be made.

How is that? The goal is to get the active machines; the "added" complexity of having to loop through the tasks is an implementation detail.

vdbergh commented 4 months ago

Actually you are right. The main instance has a data structure wtt_map, a dictionary which maps short worker names to tasks.

wtt_map_schema = {
    short_worker_name: (runs_schema, task_id),
}

So that could be used to get the machines list quickly on the main instance.
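To make that concrete, here is a hedged sketch of deriving the machines list from such a map, assuming entries of the shape `(run, task_id)` as in the schema above (the helper name `machines_from_wtt_map` is hypothetical, not Fishtest code):

```python
def machines_from_wtt_map(wtt_map):
    """Derive the machines list from the worker-to-task map.

    Assumes wtt_map maps short worker names to (run, task_id) pairs.
    Entries whose task has gone inactive are skipped, since the map
    may lag behind by up to one cleanup cycle.
    """
    machines = []
    for short_name, (run, task_id) in wtt_map.items():
        task = run["tasks"][task_id]
        if not task["active"]:
            continue  # stale entry awaiting periodic eviction
        machines.append(
            task["worker_info"]
            | {
                "last_updated": task.get("last_updated"),
                "run": run,
                "task_id": task_id,
            }
        )
    return machines
```

This is O(number of workers) rather than O(number of tasks across all unfinished runs), and needs no db round trip.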

vdbergh commented 4 months ago

Either the machine list api should be moved to a primary instance, or else an internal api should be created where a secondary instance asks the primary instance for the list (this may involve serialization/deserialization of json). An example of such an api is in #1993.

vdbergh commented 4 months ago

And as I said, an additional enhancement could be to recreate the list every minute in a scheduled task (rather than creating it on demand).

vdbergh commented 4 months ago

Note that the wtt_map itself may contain some inactive tasks (avoiding this clutters the code too much IMHO). These inactive tasks are evicted in a scheduled task that runs every minute.

peregrineshahin commented 4 months ago

I think we moved this API to the secondary instance because of performance issues, though I expected we would have such a map in the flow that could provide this information. I'm not quite sure about the 1 min reschedule: do you mean that the numbers shown in the home page cards currently only refresh every minute? That sounds like too much, TBH. I'm not generally a fan of rescheduled APIs, and the nature of book-keeping approaches doesn't require it, so this caught me by surprise.

vdbergh commented 4 months ago

Well I do not know where the bottleneck is. It could also be in mako.

Currently the table is computed on demand but there is a delay (of the order of 30-60s) before the db is fully updated.

I like to schedule expensive computing tasks at regular intervals rather than on demand (unless accuracy is critical). In that way one gets predictable performance independent of load.

Note that the new scheduler in Fishtest makes creating such tasks very easy.

peregrineshahin commented 4 months ago

Well, my point is that if we look at book-keeping as a technique, it is neither on-demand nor scheduled; it is neither of the two. So my question was: suppose each time a worker becomes active you add it to the dict, and when a worker becomes inactive you remove it. I'm not sure I understand why such a map would require scheduling.

vdbergh commented 4 months ago

I do not know how much time it takes to generate the html from the data structure. If it takes 1s and 60 people want to view the machines list, then there is no time to do anything else...

EDIT: Simplified of course...

peregrineshahin commented 4 months ago

Sorry, but I'm not following. My question is about the wtt_map: why do we refresh it every 1 min? I don't understand conceptually why such a map isn't always kept up to date, just like how run results get incrementally updated. Is it because we don't spot workers when they leave?

vdbergh commented 4 months ago

The wtt_map is always up to date (except that it contains some inactive tasks which are periodically cleaned). But you still need to generate the html from it. That's what I am talking about (during a visit of Vondele's fleet this will typically be a table with 10000 rows). Of course you could fiddle with paging but then sorting by username becomes unpleasant (I think).

PS.

The reason why there are some inactive tasks is that I did not want to clutter the set_inactive_task() function with code to update the wtt_map. Besides such code being messy one would also need to be careful to avoid deadlock.

vdbergh commented 4 months ago

Perhaps I should clarify that the wtt_map was not created to support the machine list, but to very efficiently detect duplicate workers (even when Vondele's fleet is visiting).

For the machine list perhaps another data structure would be more appropriate: a dictionary with keys the run_id's of unfinished runs and with values the sets of corresponding active tasks. I.e. with schema

{
    run_id: {task_id}
}

This would be very easy to keep up to date.
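A hedged sketch of that book-keeping (class name and methods are hypothetical, not Fishtest code): update the map incrementally as tasks start and finish, so no scheduled rebuild is needed.

```python
class ActiveTaskMap:
    """Book-keeping map from run_id to the set of its active task ids.

    Updated incrementally when tasks become active or inactive, so the
    machines list can be derived without scanning the database.
    """

    def __init__(self):
        self._map = {}  # {run_id: {task_id, ...}}

    def set_active(self, run_id, task_id):
        self._map.setdefault(run_id, set()).add(task_id)

    def set_inactive(self, run_id, task_id):
        tasks = self._map.get(run_id)
        if tasks is None:
            return
        tasks.discard(task_id)
        if not tasks:
            del self._map[run_id]  # drop runs with no active tasks left

    def active_tasks(self):
        # Return a shallow snapshot so callers can iterate safely.
        return {run_id: set(tasks) for run_id, tasks in self._map.items()}
```

The two mutating calls would sit on the same code paths that already flip a task's `active` flag.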

peregrineshahin commented 4 months ago

I'm not sure where to look, perhaps it would be easier for you to code the new function please.

For the machine list perhaps another data structure would be more appropriate: a dictionary with keys the run_id's of unfinished runs and with values the sets of corresponding active tasks. I.e. with schema

{ run_id: {task_id} } This would be very easy to keep up to date.

vdbergh commented 4 months ago

The thing is that I am not 100% sure that generating the data is the bottleneck. I wonder if it is not the generation of a table with > 15000 rows from that data (a table that also needs to be transferred to the client).

If the latter is the case, then moving the machines api to the primary instance would be bad, since there would be less time to handle the worker apis.

vdbergh commented 4 months ago

I guess we could try. But it would require a fleet visit to see the effect.

dubslow commented 4 months ago

Good thing the fleet is visiting right this very moment :) (the server is even holding fairly steady at 100+k cores for two hours now!)

vdbergh commented 4 months ago

Well it takes more than 15min to implement this...

vdbergh commented 4 months ago

Good thing the fleet is visiting right this very moment :) (the server is even holding fairly steady at 100+k cores for two hours now!)

Yes the server seems to be doing fine. Local api.log times remain stable (a few ms/api call except pgn_upload which takes ~ 20ms).

vondele commented 4 months ago

Server is indeed doing fine under this load now. Current api locally:

# logging from [06/Jul/2024:09:52:14 +0000] to [06/Jul/2024:09:53:10 +0000] ########
# duration (seconds)            : 56
# calls in total                : 10000
# calls per second              : 178
# calls not reaching the backend: 1
# calls handled by the backend  : 9999
#                         route    calls      total    average    minimum    maximum maxcount
                      /api/beat     6810    744.814      0.109      0.001     30.001       17
               /api/update_task     3047    201.978      0.066      0.001     30.001        4
                         /tests       12      7.307      0.609      0.015      0.997        0
                  /tests/tasks/       12      6.094      0.508      0.292      1.007        0
                     /tests/run        2      3.938      1.969      1.180      2.758        0
                   /tests/view/       14      0.846      0.060      0.010      0.159        0
              /api/request_spsa       50      0.820      0.016      0.003      0.114        0
                  /api/get_elo/        7      0.478      0.068      0.017      0.184        0
                   /api/actions       11      0.199      0.018      0.003      0.037        0
                       /api/nn/       28      0.158      0.006      0.004      0.007        0
                /tests/finished        1      0.092      0.092      0.092      0.092        0
                 /tests/approve        1      0.054      0.054      0.054      0.054        0
               /api/active_runs        2      0.025      0.013      0.006      0.019        0
           /api/request_version        1      0.005      0.005      0.005      0.005        0
              /api/request_task        1      0.003      0.003      0.003      0.003        0

A bit earlier the 3 pgn_upload pserves were 100% loaded, but not hanging (eventually we should just reduce the amount of STC pgns we upload, IMO, e.g. only for the first N tasks or so). We still have these small bursts of api calls that take 30s or so, even though the average timing is very low. The primary pserve is around 50% (30-70%) loaded.

vdbergh commented 4 months ago

Hmm the worker seems to log only successful api calls (which take a few ms), not those that time out (~30s I guess).

vondele commented 4 months ago

Note that these are rare (17 out of 6810) for the table above.

vdbergh commented 4 months ago

It would be nice to understand this though.

I wonder if the dead tasks scavenging has something to do with it. If this scheduled task (which runs every minute) has to suddenly handle 1000s of dead tasks (which involves acquiring active_run_locks) then perhaps this may temporarily clog things up.

As a safety measure I am thinking of only scavenging a limited number of dead tasks per turn, e.g. 500. Then clearing 17000 dead tasks would take 34 minutes. This is not entirely free, since it means runs may take longer to finish, but it seems like a worthwhile trade-off.
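A minimal sketch of the capping idea (function and parameter names are hypothetical; the actual change is in the linked PR below):

```python
def scavenge_dead_tasks(dead_tasks, set_inactive, max_per_run=500):
    """Deactivate at most max_per_run dead tasks per scheduled invocation.

    Capping the batch keeps each one-minute scavenging pass short, so a
    sudden flood of dead tasks (e.g. a fleet leaving) cannot monopolize
    the run locks; the remainder is handled on later passes.
    """
    handled = 0
    for run_id, task_id in dead_tasks:
        if handled >= max_per_run:
            break
        set_inactive(run_id, task_id)  # assumed to take the run's lock itself
        handled += 1
    return handled
```

With `max_per_run=500` and a one-minute schedule, 17000 dead tasks drain in 34 passes, matching the estimate above.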

vdbergh commented 4 months ago

See #2109

peregrineshahin commented 4 months ago

pagination needed in both cases

The thing is that I am not 100% sure that generating the data is the bottleneck. I wonder if it is not the generation of a table with > 15000 rows from that data (a table that also needs to be transferred to the client).

If the latter is the case, then moving the machines api to the primary instance would be bad, since there would be less time to handle the worker apis.

vdbergh commented 4 months ago

pagination needed in both cases

Yes but one should still be able to sort the whole table (not only the current page).

peregrineshahin commented 4 months ago

Actually vondele suggested this one which I forgot about:

or a smarter way to show the table. E.g. summary first per user, only after a click expand that user? (effectively with 10k workers, just scrolling or clicking through pages is not very practical)

vdbergh commented 4 months ago

Sounds good. Sorting by username is indeed the only thing I do.

vondele commented 4 months ago

I actually do sort per core count, compiler version, python version etc... So maybe the grouping should be 'per sorted item' and on a click expand that? In this case we have all functionality.
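One way to sketch that "group per sorted item" idea server-side (a hypothetical helper, not Fishtest code): sort by the chosen column, then collapse each group into a summary row that the client expands on click.

```python
from itertools import groupby
from operator import itemgetter


def group_machines(machines, key="username"):
    """Collapse the machines list into per-group summary rows.

    Each summary carries the group value, the worker count and the
    total core count (assuming a "concurrency" field per worker), so
    the UI can render one row per group and expand it on demand.
    """
    rows = sorted(machines, key=itemgetter(key))
    summary = []
    for value, group in groupby(rows, key=itemgetter(key)):
        workers = list(group)
        summary.append(
            {
                key: value,
                "workers": len(workers),
                "cores": sum(w.get("concurrency", 0) for w in workers),
                "items": workers,  # the expanded rows for this group
            }
        )
    return summary
```

Grouping by "compiler version" or "python version" is then just a different `key`, which covers the sorting use cases mentioned above with far fewer top-level rows.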

peregrineshahin commented 4 months ago

I actually do sort per core count, compiler version, python version etc... So maybe the grouping should be 'per sorted item' and on a click expand that? In this case we have all functionality.

Yeah doable.. will work on it