official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests

fishtest at 40k cores #553

Closed vondele closed 2 years ago

vondele commented 4 years ago

@ppigazzini @tomtor right now might be a good moment to gather some statistics on the fishtest performance under 40000 cores load. It seems more or less OK, but suffering a bit.

noobpwnftw commented 4 years ago

The queue seems very long but I think I cannot add more workers without causing more problems. Is it possible to produce a waterfall graph given current load to identify bottlenecks?

xoto10 commented 4 years ago

I think it's doing ok. People have been creating a lot of tests, and my LTC has a quarter of the cores to itself and has cranked out 136k LTC games in a short time. That would take a day or more normally.

xoto10 commented 4 years ago

I think there are too many cores again (20000+ !).

I timed an STC job for 1 minute and it increased by 300 games when I would expect about 1500. I imagine most workers are failing to update fishtest because of the overloading.

xoto10 commented 4 years ago

Now ~25k cores. An STC job with 760 cores got 280 games in a minute; it should get ~1500 (2*760), i.e. only 20% of expected throughput. fishtest is going slower with all these cores, not quicker, and it may crash again.

noobpwnftw commented 4 years ago

Yes, not going to add any more cores until the server can handle their work.

xoto10 commented 4 years ago

I think it can handle ~10k cores, especially if there are several LTC jobs running like now, but somewhere above that it overloads the server and then the throughput goes down.

I'm debugging my patch to display the games/minute correctly. When that is working and in prod, we will at least be able to see a realistic figure for the throughput.

linrock commented 4 years ago

if the /api/update_task call from the worker fleet is what's causing the CPU bottleneck, for starters, i'd suggest routing all requests to that API call to a dedicated pserve via nginx.

in that nginx config, /api and /tests are both handled by the same hammered pserve. the /tests pages can be CPU-intensive. giving /api/update_task its own pserve spreads the load.

so the nginx config could have these directives:

upstream backend_update_task {
    server 127.0.0.1:6545;
}

location /api/update_task {
    proxy_pass http://backend_update_task;
}

prod has 4 cores and only 2 pserves are running. each pserve can only run on one core, so adding a 3rd pserve distributes the CPU load a bit.

we can certainly figure out a way to remove that active_run_lock so these requests aren't bound to a single python process.


if you want to try an nginx hashing strategy for routing workers to pserves, you can have the workers send a header like X-Test-Run-ID with their requests, then use that header value as the hash key in nginx:

upstream backend_update_task {
    hash $http_x_test_run_id;    # hash on the worker-supplied X-Test-Run-ID header
    server 127.0.0.1:6546;
    server 127.0.0.1:6547;
    server 127.0.0.1:6548;
}
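
for reference, the worker side of this could be a small change. a rough sketch, assuming the worker uses the requests library (the X-Test-Run-ID header name is the hypothetical one from above):

import requests

# hypothetical sketch: tag every update with the run id this worker is processing,
# so nginx can hash on the X-Test-Run-ID header and pin a run's traffic to one pserve
def send_update(base_url, run_id, payload):
    headers = {"X-Test-Run-ID": str(run_id)}
    return requests.post(
        base_url + "/api/update_task", json=payload, headers=headers, timeout=30
    )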

of course, a misbehaving or malicious worker would cause a mess, so an ideal solution would remove the need for those locks and process-specific data structures. that way more pserves can transparently handle the load.

that means either centralizing the data shared between pserves (using redis or memcached) or updating the mongo document schema for runs/tasks so that the locks aren't needed at all.
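
to make the first option concrete, here is a rough sketch of centralizing per-task counters in redis (the key layout and field names are made up for illustration, not fishtest's actual schema):

import redis

# every pserve process talks to the same redis instance, so increments are shared
# and applied atomically across processes without a python-level lock
r = redis.Redis(host="127.0.0.1", port=6379, db=0)

def record_game_results(run_id, task_id, wins, losses, draws):
    key = f"run:{run_id}:task:{task_id}"
    pipe = r.pipeline()          # one round trip, executed atomically
    pipe.hincrby(key, "wins", wins)
    pipe.hincrby(key, "losses", losses)
    pipe.hincrby(key, "draws", draws)
    pipe.execute()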

tomtor commented 4 years ago

in that nginx config, /api and /tests are both handled by the same hammered pserve. the /tests pages can be CPU-intensive. giving /api/update_task its own pserve spreads the load.

The individual test pages need to access the same pserv that /api/update_task is directed to, in order to see the up-to-date data (which is cached in the pserv instance) for a specific test. This also applies to the live_elo gauge pages.

Distributing tests over more pserv instances is certainly possible, see also https://github.com/glinscott/fishtest/issues/553#issuecomment-609390005 but it will not solve the issue of having a single active test (the last one before the queues empties) and a lot of workers.

Storing the state of each test in a centralized location (Redis or Memcache or ...) which all Python pserv instances access is fine and not much work to add, but a scalable /api/update_task implementation (and /api/request_spsa) which can deal with many workers working on a single test is the interesting/challenging part. A concept implementation of that part with a proper high load test driver is needed first. I would like to write that in Rust, see e.g. https://rocket.rs/v0.4/guide/state/ but that's just my preference.

Using pypy to speed up without changing the current Python fishtest codebase is also still an option which could/should be tested:

https://github.com/glinscott/fishtest/pull/276#issuecomment-528894588

ppigazzini commented 4 years ago

@tomtor I will give pypy a try on a VM and then on DEV https://snapcraft.io/install/pypy3/ubuntu

vondele commented 4 years ago

actually, is the single test case really relevant? With the current limit of 250000 games per test and a batch size of 250, 'only' 1000 workers can be assigned anyway. The test will be over in minutes. If we can deal with the many workers - many tests case, we're covering the important case.

tomtor commented 4 years ago

actually, is the single test case really relevant? With the current limit of 250000 games per test and a batch size of 250, 'only' 1000 workers can be assigned anyway. The test will be over in minutes. If we can deal with the many workers - many tests case, we're covering the important case.

Yes, you're right, there is no functional need to be able to handle > 10000 cores for a single test, but 1000 workers with 16 cores (16000 cores) might overload a single pserv instance. So we need to guard against that overload, either by limiting the number of allocated workers for a single test (this depends on the TC, long TC allows more workers) or by being able to handle such a large number.

With 10000 cores it is very hard to keep the queue filled with sensible tests anyway, but it is fun when we could handle (much) more. Perhaps in the future some low-priority auto-tuning background jobs could profit from a large number of cores.

Vizvezdenec commented 4 years ago

actually not directly related, but can we increase default test limit to 500000? With new STC bounds there were multiple cases where I needed to do so manually. It's not that common, but also not extremely rare, for a test to surpass this number of games.

tomtor commented 4 years ago

actually not directly related, but can we increase default test limit to 500000?

I assume the impact will be limited when we have a low number of workers. It will however raise the impact of a single running test with a large number of workers, so it should be combined with a limit on the number of active workers on a single test. I will have a look at it...

tomtor commented 4 years ago

@linrock @Vizvezdenec @ppigazzini Raising the limit (400000 or 500000) could be added to #572 ?

Vizvezdenec commented 4 years ago

I think so.

ppigazzini commented 4 years ago

@tomtor fishtest and pypy3 working in a VM.

Here is the working installation:

sudo apt install -y gfortran libopenblas-dev build-essential
wget https://bitbucket.org/pypy/pypy/downloads/pypy3.6-v7.3.1-linux64.tar.bz2
tar -xf pypy3.6-v7.3.1-linux64.tar.bz2
# ${VENV} is the path of the target virtual environment
pypy3.6-v7.3.1-linux64/bin/pypy3 -m venv ${VENV}

cd ${HOME}/fishtest/fishtest
${VENV}/bin/pip install -e .

The first setup is very slow, but fortunately numpy and scipy are cached after the build from sources. The workflow for DEV should be changed.

pypy3 from snap and PPA has big problems with numpy/scipy:

ppigazzini commented 4 years ago

@tomtor DEV on pypy3 :)

vondele commented 4 years ago

@ppigazzini for me the following only returns after a long time (>10s) on dev, but returns quickly on prod:

https://dfts-0.pigazzini.it/actions?action=&user=foobar

That's the action log of a non-existent user. Maybe that's unrelated (and now, after the first very slow returns, it seems to be faster).

Other than that things seem to be working well.

ppigazzini commented 4 years ago

DEV was using #574, so different mongodb indexing for LTC, Green, and runs. Now DEV is on par with PROD (+ #588)

ppigazzini commented 4 years ago

@tomtor @noobpwnftw I think that we should test if DEV with pypy3 is able to deal with a bigger number of workers.

tomtor commented 4 years ago

I think that we should test if DEV with pypy3 is able to deal with a bigger number of workers.

@ppigazzini I or @linrock could adapt #589 to generate and time some synthetic load on api/update_task. Probably best to merge #589 first?

noobpwnftw commented 4 years ago

Well, currently I put some load on prod server, how does it look? I can raise the number of workers gradually if you like.

linrock commented 4 years ago

some thoughts on scaling the server across multiple cores. this can be done by removing the global active_run_lock and dependency on a single-process run cache:

if i'm missing anything, let me know. batching requests on the worker also sounds like a good idea. the minimal work a worker needs to do is to send the game stats to the server, and not necessarily after every game.


@ppigazzini I or @linrock could adapt #589 to generate and time some synthetic load on api/update_task. Probably best to merge #589 first?

having a way to synthetically benchmark the server throughput would be very helpful because then we can quantify improvements. even a script that hammers a fishtest server with requests and measures requests/s would be great.

profiling calls to update_task would also be helpful for quantifying where exactly the CPU bottlenecks are.
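
as a starting point, something as simple as this would already give a number to compare before/after changes. a rough sketch (the URL and payload fields are placeholders and would have to match the real /api/update_task contract):

import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:6543/api/update_task"   # assumed local pserve port
PAYLOAD = {"run_id": "...", "task_id": 0}        # placeholder fields only
N_REQUESTS = 1000
N_THREADS = 32

def hit(_):
    try:
        requests.post(URL, json=PAYLOAD, timeout=10)
    except requests.RequestException:
        pass

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=N_THREADS) as pool:
    list(pool.map(hit, range(N_REQUESTS)))
elapsed = time.perf_counter() - start
print(f"{N_REQUESTS / elapsed:.1f} requests/s")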

xoto10 commented 4 years ago

Well, currently I put some load on prod server, how does it look? I can raise the number of workers gradually if you like.

I've done a simple 60s timing test on stc jobs, and calculated the number of games played according to the page for that test compared to a theoretical value of 2*corecount. 3 tests gave values of 63%, 78% and 81%, e.g. https://tests.stockfishchess.org/tests/view/5e9690f8c2718dee3c822963 gave 480 games played in 60s compared to ~590 expected with 296 cores. This is with 4 LTC tests running and an LTC tune (the tune has only 92 cores though); I assume the LTC jobs lower the burden on the server. The rise from 63% of expected throughput to ~80% might come from 1 or 2 new LTCs starting between tests.

Currently ~7600 cores, so perhaps throughput / failed writes (under the current load) would be improved with 6000-6500 cores?

On the other hand, looking at my worker I don't see any failed updates, so perhaps the games are just 25% longer than I expect giving the 80% throughput number??

[ final test 550 games vs 658 expected - 84% ]

As an alternative measure, my worker is reporting updates generally taking 0.14-0.15s, but occasionally 0.8s or even 1.0s. Perhaps an average of these is a better indication of server load. (We could make the workers calculate a 5-minute average?)
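
Something like the following on the worker would do it; just a rough sketch (the class and method names are made up, this is not existing worker code):

import time
from collections import deque

class UpdateTimeAverage:
    # rolling average of /api/update_task round-trip times over a time window
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.samples = deque()              # (timestamp, duration) pairs

    def add(self, duration):
        now = time.time()
        self.samples.append((now, duration))
        # drop samples that fell out of the window
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()

    def average(self):
        if not self.samples:
            return 0.0
        return sum(d for _, d in self.samples) / len(self.samples)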

xoto10 commented 4 years ago
  * changing fields on a run or a task on a run currently requires overwriting the entire run

I think this is important and I intended to change this when I was looking at fishtest ages ago. From memory I think there are 2 kinds of updates (SPRT vs SPSA?) and the (SPRT?) ones could be a lot smaller, just updating one or two fields instead of the whole run structure.

Perhaps we could create two separate run lists, one for the simple updates and one that continues to use the current run.save()?

tomtor commented 4 years ago

Well, currently I put some load on prod server, how does it look? I can raise the number of workers gradually if you like.

@noobpwnftw Thanks! Load on PROD (7672 cores) varies mostly between 40% and 85% now. I noticed one spike of 102%.

Note that DEV (not PROD) has the pypy setup. PROD has no performance related changes (yet).

xoto10 commented 4 years ago

@tomtor Are you able to see any I/O figures for prod? Often database servers will get I/O bound rather than limited by cpu. Not as easy to tell whether high figures are just a bit high or at a limit though, since not quoted as %.

tomtor commented 4 years ago

active_run_lock is currently required because:

  • changing fields on a run or a task on a run currently requires overwriting the entire run
  • stats are being calculated in every call to update_task

    • these are used to tell the worker whether to continue this task
  • the /tests pages read the most recent run results from cache

In the Python run cache, fields ARE individually updated. The entire run IS being overwritten in MongoDB, but that is not an issue. The locks (one for each run) are ONLY needed for the correct calculation of the statistics (of each task and run total status) of each test.

The locks themselves are not the bottleneck. The Python GIL prevents threads from running simultaneously, which limits the max CPU load to 100%. Load is sometimes over 100% because the Python libraries that are implemented in C (and not in Python) can have simultaneously active threads.

Reading the run data for showing it to users is indeed not very critical, it may be stale. Also notifying the worker that it may stop may be delayed.

tomtor commented 4 years ago

Are you able to see any I/O figures for prod? Often database servers will get I/O bound rather than limited by cpu. Not as easy to tell whether high figures are just a bit high or at a limit though, since not quoted as %.

@xoto10 io load is low (10-70 tps)

xoto10 commented 4 years ago

@tomtor

... The entire run IS being overwritten in MongoDb, but that is not an issue.

I thought this slows the updates down quite a bit? Am I right in thinking the db updates are the slow part of the processing at the moment?

tomtor commented 4 years ago

I thought this slows the updates down quite a bit? Am I right in thinking the db updates are the slow part of the processing at the moment?

No, only a single run is written each second from the cache (the dirty and least recently written one).

vondele commented 4 years ago

I've done a simple 60s timing test on stc jobs, and calculated the number of games played according to the page for that test compared to a theoretical value of 2*corecount. 3 tests gave values of 63%, 78% and 81%, e.g. https://tests.stockfishchess.org/tests/view/5e9690f8c2718dee3c822963 gave 480 games played in 60s compared to ~590 expected with 296 cores.

@xoto10 actually, the estimate might not be so easy. For example, the TC is adjusted based on the NPS of the worker: two pgns I downloaded from noob's workers show '81.947+0.819' and '91.545+0.915' as the actual TC, which is quite different from the '60+0.6s' one assumes. Also, the average game length depends quite strongly on the number of plies. We will need to monitor the number of games played per second a bit more carefully, but let's take that to the other issue.

Edit: so the actual average time for a game at '60+0.6' is 194s on one worker and 254s on another.

linrock commented 4 years ago

The locks (one for each run) are ONLY needed for the correct calculation of the statistics (of each task and run total status) of each test.

ah right, flushing the run buffer to MongoDB uses two other global locks (run_cache_lock and run_cache_write_lock), not the active_run_lock

In the Python run cache, fields ARE individually updated. The entire run IS being overwritten in MongoDB, but that is not an issue.

this prevents concurrently writing runs to MongoDB. we'd need concurrent writes if we want to use multiple pserves for handling worker traffic. the entire run being overwritten is ok when writes are serialized like right now (multiple threads using global locks in a single process), but prevents scaling to multiple server cores because then concurrent writes would overwrite each other.

afaik, the entire run does not need to ever be overwritten. since the cache fields are individually updated, writing runs to MongoDB can update individual fields and nested fields (tasks) without overwriting the entire run (using update instead of replace_one). this is effectively using more granular locking, as @noobpwnftw had suggested.
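
for illustration, a granular write with pymongo could look roughly like this (the database/collection names and tasks layout are assumptions, not the exact schema):

from pymongo import MongoClient

runs = MongoClient()["fishtest_new"]["runs"]   # assumed db/collection names

def save_task_stats(run_id, task_id, stats):
    # touch only one task's stats instead of replacing the whole run document
    runs.update_one(
        {"_id": run_id},
        {"$set": {f"tasks.{task_id}.stats": stats}},
    )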

The locks themselves are not the bottleneck. The Python GIL prevents threads from running simultaneously, which limits the max CPU load to 100%.

the global locks are an architectural bottleneck because of the Python GIL and because the one pserve handling worker traffic hits the 100% cpu limit. the locks wouldn't be a bottleneck otherwise. we want to be able to serve worker traffic using all server cores, but that's not possible with a single server process since the GIL limits single-process cpu to 100%. some options are either (1) use a Python implementation without a GIL or (2) remove the need for the global locks, so multiple python processes can handle worker traffic concurrently.

Reading the run data for showing it to users is indeed not very critical, it may be stale. Also notifying the worker that it may stop may be delayed.

yup, and i think we can use these facts to design a server architecture that works around the GIL and allows multiple pserve processes to handle traffic concurrently.


so i'm thinking these server architectural changes would increase worker throughput:


there are many options to scale # of worker cores fishtest can handle with the current hardware. i'll continue reading the code and think more on this. if anyone has thoughts on this, feel free to point them out

tomtor commented 4 years ago

this prevents concurrently writing runs to MongoDB. we'd need concurrent writes if we want to use multiple pserves for handling worker traffic.

Locks are local to each pserv. So if different runs consistently use their own allocated pserv instance (eg hash on run-id) then mongodb writes WILL be parallel and they will not overwrite each other's results. Also updating mongodb is really no bottleneck in the current setup.

So it is not a difficult problem; only if we want multiple pserv instances handling the same run do things get complicated. But as @vondele pointed out, that is not really needed.

xoto10 commented 4 years ago

I'm currently getting 502 Bad Gateway at fishtest ?

ppigazzini commented 4 years ago

@xoto10 I'm merging #574; the pgns collection (20GB gzipped) is slowing down the process.

linrock commented 4 years ago

Locks are local to each pserv. So if different runs consistently use their own allocated pserv instance (eg hash on run-id) then mongodb writes WILL be parallel and they will not overwrite each other's results.

true, initially i was thinking this would be complicated if it required a worker upgrade, but i just found out that nginx can parse JSON. so we can get the run_id from API request JSON payloads and hash that from nginx to route /api/update_task requests to the correct pserve without updating workers: http://nginx.org/en/docs/njs/examples.html

however, requests to /tests currently also write to runs when calculating finished stats, and those requests can't be routed by run_id, so we'd need to change that for this hashing scheme to work. i think moving the stats calculations to a separate long-running process and making /tests only read from the runs collection (no writes) would solve this.
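
that long-running process could be as simple as a polling loop; a rough sketch (the field names and the completion check are placeholders, not the actual schema or logic):

import time
from pymongo import MongoClient

runs = MongoClient()["fishtest_new"]["runs"]   # assumed db/collection names

def run_is_complete(run):
    # placeholder: the real check would look at SPRT state / games played
    return False

def finalize_loop(poll_seconds=10):
    while True:
        for run in runs.find({"finished": False}):
            if run_is_complete(run):
                runs.update_one({"_id": run["_id"]}, {"$set": {"finished": True}})
        time.sleep(poll_seconds)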

Also updating mongodb is really no bottleneck in the current setup.

i wasn't suggesting any mongodb performance improvements here. just that a more granular locking scheme for writes to the runs collection opens up architectural possibilities. i don't yet see a downside to using update instead of replace in all current use cases where runs are saved.

linrock commented 4 years ago

@ppigazzini the PGNs could be moved out of mongo and into gzipped static files. then nginx can serve them instead of pserves having to handle PGN requests. in this case, the mongo pgn collection would store file paths to PGNs instead of zipped binary blobs, and wouldn't need to be purged as frequently, depending on how much disk space the server has.
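
a rough sketch of that write path (the directory, db/collection and field names are all assumptions):

import gzip
from pathlib import Path
from pymongo import MongoClient

PGN_DIR = Path("/var/www/fishtest/pgn")        # a directory nginx serves directly
pgns = MongoClient()["fishtest_new"]["pgns"]   # assumed db/collection names

def store_pgn(run_id, task_id, pgn_text):
    PGN_DIR.mkdir(parents=True, exist_ok=True)
    path = PGN_DIR / f"{run_id}-{task_id}.pgn.gz"
    with gzip.open(path, "wt") as f:           # gzipped on disk, nginx can serve it as-is
        f.write(pgn_text)
    # keep only the path in mongo instead of the zipped blob
    pgns.insert_one({"run_id": run_id, "task_id": task_id, "path": str(path)})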

also, we could update the devops process to either not re-create the index on the pgn collection or create the index in the background to minimize downtime when downtime is needed. it sounds like re-creating that one index takes a really long time (20+ minutes?) in prod.

ppigazzini commented 4 years ago

@linrock I approve the suggestion to serve the PGNs with nginx, but wait for @tomtor's opinion before starting to code :) Yes, PGN indexing takes 20 minutes... the script is running the indexing step now.

tomtor commented 4 years ago

however, requests to /tests currently also write to runs when calculating finished stats, and those requests can't be routed by run_id, so we'd need to change that for this hashing scheme to work.

@linrock see https://stackoverflow.com/questions/31994395/how-to-use-url-pathname-as-upstream-hash-in-nginx

Edit: so individual test pages can be directed to the correct pserv for running tests. Finished tests are always read-only.

tomtor commented 4 years ago

I approve the suggestion to serve the PGNs with nginx, but wait for @tomtor's opinion before starting to code :)

@linrock Serving pgns with nginx is fine with me...

linrock commented 4 years ago

i did a simple benchmark of python3 vs. PyPy for ELO stats and it seems PyPy is a lot slower.. These are the python versions i tested:


time python3 calc_stats.py ~ 130 calculations/second

real    0m7.668s
user    0m7.519s
sys 0m0.109s

time pypy3 calc_stats.py ~ 40 calculations/second

real    0m24.960s
user    0m24.546s
sys 0m0.198s

this is in calc_stats.py:

from fishtest.stats.stat_util import SPRT_elo

for i in range(0, 1000):
  SPRT_elo({
    'wins': 65388,
    'losses': 65804,
    'draws': 56553,
    'pentanomial': [10789, 19328, 33806, 19402, 10543]
  }, elo0=-3, elo1=1, elo_model='BayesElo')

ppigazzini commented 4 years ago

@linrock same result on DEV:

linrock commented 4 years ago

Edit: so individual test pages can be directed to the correct pserv for running tests. Finished tests are always read-only.

true, the problem is that tests are currently marked as finished when someone visits the homepage (what i meant by /tests). when visiting the homepage and not hitting the cached results (cached for 2s), the request causes the pserve you're on to loop through all unfinished runs and then overwrite each run when it decides the run is finished.

overwriting runs is currently safe since /tests and /api/update_task are handled on one pserve and share the same read/write locks for runs and the same run cache.

but if /api/update_task requests are routed to different pserves, then homepage requests can no longer safely overwrite runs. possible options here are:

tomtor commented 4 years ago
  • runs are marked as finished during an /api/update_task request, not a homepage request

@linrock This is the most elegant option. Once we drop the "auto purge" feature code hack (this has been discussed, and @vondele has no problem with it), this is a trivial change.

tomtor commented 4 years ago

i did a simple benchmark of python3 vs. PyPy for ELO stats and it seems PyPy is a lot slower..

Yes, that is not very promising. Pypy is good at optimizing loops in Python code, but SPRT_elo spends most of its time in C library code. The same probably applies to the text/JSON handling in the HTTP/API stack, so I am afraid that Pypy will not offer us a boost in speed.

vondele commented 4 years ago

I think if there is no compelling argument in favour of pypy, it might be safer to avoid it.

btw, if I understand correctly, the timings suggest that calling SPRT_elo could be the bottleneck. It's about 6ms per call, and right now we're running at >100 games/s. If we update for every game, that fits with the load of ±60%.

This would also suggest that @linrock's suggestion (IIRC) to update the status of the running jobs only e.g. once per second (and thus have 1 call per second per active job to SPRT_elo) would fix the major bottleneck. That would not need batching on the worker, so it might be an easier change.
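
A rough sketch of that idea, caching the last result per run and recomputing at most once per second (the names are illustrative, not the actual server code):

import time
from fishtest.stats.stat_util import SPRT_elo

_last_sprt = {}   # run_id -> (timestamp, result)

def sprt_status(run_id, results, elo0, elo1, elo_model="BayesElo"):
    now = time.time()
    cached = _last_sprt.get(run_id)
    if cached and now - cached[0] < 1.0:
        return cached[1]                 # reuse the result computed less than 1s ago
    result = SPRT_elo(results, elo0=elo0, elo1=elo1, elo_model=elo_model)
    _last_sprt[run_id] = (now, result)
    return result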

ppigazzini commented 4 years ago

Here are the pypy official performance analyses: https://bitbucket.org/pypy/pypy/wiki/JitFriendliness https://www.pypy.org/performance.html

vondele commented 4 years ago

the reason why I'm 'moderately enthusiastic' is that it might be a dependency that gets us in trouble later. For example, certain packages don't seem to work out of the box http://packages.pypy.org/ (including scipy, pandas, ...), which might be fine right now but could e.g. become a problem with version updates. If we do get the server e.g. 2x faster, but later have to roll back pypy, we could suddenly be in trouble. If we have a chance to improve performance more fundamentally (so that it scales), I think that is better, even if it is obviously more work upfront. However, I'm more than happy to leave the decision to adopt it to people with more experience; I'm really not too experienced in this, just adding my 2 cents here.

tomtor commented 4 years ago

This would also suggest that @linrock's suggestion (IIRC) to update the status of the running jobs only e.g. once per second (and thus have 1 call per second per active job to SPRT_elo) would fix the major bottleneck. That would not need batching on the worker, so it might be an easier change.

One advantage of batching on the worker is that network traffic is also reduced.

EDIT: Looking at the code at: https://github.com/glinscott/fishtest/blob/a3ae475b5af9565cbbaf913024c45bf18dd47e9b/worker/games.py#L330

What if we only execute this API call when (game_cnt % N-processors == 0 or worker_finished)? Would that do the job?
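
In code, the batching condition could look roughly like this (variable names are hypothetical, not the actual games.py code):

def should_send_update(game_cnt, n_processors, worker_finished):
    # send the batched stats every n_processors games, or when the task ends
    return worker_finished or (game_cnt > 0 and game_cnt % n_processors == 0)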