official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests

fishtest at 40k cores #553

Closed vondele closed 2 years ago

vondele commented 4 years ago

@ppigazzini @tomtor right now might be a good moment to gather some statistics on the fishtest performance under 40000 cores load. It seems more or less OK, but suffering a bit.

ppigazzini commented 4 years ago

I'm 'moderately enthusiastic'

From the server maintenance POV I'm 'moderately skeptic' about pypy3 after spending a whole day getting a working pypy3/numpy/scipy setup. There is very little information available; I tried 4 distributions.

I'm also not happy to see that pypy3 requires code refactoring to avoid a slowdown.

ps: the issue at http://packages.pypy.org/##scipy is simply a missing libopenblas-dev (or another BLAS package)

tomtor commented 4 years ago

Ok, let's no longer pay attention to pypy.

My vote would be to focus on the worker batching now, which could be a small code change.
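
For context, one possible shape of that change, as a rough sketch only (the function and parameter names are hypothetical, not fishtest's actual worker API): instead of reporting each finished game to the server individually, a worker accumulates a small batch of results and posts them in a single update call.

def run_batch(play_one_game, post_update, batch_size=8):
    # Play a fixed number of games locally, then report the aggregate
    # result to the server in one request instead of one request per game.
    wins = losses = draws = 0
    for _ in range(batch_size):
        result = play_one_game()  # returns "win", "loss" or "draw"
        wins += result == "win"
        losses += result == "loss"
        draws += result == "draw"
    post_update({"wins": wins, "losses": losses, "draws": draws})

This cuts the number of server requests per task by roughly the batch size, at the cost of slightly coarser progress updates.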

vondele commented 4 years ago

I'm 'moderately enthusiastic'

I'm 'moderately skeptic'

could well mean exactly the same thing ;-)

xoto10 commented 4 years ago

When fishtest has a lot of extra cores (thanks noob!) and not many jobs, I think it struggles a bit when a job finishes and 1000+ cores need to be reallocated. At other times, if I'm around and my jobs are at < -2 or -2.5 and fishtest is busy, I tend to cut their throughput in half to let more promising jobs get some extra cores. Perhaps we could modify the itp calculation to increase the drop in cores when jobs get below -2 or so?

e.g.

if llr < -2:
    itp *= 1 + (2 + llr) / 2

(perhaps only if games played > 20000?)
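
A minimal self-contained sketch of that idea (the helper name, the clamp and the 20000-games guard are illustrative assumptions, not fishtest's actual itp code):

def scaled_itp(itp, llr, games_played, min_games=20000):
    # Shrink a run's share of cores (itp) once its LLR drops below -2,
    # but only after it has played enough games.
    if games_played > min_games and llr < -2:
        # llr = -2.0 -> factor 1.00, llr = -2.5 -> 0.75, llr = -3.0 -> 0.50
        factor = 1 + (2 + llr) / 2
        itp *= max(factor, 0.25)  # clamp so a run always keeps a minimal share
    return itp

print(scaled_itp(100.0, -2.5, 30000))  # 75.0
print(scaled_itp(100.0, -2.5, 10000))  # too few games, unchanged: 100.0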

[ I just saw DragonMist66 comment on fishtest being down, and I saw the "Bad Gateway" error earlier. I don't think it was down; I suspect it was just very busy for a short time ]

vondele commented 4 years ago

@xoto10 there are quite regular maintenance updates right now to add new features, see https://github.com/glinscott/fishtest/pull/598#issuecomment-614579906

There are like 3-5 new patches merged every day on a running system; it is amazing the downtimes are so short. Excellent work by @ppigazzini, @linrock and @tomtor

ppigazzini commented 4 years ago

There are like 3-5 new patches merged every day on a running system; it is amazing the downtimes are so short. Excellent work by @ppigazzini, @linrock and @tomtor

EDIT

Merge process and fishtest downtime:

noobpwnftw commented 4 years ago

Made it to about 20K cores, looks stable, do you want to push it a little bit?

vondele commented 4 years ago

Still feels responsive indeed. Maybe @ppigazzini can post a top?

tomtor commented 4 years ago

It is around 70%....

vondele commented 4 years ago

not bad. We do have >50% of the cores assigned to LTC tests, so that makes it a bit easier.

tomtor commented 4 years ago

Pushed "Close" by accident on my phone. Yes, LTC reduces the load a lot.

linrock commented 4 years ago

looking good! all we need is another 2x speedup and we'll be at 40k cores :)

ppigazzini commented 4 years ago

50% < CPU core load < 85%

top - 23:47:33 up 12 days,  7:14,  2 users,  load average: 1.07, 1.29, 1.23
Tasks:  33 total,   1 running,  32 sleeping,   0 stopped,   0 zombie
%Cpu0  :  10.0/1.0    11[|||||||||||                                                                                         ]
%Cpu1  :  13.7/1.0    15[|||||||||||||||                                                                                     ]
%Cpu2  :  18.9/1.0    20[||||||||||||||||||||                                                                                ]
%Cpu3  :  10.3/1.0    11[|||||||||||                                                                                         ]
KiB Mem : 65.5/5242880  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                                  ]
KiB Swap:  0.0/0        [                                                                                                    ]

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20      224908   2316   1000 S        0.0   0:07.28 init -z
    2 root      20                           S                       `- [kthreadd/1988]
    3 root      20                           S                           `- [khelper]
   65 root      20      604284  27676  20376 S        0.5   9:48.71  `- /lib/systemd/systemd-journald
  167 root      20       42092    428    376 S        0.0   0:00.91  `- /lib/systemd/systemd-udevd
  170 systemd+  20       71848    412    348 S        0.0   0:01.07  `- /lib/systemd/systemd-networkd
  295 root      20       62100    720    548 S        0.0   0:01.44  `- /lib/systemd/systemd-logind
  296 message+  20       47724    628    300 S        0.0   0:00.58  `- /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
  300 syslog    20      189016   1580    356 S        0.0   3:12.49  `- /usr/sbin/rsyslogd -n
  301 root      20      184296    308    272 S        0.0   0:00.11  `- /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
  305 mongodb   20     3448480 1.777g   5400 S   1.7 35.5   1410:54  `- /usr/bin/mongod --config /etc/mongod.conf
  314 root      20       72288    484    372 S        0.0   0:00.12  `- /usr/sbin/sshd -D
13714 root      20       97180    568    416 S        0.0                `- sshd: fishtest [priv]
13729 fishtest  20       97180    476    224 S        0.0   0:00.53          `- sshd: fishtest@pts/0
13730 fishtest  20       20480   1252    852 S        0.0   0:00.11              `- -bash
25680 root      20       97180   2096   1156 S        0.0   0:00.02      `- sshd: fishtest [priv]
25707 fishtest  20       97180   1600    652 S        0.0   0:00.27          `- sshd: fishtest@pts/1
25708 fishtest  20       19660   2352    764 S        0.0   0:00.04              `- -bash
30111 fishtest  20       36752   1644   1036 R        0.0   0:03.08                  `- top
  581 root      20       24176    288    284 S        0.0            `- /usr/sbin/xinetd -pidfile /run/xinetd.pid -stayalive -inetd_compat -inetd_ipv6
  594 Debian-+  20       59184    516    404 S        0.0   0:01.02  `- /usr/sbin/exim4 -bd -q30m
  597 root      20      100972    272    244 S        0.0            `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
  598 root      20      100972     28        S        0.0                `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
  619 root      20      142520   2576   1104 S        0.0   0:00.03  `- nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
31264 www-data  20      147840   8720   2400 S   5.0  0.2  70:19.62      `- nginx: worker process
31265 www-data  20      143716   4464   2248 S   0.7  0.1   3:22.87      `- nginx: worker process
31266 www-data  20      143320   4204   2132 S        0.1   0:01.77      `- nginx: worker process
31267 www-data  20      142864   3448   1652 S        0.1   0:00.06      `- nginx: worker process
16839 root      20       28344    744    460 S        0.0   0:00.44  `- /usr/sbin/cron -f
 2705 fishtest  20     2032992 547004   4564 S       10.4   8:22.20  `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6544
 2709 fishtest  20     2344096 981876   4948 S  52.3 18.7 293:19.73  `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6543
32003 root      20       13008    840    712 S        0.0            `- /sbin/agetty -o -p -- \u --noclear tty2 linux
32005 root      20       13008    848    716 S        0.0            `- /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 linux
noobpwnftw commented 4 years ago

In fact, to support more throughput we might want to use centralized build servers and so on. But I'd say that with the current test submission rate, 20k cores are more than enough. We can queue tests overnight and have the fleet finish them by the next day.

ppigazzini commented 4 years ago

not bad. We do have >50% of the cores assigned to LTC tests, so that makes it a bit easier.

I loaded on DEV the DB snapshot from the start of the 20k run (hourly backup):

Vizvezdenec commented 4 years ago

Let's be honest @noobpwnftw, we can't keep 20k cores loaded on a stable basis :) I would need to just write patch after patch for this... And not only me, but 3-4 people. We simply don't have this number of devs :D

Vizvezdenec commented 4 years ago

I'm not trying to be offensive or to call all of you guys' work useless, just saying that the server works fluidly on 20k cores and we probably won't need more in any foreseeable future - as soon as 2 of the people currently online go to sleep, the queue will go idle in no time.

linrock commented 4 years ago

maybe there are good ways to make use of the extra cores? examples:

Vizvezdenec commented 4 years ago

Well, scheduling tests requires actually writing code for it :) It's not that easy to beat the speed at which tests complete when 20k cores are running - it's hard to produce any meaningful ideas at this tempo. About LTCs - usually people submit other people's LTCs if the queue comes close to idling. Something can surely be done about getting more devs, but I don't know what. Maybe we could start some big random tuning just to pour computing power into it, but I have no clue how to do tuning, so it's not up to me.

NKONSTANTAKIS commented 4 years ago

This seems to me like good timing to raise the STC: ample resources, better correlation, less scaling suppression, higher quality chess, less server CPU load.

noobpwnftw commented 4 years ago

Extra cores will only appear when there is a long queue with meaningful tests to run. :)

vondele commented 4 years ago

yes, it is important to realize that this level of resources is exceptional. From experience, having 'near interactive' feedback on patch ideas really does help to come up with good stuff, but there is definitely no need to 'just fill the queues', be it with random tests or excessive changes in protocols.

I agree with @linrock on making it easy for new devs to join and on keeping the develop-and-test process smooth... good ideas and an active community are what we need to foster.

With 1-2 orders of magnitude increase in resources, we might also be able to do completely new stuff, but that needs some thinking and experimenting. I do have some ideas there, but for prototyping fishtest might not be the best environment.

31m059 commented 4 years ago

I'm sorry this is such a long message :)

If I may, I would like to add two points in response to @Vizvezdenec:

Well, scheduling tests requires actually writing code for it :) It's not that easy to beat the speed at which tests complete when 20k cores are running - it's hard to produce any meaningful ideas at this tempo. .... Something can surely be done about getting more devs, but I don't know what.

First, I don't think there's actually anything wrong with letting the queue go empty, in principle. Yes, there's a technical issue of the workers all reconnecting at once when the next test is submitted, but we can fix that. We Stockfish developers are the only people I know who, while waiting on four other people in a queue, worry that there are too few people waiting ahead of us and wish there were more :)

Yes, it might seem that if the queue goes empty, the workers' CPU time is wasted. But I would argue a fairly unusual idea: in terms of the long-term trajectory of the Stockfish project, there is no such thing as a wasted CPU-hour on fishtest.

We've already seen in recent days how an excess of worker supply has produced an exceptional number of quality tests, and a rate of Elo progress unprecedented over years of Stockfish development. Having the queue close-to-empty provokes the submission of far more tests than otherwise, and some of these succeed. But there's more than this....

Before I joined the Stockfish developer community a little over 2 years ago, I watched quietly for a long time but never submitted a test. I was finally moved when I saw the framework going empty night after night--I believe this was shortly after @noobpwnftw began his donations? One month later I found my first Elo-gainer, and I was here to stay.

I know I'm not the only one like this. In response to your point,

Something can surely be done about getting more devs, but I don't know what.

The supply of cores itself is a great approach. Over the past few years, it has certainly seemed to me that new developers tend to appear and join our team precisely when there is plenty of room for them on the framework. Even when the queue goes empty, every CPU-hour is put to good use supporting the long-term health of the Stockfish project.

In summary: more cores means more tests from our current developers, and more developers too!

Vizvezdenec commented 4 years ago

fishtest is dead?

vondele commented 4 years ago

yes, see also https://github.com/glinscott/fishtest/pull/611#issuecomment-616113388

noobpwnftw commented 4 years ago

Went to sleep with 10k cores running about 40 tests, so I guess after a while the queue went dry and load spiked?

vondele commented 4 years ago

yes trying to get the right flag for the right ip :-) (#611)

tomtor commented 4 years ago

@linrock @ppigazzini The "show machines" button still causes this when 10000 cores are active:

Apr 21 14:57:04 tests pserve[2489]: 2020-04-21 14:57:04,292 WARNI [waitress.queue][MainThread] Task queue depth is 21

Generating/sending this long list is expensive; we will have to look at this...

vondele commented 4 years ago

BTW, with the arrival of 4-core workers, the maximum number of cores assigned to one test is about 4000 (250000 games / test, 250 games / task -> 1000 machines, 4000 cores). This could be another thing to consider if we adjust max games / test and task size.
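
A quick back-of-the-envelope check of those numbers (the inputs are the figures quoted above, not fishtest constants):

games_per_test = 250_000
games_per_task = 250
cores_per_worker = 4  # the new 4-core workers

max_machines = games_per_test // games_per_task  # 1000 machines, one active task each
max_cores = max_machines * cores_per_worker      # 4000 cores
print(max_machines, max_cores)

So changing the games per task or the max games per test directly changes how many cores a single test can absorb.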

linrock commented 4 years ago

yeah, generating the machines list is very expensive since it iterates over all tasks of all unfinished runs, and then it has to render a long list. So the performance gets worse as the number of worker cores goes up.

probably better to continuously regenerate a cached machines list in a long-running process every few seconds, so that homepage requests don't clog up the webserver queue. Slow requests will back up the waitress queue, so ideally any expensive calculations don't happen during a web request/response cycle and are computed outside of pserve.
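
Something along these lines, as a rough sketch (the class and method names are hypothetical; the real fix would live in fishtest's own code):

import threading
import time

class MachinesCache:
    # A background thread rebuilds the expensive machines list every few
    # seconds; request handlers only ever read the latest snapshot.
    def __init__(self, build_machines_list, interval=5.0):
        self._build = build_machines_list  # the expensive function, injected
        self._interval = interval
        self._snapshot = []
        self._lock = threading.Lock()
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def _refresh_loop(self):
        while True:
            snapshot = self._build()  # heavy work happens outside any request
            with self._lock:
                self._snapshot = snapshot
            time.sleep(self._interval)

    def get(self):
        # Called from the request handler: cheap and never blocks on a rebuild.
        with self._lock:
            return self._snapshot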

noobpwnftw commented 4 years ago

Fetching tasks is slow and timing out (with only 1500 workers).

Worker version 73 connecting to http://tests.stockfishchess.org:80
Fetch task...
Worker version checked successfully in 0.376313s
Task requested in 5.694125s
No tasks available at this time, waiting...

Fetch task...
Worker version checked successfully in 0.842559s
Task requested in 5.293982s
No tasks available at this time, waiting...

Fetch task...
Worker version checked successfully in 8.059501s
Task requested in 9.499962s
No tasks available at this time, waiting...

Fetch task...
Exception accessing host:
HTTPConnectionPool(host='tests.stockfishchess.org', port=80): Read timed out. (read timeout=15.0)

Looks like a lot of flags are missing; might want to try this: https://pythonhosted.org/python-geoip/

ppigazzini commented 4 years ago

The geoip service is offline :( https://freegeoip.app/json/

tomtor commented 4 years ago

The geoip service is offline :(

#614 dropped the Python flag cache. I will re-add it and limit concurrent flag queries to one in a PR.
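
Roughly like this, as a sketch (the freegeoip endpoint is the one mentioned above; the cache layout, response field name and timeout are assumptions, not the actual PR):

import json
import threading
import urllib.request

_flag_cache = {}
_flag_lock = threading.Lock()  # allow at most one geoip lookup at a time

def country_code(ip):
    if ip in _flag_cache:
        return _flag_cache[ip]
    with _flag_lock:
        if ip in _flag_cache:  # another thread may have filled it meanwhile
            return _flag_cache[ip]
        try:
            with urllib.request.urlopen(
                "https://freegeoip.app/json/" + ip, timeout=5
            ) as response:
                code = json.load(response).get("country_code", "")
        except Exception:
            code = ""  # fail soft: show no flag instead of blocking the page
        _flag_cache[ip] = code
        return code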

xoto10 commented 4 years ago

Question: do tuning tests stress the server harder than ordinary SPRT tests? I seem to remember we thought so a year or two ago when there were server issues. I notice there have been quite a few tuning runs recently, and just now there is one running at 5+0.05. Are these a concern?

tomtor commented 4 years ago

Question: do tuning tests stress the server harder than ordinary SPRT tests?

Yes, they do, because the tuning parameters have to be transmitted and recomputed. In addition, they have to be stored to draw the pretty graphs. More parameters increase the stress.

Edit: See also #586
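
For intuition, a stripped-down SPSA-style update (illustrative only, not fishtest's actual tuner): after every batch of games the server recomputes one value per parameter and appends it to the run's history for the graphs, so both CPU and storage grow with the parameter count.

def spsa_update(params, batch_result, iteration, history):
    # params: name -> {"theta": value, "a": gain, "c": step, "delta": +/-1}
    # batch_result: score difference (wins - losses) of the perturbed pair
    for name, p in params.items():
        a_k = p["a"] / iteration ** 0.602  # standard SPSA gain sequences
        c_k = p["c"] / iteration ** 0.101
        gradient = batch_result / (2 * c_k * p["delta"])
        p["theta"] += a_k * gradient
        # kept so the per-parameter graphs can be drawn later
        history.setdefault(name, []).append(p["theta"])
    return params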

Vizvezdenec commented 4 years ago

My worker can't fetch any tasks (says no tasks available) while fishtest is clearly not empty.

ppigazzini commented 4 years ago

@Vizvezdenec try now, I stopped a test with a bad time control format https://github.com/glinscott/fishtest/pull/628#issuecomment-620273391

Vizvezdenec commented 4 years ago

now it launched

vondele commented 4 years ago

@ppigazzini @tomtor can we have another look at server load? Currently at 3601 machines 20887 cores.

Edit: at least the main page is pretty reactive, seems to work well :+1:

and so far still good at 4803 machines 29678 cores

noobpwnftw commented 4 years ago

Looks like this issue is resolved. :)

xoto10 commented 4 years ago

Haha, let's see what happens when the workload dries up :)

ppigazzini commented 4 years ago

top - 19:00:43 up 4 days, 41 min,  2 users,  load average: 1.51, 1.25, 1.27
Tasks:  33 total,   1 running,  32 sleeping,   0 stopped,   0 zombie
%Cpu0  :  21.3/1.3    23[||||||||||||||||||||||                                                                              ]
%Cpu1  :  12.0/1.0    13[|||||||||||||                                                                                       ]
%Cpu2  :  26.2/1.7    28[||||||||||||||||||||||||||||                                                                        ]
%Cpu3  :  28.6/2.3    31[|||||||||||||||||||||||||||||||                                                                     ]
GiB Mem : 67.7/5.000    [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                                ]
GiB Swap:  0.0/0.000    [                                                                                                    ]

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20      224860   2876   1572 S        0.1   0:02.24 init -z
    2 root      20                           S                       `- [kthreadd/1988]
    3 root      20                           S                           `- [khelper]
   57 root      20      236964  30224  27248 S        0.6   0:56.06  `- /lib/systemd/systemd-journald
  169 root      20       42092    272    228 S        0.0   0:00.29  `- /lib/systemd/systemd-udevd
  185 syslog    20      189016   1464    380 S        0.0   0:09.72  `- /usr/sbin/rsyslogd -n
  186 root      20       62120    792    676 S        0.0   0:00.44  `- /lib/systemd/systemd-logind
  192 root      20       28344    580    444 S        0.0   0:01.20  `- /usr/sbin/cron -f
  194 message+  20       47620    828    532 S        0.0   0:00.17  `- /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
  203 systemd+  20       71848    496    436 S        0.0   0:00.36  `- /lib/systemd/systemd-networkd
  327 mongodb   20     3302980 1.672g   5416 S   5.6 33.4 320:51.01  `- /usr/bin/mongod --config /etc/mongod.conf
  335 root      20      184296   1712   1712 S        0.0   0:00.10  `- /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
  563 root      20       72288    592    484 S        0.0   0:00.02  `- /usr/sbin/sshd -D
 3151 root      20       97180   1404    488 S        0.0   0:00.01      `- sshd: fishtest [priv]
 3166 fishtest  20       97180   1116    248 S        0.0   0:00.09          `- sshd: fishtest@pts/1
 3167 fishtest  20       21432   1340    340 S        0.0   0:00.09              `- -bash
13090 root      20       97180   1724    780 S        0.0   0:00.02      `- sshd: fishtest [priv]
13111 fishtest  20       97180   1592    640 S        0.0   0:00.21          `- sshd: fishtest@pts/0
13112 fishtest  20       19528   1864    416 S        0.0   0:00.03              `- -bash
13124 fishtest  20       36752   1284    680 R   0.3  0.0   0:04.68                  `- top
  574 root      20       24176    336    324 S        0.0            `- /usr/sbin/xinetd -pidfile /run/xinetd.pid -stayalive -inetd_compat -inetd_ipv6
  580 Debian-+  20       59180    508    412 S        0.0   0:00.86  `- /usr/sbin/exim4 -bd -q30m
  582 root      20      100972    368    260 S        0.0            `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
  583 root      20      100972    112        S        0.0                `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
  585 root      20      142260    680     68 S        0.0            `- nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
  586 www-data  20      147240   7984   2344 S   6.0  0.2  68:33.27      `- nginx: worker process
  587 www-data  20      143676   3712   2188 S        0.1   0:57.90      `- nginx: worker process
  588 www-data  20      143348   2896   1692 S        0.1   0:00.11      `- nginx: worker process
  589 www-data  20      143256   2812   1760 S        0.1   0:00.03      `- nginx: worker process
 1663 fishtest  20     1895208 682636   4764 S   6.6 13.0   6:12.02  `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6544
 1666 fishtest  20     2492668 1.026g   5128 S  78.7 20.5  77:53.24  `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6543
14754 root      20       13008    844    712 S        0.0            `- /sbin/agetty -o -p -- \u --noclear tty2 linux
14755 root      20       13008    848    716 S        0.0            `- /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 linux
xoto10 commented 4 years ago

I should say though - very impressive test throughput. Nice work guys!

ppigazzini commented 4 years ago

Looks like this issue is resolved. :)

@noobpwnftw not yet, we are here for the 40k :)

ppigazzini commented 2 years ago

IMO fishtest should now be able to pass the stress test; unfortunately we lack the 40k cores :(

garry-ut99 commented 4 months ago

Currently 52000+ cores at https://tests.stockfishchess.org/tests:

ppigazzini commented 4 months ago

A few days ago...

garry-ut99 commented 4 months ago

Wow, it's like connecting to a supercomputer.