I'm 'moderately enthusiastic'
From the server maintenance POV I'm 'moderately skeptical' about pypy3, after spending a whole day trying to get a working pypy3/numpy/scipy setup. There is very little information; I tried 4 distributions.
I'm also not happy to see that pypy3 requires code refactoring to avoid a slowdown.
ps:
http://packages.pypy.org/#scipy
this page is simply missing libopenblas-dev (or another BLAS package)
Ok, let's no longer pay attention to pypy.
My vote would be to focus on the worker batching now, which could be a small code change.
I'm 'moderately enthusiastic'
I'm 'moderately skeptical'
could mean exactly the same thing ;-)
When fishtest has a lot of extra cores (thanks noob!) and not many jobs, I think it struggles a bit when a job finishes and 1000+ cores need to be reallocated. At other times, if I'm around and my jobs are at LLR < -2 or -2.5 while fishtest is busy, I tend to cut their throughput in half to let more promising jobs get some extra cores. Perhaps we could modify the itp calculation to increase the drop in cores when jobs get below -2 or so?
e.g.
if llr < -2:
    itp *= 1 + (2 + llr) / 2
(perhaps only if games played > 20000?)
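A minimal runnable sketch of that idea (itp and llr named as above; the games-played guard is the optional condition from the parenthetical):

def adjust_itp(itp, llr, games_played):
    # Proposed tweak: shrink a run's throughput once its LLR drops below -2,
    # freeing cores for more promising tests. The factor is 1.0 at llr = -2,
    # 0.5 at llr = -3, and 0 at llr = -4 and below.
    if llr < -2 and games_played > 20000:
        itp *= max(0.0, 1 + (2 + llr) / 2)
    return itp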
[ I just saw DragonMist66 comment on fishtest being down, and I saw the "Bad Gateway" error earlier. I don't think it was down, I suspect it was just very busy for a short time ]
@xoto10 there are quite regular maintenance updates right now to add new features, see https://github.com/glinscott/fishtest/pull/598#issuecomment-614579906
There are like 3-5 new patches merged every day on a running system; it is amazing that downtimes are so short. Excellent work by @ppigazzini, @linrock and @tomtor
EDIT
Merge process and fishtest downtime:
systemctl restart fishtest@{6543..6544}
systemctl stop fishtest@ ; sleep 5 ; git clone ... ; pip install -e . ; systemctl start fishtest@
Made it to about 20K cores, looks stable, do you want to push it a little bit?
Still feels responsive indeed. Maybe @ppigazzini can post a top?
It is around 70%....
not bad. We do have >50% of the cores assigned to LTC tests, so that makes it a bit easier.
Pushed close by accident on my phone. Yes, LTC reduces load a lot.
looking good! all we need is another 2x speedup and we'll be at 40k cores :)
50% < CPU core < 85%
top - 23:47:33 up 12 days, 7:14, 2 users, load average: 1.07, 1.29, 1.23
Tasks: 33 total, 1 running, 32 sleeping, 0 stopped, 0 zombie
%Cpu0 : 10.0/1.0 11[||||||||||| ]
%Cpu1 : 13.7/1.0 15[||||||||||||||| ]
%Cpu2 : 18.9/1.0 20[|||||||||||||||||||| ]
%Cpu3 : 10.3/1.0 11[||||||||||| ]
KiB Mem : 65.5/5242880 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
KiB Swap: 0.0/0 [ ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 224908 2316 1000 S 0.0 0:07.28 init -z
2 root 20 S `- [kthreadd/1988]
3 root 20 S `- [khelper]
65 root 20 604284 27676 20376 S 0.5 9:48.71 `- /lib/systemd/systemd-journald
167 root 20 42092 428 376 S 0.0 0:00.91 `- /lib/systemd/systemd-udevd
170 systemd+ 20 71848 412 348 S 0.0 0:01.07 `- /lib/systemd/systemd-networkd
295 root 20 62100 720 548 S 0.0 0:01.44 `- /lib/systemd/systemd-logind
296 message+ 20 47724 628 300 S 0.0 0:00.58 `- /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
300 syslog 20 189016 1580 356 S 0.0 3:12.49 `- /usr/sbin/rsyslogd -n
301 root 20 184296 308 272 S 0.0 0:00.11 `- /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
305 mongodb 20 3448480 1.777g 5400 S 1.7 35.5 1410:54 `- /usr/bin/mongod --config /etc/mongod.conf
314 root 20 72288 484 372 S 0.0 0:00.12 `- /usr/sbin/sshd -D
13714 root 20 97180 568 416 S 0.0 `- sshd: fishtest [priv]
13729 fishtest 20 97180 476 224 S 0.0 0:00.53 `- sshd: fishtest@pts/0
13730 fishtest 20 20480 1252 852 S 0.0 0:00.11 `- -bash
25680 root 20 97180 2096 1156 S 0.0 0:00.02 `- sshd: fishtest [priv]
25707 fishtest 20 97180 1600 652 S 0.0 0:00.27 `- sshd: fishtest@pts/1
25708 fishtest 20 19660 2352 764 S 0.0 0:00.04 `- -bash
30111 fishtest 20 36752 1644 1036 R 0.0 0:03.08 `- top
581 root 20 24176 288 284 S 0.0 `- /usr/sbin/xinetd -pidfile /run/xinetd.pid -stayalive -inetd_compat -inetd_ipv6
594 Debian-+ 20 59184 516 404 S 0.0 0:01.02 `- /usr/sbin/exim4 -bd -q30m
597 root 20 100972 272 244 S 0.0 `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
598 root 20 100972 28 S 0.0 `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
619 root 20 142520 2576 1104 S 0.0 0:00.03 `- nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
31264 www-data 20 147840 8720 2400 S 5.0 0.2 70:19.62 `- nginx: worker process
31265 www-data 20 143716 4464 2248 S 0.7 0.1 3:22.87 `- nginx: worker process
31266 www-data 20 143320 4204 2132 S 0.1 0:01.77 `- nginx: worker process
31267 www-data 20 142864 3448 1652 S 0.1 0:00.06 `- nginx: worker process
16839 root 20 28344 744 460 S 0.0 0:00.44 `- /usr/sbin/cron -f
2705 fishtest 20 2032992 547004 4564 S 10.4 8:22.20 `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6544
2709 fishtest 20 2344096 981876 4948 S 52.3 18.7 293:19.73 `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6543
32003 root 20 13008 840 712 S 0.0 `- /sbin/agetty -o -p -- \u --noclear tty2 linux
32005 root 20 13008 848 716 S 0.0 `- /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 linux
In fact, to support more throughput we might want to use centralized build servers and so on. But I'd say with the current test submission rate, 20k cores are more than enough. We can queue tests overnight and have the fleet finish them by the next day.
I loaded the DB from the start of the 20k run (hourly backup) onto DEV:
Let's be honest @noobpwnftw, we can't keep 20k cores loaded on a stable basis :) I would need to just write patch after patch for this... And not only me, but 3-4 people. We simply don't have this number of devs :D
I'm not trying to be offensive and call all you guys' work useless, just saying that the server is running fluidly on 20k cores and we probably won't need more in any foreseeable future - as soon as 2 of the people currently online go to sleep, the queue will go idle in no time.
Maybe there are good ways to make use of the extra cores? Examples:
Well, scheduling tests requires actually writing code for it :) It's not easy to beat the speed at which tests complete when 20k cores are running - it's hard to produce meaningful ideas at this tempo. About LTCs - usually people submit other people's LTCs if the queue gets close to idling. Something can surely be done about getting more devs, but I don't know what. Maybe we could start some big random tuning just to pour computing power into it, but I have no clue how to do tuning, so it's not up to me.
Seems to me like good timing to raise the STC: ample resources, better correlation, less scaling suppression, higher quality chess, less server CPU load.
Extra cores will only appear when there is a long queue with meaningful tests to run. :)
yes, it is important to realize that this level of resources is exceptional. From experience, having 'near interactive' feedback on patch ideas really does help to come up with good stuff, but there is definitely no need to 'just fill the queues', be it with random tests or excessive changes in protocols.
I agree with @linrock that we should make it easy for new devs to join and keep the development and testing process smooth... good ideas and an active community are what we need to foster.
With 1-2 orders of magnitude increase in resources, we might also be able to do completely new stuff, but that needs some thinking and experimenting. I do have some ideas there, but for prototyping fishtest might not be the best environment.
I'm sorry this is such a long message :)
If I may, I would like to add two points in response to @Vizvezdenec:
Well, scheduling tests requires actually writing code for it :) It's not easy to beat the speed at which tests complete when 20k cores are running - it's hard to produce meaningful ideas at this tempo. .... Something can surely be done about getting more devs, but I don't know what.
First, I don't think there's actually anything wrong with letting the queue go empty, in principle. Yes, there's a technical issue of the workers all reconnecting at once when the next test is submitted, but we can fix that. We Stockfish developers are the only people I know who, while waiting on four other people in a queue, worry that there are too few people waiting ahead of us and wish there were more :)
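One conceivable fix, as a hypothetical sketch (not the actual worker code): give each idle worker a randomized, growing delay before it polls again, so a freshly submitted test is picked up gradually instead of by every worker at once.

import random
import time

def wait_before_retry(attempt, base=10.0, cap=300.0):
    # Exponential backoff with full jitter: sleep a random amount between
    # 0 and min(cap, base * 2**attempt) seconds, desynchronizing the fleet.
    time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))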
Yes, it might seem that if the queue goes empty, the workers' CPU time is wasted. But I would argue a fairly unusual idea: in terms of the long-term trajectory of the Stockfish project, there is no such thing as a wasted CPU-hour on fishtest.
We've already seen in recent days how an excess of worker supply has produced an exceptional number of quality tests, and a rate of Elo progress unprecedented over years of Stockfish development. Having the queue close-to-empty provokes the submission of far more tests than otherwise, and some of these succeed. But there's more than this....
Before I joined the Stockfish developer community a little over 2 years ago, I watched quietly for a long time but never submitted a test. I was finally moved when I saw the framework going empty night after night--I believe this was shortly after @noobpwnftw began his donations? One month later I found my first Elo-gainer, and I was here to stay.
I know I'm not the only one like this. In response to your point,
Something can surely be done about getting more devs, but I don't know what.
The supply of cores itself is a great approach. Over the past few years, it has certainly seemed to me that new developers tend to appear and join our team precisely when there is plenty of room for them on the framework. Even when the queue goes empty, every CPU-hour is put to good use supporting the long-term health of the Stockfish project.
In summary: more cores means more tests from our current developers, and more developers too!
fishtest is dead?
Went to sleep with 10k cores running about 40 tests, so I guess after a while the queue went dry and load spiked?
yes, trying to get the right flag for the right IP :-) (#611)
@linrock @ppigazzini The 'show machines' button still causes this when 10000 cores are active:
Apr 21 14:57:04 tests pserve[2489]: 2020-04-21 14:57:04,292 WARNI [waitress.queue][MainThread] Task queue depth is 21
Generating/sending this long list is expensive, we will have to look at this...
BTW, with the arrival of 4-core workers, the maximum number of cores assigned to one test is about 4000 (250000 games / test, 250 games / task -> 1000 machines, 4000 cores). This could be another thing to consider if we adjust max games / test and task size.
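The arithmetic spelled out (the numbers are those quoted above, not read from the code):

max_games_per_test = 250_000
games_per_task = 250
cores_per_worker = 4

tasks = max_games_per_test // games_per_task  # 1000 tasks, one per machine
max_cores = tasks * cores_per_worker          # 4000 cores on a single test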
yeah, generating the machines list is very expensive since it iterates over all tasks of all unfinished runs, and then it has to render a long list, so performance gets worse as the number of worker cores goes up.
probably better to continuously generate a cached machines list in a long-running process every few seconds so that homepage requests don't clog up the webserver queue. slow requests will back up the waitress queue, so ideally any expensive calculations don't happen during the web request/response cycle and get calculated outside of pserve.
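A minimal sketch of that idea, with a hypothetical build_machines_list() standing in for the expensive iteration over all tasks of all unfinished runs:

import threading
import time

def build_machines_list():
    # Hypothetical stand-in for the expensive computation.
    return []

_machines_cache = []

def _refresh_loop(interval=5):
    # Rebuild the list every few seconds in a background thread,
    # so no web request ever pays for the full iteration.
    global _machines_cache
    while True:
        _machines_cache = build_machines_list()
        time.sleep(interval)

threading.Thread(target=_refresh_loop, daemon=True).start()

def machines_view():
    # The handler just returns the latest precomputed snapshot.
    return _machines_cache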
Fetching a task is slow & times out (with only 1500 workers).
Worker version 73 connecting to http://tests.stockfishchess.org:80
Fetch task...
Worker version checked successfully in 0.376313s
Task requested in 5.694125s
No tasks available at this time, waiting...
Fetch task...
Worker version checked successfully in 0.842559s
Task requested in 5.293982s
No tasks available at this time, waiting...
Fetch task...
Worker version checked successfully in 8.059501s
Task requested in 9.499962s
No tasks available at this time, waiting...
Fetch task...
Exception accessing host:
HTTPConnectionPool(host='tests.stockfishchess.org', port=80): Read timed out. (read timeout=15.0)
Looks like a lot of flags are missing; might want to try this: https://pythonhosted.org/python-geoip/
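From its docs, usage would look roughly like this (requires the companion python-geoip-geolite2 package for the bundled database; country_flag_code is a hypothetical helper):

from geoip import geolite2

def country_flag_code(ip):
    # Map a worker's IP to an ISO country code for the flag icon;
    # lookup() returns None for unknown/private addresses.
    match = geolite2.lookup(ip)
    return match.country if match is not None else None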
The geoip service is offline :( https://freegeoip.app/json/
Question: do tuning tests stress the server harder than ordinary SPRT tests? I seem to remember we thought so a year or two ago when there were server issues. I notice there have been quite a few tuning runs recently, and just now there is one running at 5+0.05. Are these a concern?
Question: do tuning tests stress the server harder than ordinary SPRT tests?
Yes, they do, because the tuning parameters have to be transmitted and recomputed. In addition, they have to be stored to draw the pretty graphs. More parameters increase the stress.
Edit: See also #586
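To illustrate the per-task overhead, a hypothetical sketch of an SPSA-style perturbation (not fishtest's actual tuning code): every task gets its own perturbed parameter vectors, so both the computation and the payload sent to the worker grow with the number of parameters.

import random

def spsa_perturb(theta, c):
    # Build the two perturbed parameter vectors for one task.
    delta = [random.choice((-1, 1)) for _ in theta]
    plus = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    return plus, minus, delta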
My worker can't fetch any tasks (it says no tasks available) while fishtest is clearly not empty.
@Vizvezdenec try now, I stopped a test with a bad time control format https://github.com/glinscott/fishtest/pull/628#issuecomment-620273391
now it launched
@ppigazzini @tomtor can we have another look at server load? Currently at 3601 machines 20887 cores.
Edit: at least the main page is pretty reactive, seems to work well :+1:
and so far still good at 4803 machines 29678 cores
Looks like this issue is resolved. :)
Haha, let's see what happens when the workload dries up :)
top - 19:00:43 up 4 days, 41 min, 2 users, load average: 1.51, 1.25, 1.27
Tasks: 33 total, 1 running, 32 sleeping, 0 stopped, 0 zombie
%Cpu0 : 21.3/1.3 23[|||||||||||||||||||||| ]
%Cpu1 : 12.0/1.0 13[||||||||||||| ]
%Cpu2 : 26.2/1.7 28[|||||||||||||||||||||||||||| ]
%Cpu3 : 28.6/2.3 31[||||||||||||||||||||||||||||||| ]
GiB Mem : 67.7/5.000 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
GiB Swap: 0.0/0.000 [ ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 224860 2876 1572 S 0.1 0:02.24 init -z
2 root 20 S `- [kthreadd/1988]
3 root 20 S `- [khelper]
57 root 20 236964 30224 27248 S 0.6 0:56.06 `- /lib/systemd/systemd-journald
169 root 20 42092 272 228 S 0.0 0:00.29 `- /lib/systemd/systemd-udevd
185 syslog 20 189016 1464 380 S 0.0 0:09.72 `- /usr/sbin/rsyslogd -n
186 root 20 62120 792 676 S 0.0 0:00.44 `- /lib/systemd/systemd-logind
192 root 20 28344 580 444 S 0.0 0:01.20 `- /usr/sbin/cron -f
194 message+ 20 47620 828 532 S 0.0 0:00.17 `- /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
203 systemd+ 20 71848 496 436 S 0.0 0:00.36 `- /lib/systemd/systemd-networkd
327 mongodb 20 3302980 1.672g 5416 S 5.6 33.4 320:51.01 `- /usr/bin/mongod --config /etc/mongod.conf
335 root 20 184296 1712 1712 S 0.0 0:00.10 `- /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
563 root 20 72288 592 484 S 0.0 0:00.02 `- /usr/sbin/sshd -D
3151 root 20 97180 1404 488 S 0.0 0:00.01 `- sshd: fishtest [priv]
3166 fishtest 20 97180 1116 248 S 0.0 0:00.09 `- sshd: fishtest@pts/1
3167 fishtest 20 21432 1340 340 S 0.0 0:00.09 `- -bash
13090 root 20 97180 1724 780 S 0.0 0:00.02 `- sshd: fishtest [priv]
13111 fishtest 20 97180 1592 640 S 0.0 0:00.21 `- sshd: fishtest@pts/0
13112 fishtest 20 19528 1864 416 S 0.0 0:00.03 `- -bash
13124 fishtest 20 36752 1284 680 R 0.3 0.0 0:04.68 `- top
574 root 20 24176 336 324 S 0.0 `- /usr/sbin/xinetd -pidfile /run/xinetd.pid -stayalive -inetd_compat -inetd_ipv6
580 Debian-+ 20 59180 508 412 S 0.0 0:00.86 `- /usr/sbin/exim4 -bd -q30m
582 root 20 100972 368 260 S 0.0 `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
583 root 20 100972 112 S 0.0 `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
585 root 20 142260 680 68 S 0.0 `- nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
586 www-data 20 147240 7984 2344 S 6.0 0.2 68:33.27 `- nginx: worker process
587 www-data 20 143676 3712 2188 S 0.1 0:57.90 `- nginx: worker process
588 www-data 20 143348 2896 1692 S 0.1 0:00.11 `- nginx: worker process
589 www-data 20 143256 2812 1760 S 0.1 0:00.03 `- nginx: worker process
1663 fishtest 20 1895208 682636 4764 S 6.6 13.0 6:12.02 `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6544
1666 fishtest 20 2492668 1.026g 5128 S 78.7 20.5 77:53.24 `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6543
14754 root 20 13008 844 712 S 0.0 `- /sbin/agetty -o -p -- \u --noclear tty2 linux
14755 root 20 13008 848 716 S 0.0 `- /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 linux
I should say though - very impressive test throughput. Nice work guys !
Looks like this issue is resolved. :)
@noobpwnftw not yet, we are here for the 40k :)
IMO fishtest should now be able to pass the stress test; unfortunately we lack the 40k cores :(
Currently 52000+ cores at https://tests.stockfishchess.org/tests :
A few days ago...
Wow, it's like connecting to a supercomputer.
@ppigazzini @tomtor right now might be a good moment to gather some statistics on fishtest performance under a 40000-core load. It seems more or less OK, but suffering a bit.