top - 21:54:13 up 40 days, 21:37, 1 user, load average: 2.05, 2.04, 1.97
Tasks: 31 total, 2 running, 29 sleeping, 0 stopped, 0 zombie
%Cpu0 : 34.2/1.7 36[|||||||||||||||||||||||||||||||||||| ]
%Cpu1 : 39.1/2.0 41[||||||||||||||||||||||||||||||||||||||||| ]
%Cpu2 : 23.0/1.3 24[|||||||||||||||||||||||| ]
%Cpu3 : 33.2/1.3 35[|||||||||||||||||||||||||||||||||| ]
GiB Mem : 67.1/5.000 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
GiB Swap: 0.0/0.000 [ ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 225020 3120 1584 S 0.1 0:20.24 init -z
2 root 20 S `- [kthreadd/1988]
3 root 20 S `- [khelper]
57 root 20 346764 189008 178700 S 1.0 3.6 4:12.98 `- /lib/systemd/systemd-journald
167 root 20 42092 472 424 S 0.0 0:03.27 `- /lib/systemd/systemd-udevd
172 systemd+ 20 71848 404 336 S 0.0 0:03.79 `- /lib/systemd/systemd-networkd
293 root 20 28344 536 436 S 0.0 0:10.87 `- /usr/sbin/cron -f
307 message+ 20 47748 1060 628 S 0.0 0:01.70 `- /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
310 root 20 62144 792 616 S 0.0 0:04.77 `- /lib/systemd/systemd-logind
315 syslog 20 189016 1624 444 S 0.3 0.0 0:26.64 `- /usr/sbin/rsyslogd -n
318 root 20 72288 616 504 S 0.0 0:00.04 `- /usr/sbin/sshd -D
28685 root 20 97180 4092 3148 S 0.1 0:00.01 `- sshd: fishtest [priv]
28709 fishtest 20 97180 1872 928 S 0.0 `- sshd: fishtest@pts/0
28710 fishtest 20 18632 2128 1580 S 0.0 `- -bash
28726 fishtest 20 36752 1936 1380 R 0.0 0:00.13 `- top
334 root 20 184296 1576 1576 S 0.0 0:00.10 `- /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
356 root 20 24176 300 296 S 0.0 `- /usr/sbin/xinetd -pidfile /run/xinetd.pid -stayalive -inetd_compat -inetd_ipv6
573 root 20 100972 240 236 S 0.0 `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
574 root 20 100972 4 S 0.0 `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
583 Debian-+ 20 59180 500 404 S 0.0 0:02.90 `- /usr/sbin/exim4 -bd -q30m
18349 root 20 142520 1528 944 S 0.0 0:00.01 `- nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
520 www-data 20 147488 8224 2428 R 4.3 0.2 433:45.79 `- nginx: worker process
521 www-data 20 144872 5172 2360 S 0.1 11:23.62 `- nginx: worker process
522 www-data 20 143724 3576 2220 S 0.1 0:01.50 `- nginx: worker process
523 www-data 20 143344 3204 2076 S 0.1 0:00.37 `- nginx: worker process
22838 uuidd 20 25192 188 188 S 0.0 `- /usr/sbin/uuidd --socket-activation
21164 mongodb 20 3385516 1.672g 6064 S 30.9 33.4 201:36.49 `- /usr/bin/mongod --config /etc/mongod.conf
21206 fishtest 20 2472712 1.092g 5172 S 99.3 21.8 911:46.03 `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6543
21207 fishtest 20 2035064 575920 4216 S 3.0 11.0 21:56.88 `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6544
24529 root 20 13008 840 712 S 0.0 `- /sbin/agetty -o -p -- \u --noclear tty2 linux
24531 root 20 13008 844 716 S 0.0 `- /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 linux
So this one is at 100% CPU:
21206 fishtest 20 2472712 1.092g 5172 S 99.3 21.8 911:46.03 `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6543
upstream backend_tests {
server 127.0.0.1:6543;
}
upstream backend_all {
server 127.0.0.1:6544;
}
server {
listen 80;
listen [::]:80;
server_name tests.stockfishchess.org;
location / {
return 301 https://$host$request_uri;
}
location /api {
proxy_pass http://backend_tests;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Host $host:$server_port;
proxy_set_header X-Forwarded-Port $server_port;
client_max_body_size 10m;
client_body_buffer_size 128k;
proxy_connect_timeout 60s;
proxy_send_timeout 90s;
proxy_read_timeout 90s;
proxy_buffering off;
proxy_temp_file_write_size 64k;
proxy_redirect off;
location ~ ^/api/(active_runs|download_pgn|download_pgn_100|request_version|upload_pgn) {
proxy_pass http://backend_all;
}
}
}
server {
listen 443 ssl http2;
listen [::]:443 ssl http2 ipv6only=on;
ssl_certificate /etc/letsencrypt/live/tests.stockfishchess.org/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/tests.stockfishchess.org/privkey.pem;
include /etc/letsencrypt/options-ssl-nginx.conf;
ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem;
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
server_name tests.stockfishchess.org;
location ~ ^/(css|html|img|js|favicon.ico|robots.txt) {
root /home/fishtest/fishtest/fishtest/fishtest/static;
expires 1y;
add_header Cache-Control public;
access_log off;
}
location / {
proxy_pass http://backend_all;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Host $host:$server_port;
proxy_set_header X-Forwarded-Port $server_port;
client_max_body_size 10m;
client_body_buffer_size 128k;
proxy_connect_timeout 60s;
proxy_send_timeout 90s;
proxy_read_timeout 90s;
proxy_buffering off;
proxy_temp_file_write_size 64k;
proxy_redirect off;
location ~ ^/api/(active_runs|download_pgn|download_pgn_100|request_version|upload_pgn) {
proxy_pass http://backend_all;
}
location /api/ {
proxy_pass http://backend_tests;
}
location /tests/user/ {
proxy_pass http://backend_all;
}
location /tests {
proxy_pass http://backend_tests;
}
}
}
@vondele the APIs served by backend_tests
are stateful, so we cannot (easily) add other backend processes.
I guess we just have to get some passes for these STC patches and turn them into LTC, and things will ease on the server :-)
Having said that, that is then the single-process bottleneck. I guess there is not much low-hanging fruit there.
Is it possible to load balance by test? Different tests should not need to share a global lock.
Reading https://github.com/glinscott/fishtest/issues/133#issuecomment-526931028, perhaps we could try to use nginx hash
to load-balance the test ids over some processes:
http://nginx.org/en/docs/http/ngx_http_upstream_module.html#hash
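For illustration, a minimal sketch of what that could look like, assuming the worker passes the test id as a query argument; the upstream name, the extra port, and the $arg_run_id key are hypothetical, not the current setup:
upstream backend_hashed {
    # route all requests for the same test id to the same pserve instance
    hash $arg_run_id consistent;
    server 127.0.0.1:6543;
    server 127.0.0.1:6545;
}
With consistent (ketama) hashing, adding or removing a pserve instance would only remap a fraction of the test ids.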
After turning off 2000 new workers, it seems more stable now. Currently there are 1748 new workers running, in addition to the regular ones. New workers have 4 cores each.
@ppigazzini can you post another top ?
top - 22:39:08 up 40 days, 22:22, 1 user, load average: 2.09, 1.83, 1.86
Tasks: 31 total, 2 running, 29 sleeping, 0 stopped, 0 zombie
%Cpu0 : 41.3/3.3 45[|||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu1 : 10.3/1.0 11[||||||||||| ]
%Cpu2 : 48.7/2.0 51[||||||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu3 : 7.0/0.7 8[|||||||| ]
GiB Mem : 69.6/5.000 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
GiB Swap: 0.0/0.000 [ ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 225020 3176 1640 S 0.1 0:20.26 init -z
2 root 20 S `- [kthreadd/1988]
3 root 20 S `- [khelper]
57 root 20 203012 80348 70032 S 0.3 1.5 4:53.95 `- /lib/systemd/systemd-journald
167 root 20 42092 416 368 S 0.0 0:03.29 `- /lib/systemd/systemd-udevd
172 systemd+ 20 71848 312 244 S 0.0 0:03.80 `- /lib/systemd/systemd-networkd
293 root 20 28344 536 436 S 0.0 0:10.88 `- /usr/sbin/cron -f
307 message+ 20 47748 1084 652 S 0.0 0:01.70 `- /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
310 root 20 62144 896 720 S 0.0 0:04.78 `- /lib/systemd/systemd-logind
315 syslog 20 189016 1784 596 S 0.0 0:41.38 `- /usr/sbin/rsyslogd -n
318 root 20 72288 544 432 S 0.0 0:00.04 `- /usr/sbin/sshd -D
28685 root 20 97180 1624 680 S 0.0 0:00.01 `- sshd: fishtest [priv]
28709 fishtest 20 97180 1376 432 S 0.0 0:00.18 `- sshd: fishtest@pts/0
28710 fishtest 20 18632 1380 800 S 0.0 `- -bash
1697 fishtest 20 36620 1824 1280 R 0.3 0.0 0:00.04 `- top
334 root 20 184296 1576 1576 S 0.0 0:00.10 `- /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
356 root 20 24176 300 296 S 0.0 `- /usr/sbin/xinetd -pidfile /run/xinetd.pid -stayalive -inetd_compat -inetd_ipv6
573 root 20 100972 240 236 S 0.0 `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
574 root 20 100972 4 S 0.0 `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
583 Debian-+ 20 59180 500 404 S 0.0 0:02.90 `- /usr/sbin/exim4 -bd -q30m
18349 root 20 142520 1524 940 S 0.0 0:00.01 `- nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
520 www-data 20 147788 8420 2460 S 7.6 0.2 436:27.96 `- nginx: worker process
521 www-data 20 144876 5320 2392 S 0.3 0.1 11:53.64 `- nginx: worker process
522 www-data 20 143724 3564 2208 S 0.1 0:01.72 `- nginx: worker process
523 www-data 20 143344 3196 2068 S 0.1 0:00.41 `- nginx: worker process
22838 uuidd 20 25192 188 188 S 0.0 `- /usr/sbin/uuidd --socket-activation
21164 mongodb 20 3385516 1.670g 5816 S 6.0 33.4 215:12.52 `- /usr/bin/mongod --config /etc/mongod.conf
21206 fishtest 20 2583040 1.201g 5140 R 101.7 24.0 956:00.69 `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6543
21207 fishtest 20 2035320 593664 4064 S 1.3 11.3 23:34.29 `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6544
31562 root 20 13008 844 712 S 0.0 `- /sbin/agetty -o -p -- \u --noclear tty2 linux
31563 root 20 13008 848 716 S 0.0 `- /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 linux
Before noob turned off the 2000, my worker was producing the following:
Worker started in C:\fishtest-master\worker ...
Memory: 34287968256
Worker version 72 connecting to https://tests.stockfishchess.org:443
Fetch task...
Exception accessing host: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
And the fishtest page gives the following:
There is a tipping point for the current server; past that, the workers time out and the effect cascades. It looks like it is somewhere around 18K cores.
It seems down now. I turned off all new workers, but it does not recover.
Yeah server is dead now.
The VPS seems down, I wrote to @glinscott asking to check, finger crossed.
Probably reached a network limit to guard against DOS-attacks
Reading #133 (comment), perhaps we could try to use nginx hash to load-balance the test ids over some processes: http://nginx.org/en/docs/http/ngx_http_upstream_module.html#hash
See https://stackoverflow.com/questions/31994395/how-to-use-url-pathname-as-upstream-hash-in-nginx
We would have to add the test_id as an additional URI parameter (...?5e8909804411759d9d099a78) in the update call from the worker, and also to the live_elo scripts and other query pages. That would allow distributing worker updates/queries over pserve instances.
The problem remains handing out new jobs to worker requests: if a test terminates, the single central work allocator has to be notified of this. As a simple implementation it could poll MongoDB (which is updated when a test finishes) more frequently, as sketched below.
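A minimal sketch of that polling idea, assuming hypothetical collection and field names ("runs", "finished") rather than the actual fishtest schema:
import time
from pymongo import MongoClient

runs = MongoClient("mongodb://localhost:27017")["fishtest_new"]["runs"]
active_run_ids = set()

def refresh_active_runs():
    # re-read the set of unfinished runs, so the allocator stops handing
    # out tasks for tests that have terminated in the meantime
    global active_run_ids
    active_run_ids = {doc["_id"] for doc in runs.find({"finished": False}, {"_id": 1})}

while True:
    refresh_active_runs()
    time.sleep(5)  # a few seconds of staleness is acceptable for job allocation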
Is fishtest able to tell when the server is very busy?
It would be nice if we had some default unimportant (or just null) LTC job that we could give to workers if the server was overloaded, that would maybe ease the problems before it goes too far.
I've done this manually before, and could have done last night, but it didn't occur to me until too late.
It looks like we might have exceeded the CPU capacity.
I've rebooted the server, it should be coming back up now.
Logs
$ sudo journalctl -u fishtest@6543 -n 1000
Apr 05 02:05:55 tests.stockfishchess.org pserve[21206]: 2020-04-05 02:05:55,838 WARNI [waitress.queue][MainThread] Task queue depth is 427
Apr 05 02:05:55 tests.stockfishchess.org pserve[21206]: 2020-04-05 02:05:55,839 WARNI [waitress.queue][MainThread] Task queue depth is 428
Apr 05 02:05:55 tests.stockfishchess.org pserve[21206]: 2020-04-05 02:05:55,843 WARNI [waitress.queue][MainThread] Task queue depth is 428
Apr 05 02:05:55 tests.stockfishchess.org pserve[21206]: 2020-04-05 02:05:55,846 WARNI [waitress.queue][MainThread] Task queue depth is 429
Apr 05 02:05:55 tests.stockfishchess.org pserve[21206]: 2020-04-05 02:05:55,847 WARNI [waitress.queue][MainThread] Task queue depth is 430
Apr 05 02:05:55 tests.stockfishchess.org pserve[21206]: 2020-04-05 02:05:55,849 WARNI [waitress.queue][MainThread] Task queue depth is 431
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Starting server in PID 21206.
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Serving on http://localhost:6543
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: JojoM
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: tvijlbrief
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: ChessDBCN
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: 0x3C33
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: ChessDBCN
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: drabel
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: JojoM
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: yurikvelo
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: dew
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: ChessDBCN
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: flush
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: .......Traceback (most recent call last):
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/bin/pserve", line 8, in <module>
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: sys.exit(main())
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pyramid/scripts/pserve.py", line 34, in main
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: return command.run()
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pyramid/scripts/pserve.py", line 285, in run
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: server(app)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/paste/deploy/loadwsgi.py", line 195, in server_wrapper
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: **context.local_conf)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/paste/deploy/util.py", line 55, in fix_call
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: val = callable(*args, **kw)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/waitress/__init__.py", line 22, in serve_paste
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: serve(app, **kw)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/waitress/__init__.py", line 18, in serve
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: server.run()
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/waitress/server.py", line 332, in run
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: use_poll=self.adj.asyncore_use_poll,
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/waitress/wasyncore.py", line 247, in loop
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: poll_fun(timeout, map)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/waitress/wasyncore.py", line 174, in poll
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: r, w, e = select.select(r, w, e, timeout)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/fishtest/rundb.py", line 163, in exit_run
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: last_rundb.flush_all()
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/fishtest/rundb.py", line 220, in flush_all
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: self.runs.save(self.run_cache[r_id]['run'])
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/collection.py", line 3163, in save
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: collation=collation)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/collection.py", line 856, in _update_retryable
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: _update, session)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/mongo_client.py", line 1491, in _retryable_write
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: return self._retry_with_session(retryable, func, s, None)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/mongo_client.py", line 1384, in _retry_with_session
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: return func(session, sock_info, retryable)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/collection.py", line 852, in _update
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: retryable_write=retryable_write)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/collection.py", line 822, in _update
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: retryable_write=retryable_write).copy()
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/pool.py", line 618, in command
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: self._raise_connection_failure(error)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/pool.py", line 613, in command
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: user_fields=user_fields)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/network.py", line 167, in command
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: parse_write_concern_error=parse_write_concern_error)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/helpers.py", line 136, in _check_command_response
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: raise NotMasterError(errmsg, response)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: pymongo.errors.NotMasterError: interrupted at shutdown
$ sudo journalctl -u fishtest@6544 -n 1000
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,038 WARNI [waitress.queue][MainThread] Task queue depth is 5
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,038 WARNI [waitress.queue][MainThread] Task queue depth is 6
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,038 WARNI [waitress.queue][MainThread] Task queue depth is 7
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,039 WARNI [waitress.queue][MainThread] Task queue depth is 8
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,039 WARNI [waitress.queue][MainThread] Task queue depth is 9
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,040 WARNI [waitress.queue][MainThread] Task queue depth is 10
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,040 WARNI [waitress.queue][MainThread] Task queue depth is 11
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,041 WARNI [waitress.queue][MainThread] Task queue depth is 12
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,041 WARNI [waitress.queue][MainThread] Task queue depth is 13
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,042 WARNI [waitress.queue][MainThread] Task queue depth is 14
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,042 WARNI [waitress.queue][MainThread] Task queue depth is 15
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,043 WARNI [waitress.queue][MainThread] Task queue depth is 16
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,046 WARNI [waitress.queue][MainThread] Task queue depth is 17
Apr 05 02:05:55 tests.stockfishchess.org systemd[1]: Stopping Fishtest Server port 6544...
Apr 05 02:05:55 tests.stockfishchess.org pserve[21207]: Invalid login: "KatyaTal" "69#dzbHot2J"
Apr 05 02:05:55 tests.stockfishchess.org pserve[21207]: Invalid login: "Craigbob" "k9A1%r1vpdU"
Apr 05 02:05:55 tests.stockfishchess.org pserve[21207]: Starting server in PID 21207.
Apr 05 02:05:55 tests.stockfishchess.org pserve[21207]: Serving on http://localhost:6544
Apr 05 02:05:55 tests.stockfishchess.org pserve[21207]: ['missing-input-response']
Apr 05 02:05:55 tests.stockfishchess.org pserve[21207]: flush
$ sudo journalctl -u mongod -n 100
Mar 02 17:56:54 tests.stockfishchess.org systemd[1]: Started MongoDB Database Server.
Apr 01 03:01:00 tests.stockfishchess.org systemd[1]: Stopping MongoDB Database Server...
Apr 01 03:01:11 tests.stockfishchess.org systemd[1]: Stopped MongoDB Database Server.
Apr 01 03:01:11 tests.stockfishchess.org systemd[1]: Started MongoDB Database Server.
Apr 05 02:05:55 tests.stockfishchess.org systemd[1]: Stopping MongoDB Database Server...
Apr 05 02:06:36 tests.stockfishchess.org systemd[1]: Stopped MongoDB Database Server.
Apr 05 16:33:36 tests.stockfishchess.org systemd[1]: Started MongoDB Database Server.
CPU load seems to get very high every hour or so. @ppigazzini could that be a regular backup (it coincides with outgoing traffic), possibly even a multithreaded one? If so, we could try to reduce the number of threads for that process and/or renice/cpulimit it?
It's the hourly mongodb backup.
On DEV:
mongodump is multithreaded: for the first 5 seconds CPU is at 270%, then 180%.
gzip is single-threaded: CPU 98%.
Tried:
nice has no effect, because 3 CPU cores are totally unloaded.
cpulimit -l 100 -- mongodump ... is able to keep 70% < CPU < 110%, so that seems one way of keeping this under control. Maybe even -l 50. Eventually we have to make sure this backup is finished before the next one starts... I don't know how we have this scheduled.
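One common way to guarantee that the backup never overlaps with the next run would be to serialize the cron job with flock; a hedged sketch (the lock-file path is arbitrary, and this is not what is currently deployed):
3 * * * * /usr/bin/flock -n /tmp/fishtest_backup.lock sh ${UPATH}/backup.sh
With -n, a new invocation exits immediately if the previous backup still holds the lock, instead of queuing up behind it.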
Would a backup every 3 hours combined with "-l 50" be sufficient?
I think that would be OK.... BTW, I don't know mongodb, is the db locked during this dumping time?
No it remains writable/queryable
so I think we should go to cpulimit -l 50 -- mongodump ... and do it every 3h.
I'm playing a bit with the systemd config for cron.service https://askubuntu.com/questions/71211/how-to-use-cpulimit-on-all-cron-jobs This is not able to limit the CPU usage when the cores are unloaded, though.
cron tasks, in theory, will not execute if the previous one is still in progress. VPS went down after a while operating normally with reduced workers, so it should be hitting other limits.
How about locking the mongo database with an array of locks mod task id, instead of one?
@noobpwnftw I do not think we have a MongoDb bottleneck.
No we don't, but the Python server code takes one global lock for updating all tasks. I was suggesting that we could use a more granular locking scheme there.
I am pretty sure we already have a lock per test, stored in a map indexed on id.
Edit: note that this will not increase throughput, because only one Python thread is active at a time. We would have to rewrite the code guarded by these locks in Rust or C and release the GIL to make use of the multiple locks.
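A minimal sketch of that per-test locking pattern (illustrative names only, not the actual fishtest code):
import threading

map_lock = threading.Lock()  # protects the map of per-run locks itself
run_locks = {}               # run_id -> threading.Lock

def lock_for(run_id):
    # hand out one dedicated lock per run, created lazily on first use
    with map_lock:
        return run_locks.setdefault(run_id, threading.Lock())

def update_task(run_id, update):
    # updates for different runs no longer serialize on one global lock,
    # although the GIL still limits CPU-bound throughput as noted above
    with lock_for(run_id):
        apply_update(run_id, update)  # hypothetical helper doing the real work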
On DEV I tested these 2 solutions (both take 11 minutes to finish), let me know your preference:
Option 1: cpulimit in crontab. This should also limit the speed of the backup upload to s3 (commented out on DEV, not tested):
3 * * * * /usr/bin/cpulimit -l 50 -f -m -- sh ${UPATH}/backup.sh
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 159240 3560 2268 S 0.1 0:01.86 init -z
2 root 20 S `- [kthreadd/13518]
3 root 20 S `- [khelper]
59 root 20 111784 6352 5368 S 0.1 0:02.56 `- /lib/systemd/systemd-journald
166 root 20 42092 992 940 S 0.0 0:00.07 `- /lib/systemd/systemd-udevd
169 systemd+ 20 71848 1132 1064 S 0.0 0:00.09 `- /lib/systemd/systemd-networkd
181 message+ 20 47620 1272 960 S 0.0 0:00.12 `- /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
190 root 20 28344 1132 980 S 0.0 0:00.19 `- /usr/sbin/cron -f
24762 root 20 45140 1332 988 S 0.0 `- /usr/sbin/CRON -f
24763 usr00 20 4616 656 568 S 0.0 `- /bin/sh -c /usr/bin/cpulimit -l 50 -f -m -- sh /home/usr00/__test/test_backup.sh
24764 usr00 20 82508 676 564 S 0.3 0.0 0:00.02 `- /usr/bin/cpulimit -l 50 -f -m -- sh /home/usr00/__test/test_backup.sh
24765 usr00 20 4616 664 572 S 0.0 `- sh /home/usr00/__test/test_backup.sh
24767 usr00 20 1289292 206324 4996 T 63.7 3.3 0:07.72 `- mongodump --db=fishtest_new --excludeCollection=pgns --gzip
24787 usr00 20 82508 704 592 S 0.3 0.0 0:00.02 `- /usr/bin/cpulimit -l 50 -p 24767 -z -m
Option 2: cpulimit in backup.sh. Here it is possible to upload the backup to s3 at max speed:
for db_name in "fishtest_new" "admin" "fishtest" "fishtest_testing"; do
cpulimit -l 50 -m -f -- mongodump --db=${db_name} --excludeCollection="pgns" --gzip
done
tar -cv dump | cpulimit -l 50 -m -f -- gzip -1 > dump.tar.gz
rm -rf dump
date_utc=$(date +%Y%m%d --utc)
${VENV}/bin/aws s3 cp dump.tar.gz s3://fishtest/backup/archive/${date_utc}/dump.tar.gz
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 159240 3556 2264 S 0.1 0:01.85 init -z
2 root 20 S `- [kthreadd/13518]
3 root 20 S `- [khelper]
59 root 20 111784 14360 13388 S 0.2 0:02.52 `- /lib/systemd/systemd-journald
166 root 20 42092 992 940 S 0.0 0:00.07 `- /lib/systemd/systemd-udevd
169 systemd+ 20 71848 1132 1064 S 0.0 0:00.09 `- /lib/systemd/systemd-networkd
181 message+ 20 47620 1272 960 S 0.0 0:00.12 `- /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
190 root 20 28344 1132 980 S 0.0 0:00.18 `- /usr/sbin/cron -f
24600 root 20 45140 1332 988 S 0.0 `- /usr/sbin/CRON -f
24601 usr00 20 4616 660 572 S 0.0 `- /bin/sh -c sh /home/usr00/__test/test_backup.sh
24602 usr00 20 4616 660 572 S 0.0 `- sh /home/usr00/__test/test_backup.sh
24603 usr00 20 82508 680 564 S 0.3 0.0 0:00.10 `- cpulimit -l 50 -m -f -- mongodump --db=fishtest_new --excludeCollection=pgns --gzip
24604 usr00 20 1289292 207572 5144 R 75.7 3.3 0:24.70 `- mongodump --db=fishtest_new --excludeCollection=pgns --gzip
My 2 cents... option 1 is simpler and robust. I'm actually not sure an IO-bound task will be much influenced by cpulimit, so transfer rates might still be high. And even if the rates are a bit lower, that's actually not such a bad thing if the outgoing bandwidth is spiky as well.
@vondele I tried with a download script and the transfer rate seems unaffected. Here is how cpulimit works: "Basically, the target process, which you can specify by pid, name, or command line, is continuously paused and resumed by sending it SIGSTOP and SIGCONT signals."
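A toy illustration of that mechanism (not cpulimit's actual source), duty-cycling a target PID so it averages roughly the requested fraction of one core:
import os, signal, time

def throttle(pid, fraction=0.5, period=0.1):
    run_t = period * fraction
    stop_t = period - run_t
    while True:
        os.kill(pid, signal.SIGCONT)  # let the process run...
        time.sleep(run_t)
        os.kill(pid, signal.SIGSTOP)  # ...then pause it for the rest of the period
        time.sleep(stop_t)
This also explains why transfers are barely affected: the kernel can keep filling or draining socket buffers while the process is stopped, and an IO-bound process spends most of its time sleeping anyway.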
PROD updated with option 1
so we would be ready for some testing by @noobpwnftw ? Maybe gentle to start with ;-)
Sure, gradually increasing workers.
Backup using option 1 successfully finished in 14 minutes.
It now shows a few 500 Internal Server Errors when I refresh.
From sudo journalctl -u fishtest@6543 -f:
"Task queue depth" continually swings between 1 and 60-110.
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,543 WARNI [waitress.queue][MainThread] Task queue depth is 42
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,543 WARNI [waitress.queue][MainThread] Task queue depth is 43
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,543 WARNI [waitress.queue][MainThread] Task queue depth is 44
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,547 WARNI [waitress.queue][MainThread] Task queue depth is 44
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,556 WARNI [waitress.queue][MainThread] Task queue depth is 43
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,557 WARNI [waitress.queue][MainThread] Task queue depth is 44
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,557 WARNI [waitress.queue][MainThread] Task queue depth is 45
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,558 WARNI [waitress.queue][MainThread] Task queue depth is 46
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,558 WARNI [waitress.queue][MainThread] Task queue depth is 47
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,559 WARNI [waitress.queue][MainThread] Task queue depth is 48
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,563 WARNI [waitress.queue][MainThread] Task queue depth is 49
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,564 WARNI [waitress.queue][MainThread] Task queue depth is 50
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,613 WARNI [waitress.queue][MainThread] Task queue depth is 47
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,703 WARNI [waitress.queue][MainThread] Task queue depth is 28
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,781 WARNI [waitress.queue][MainThread] Task queue depth is 3
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,971 WARNI [waitress.queue][MainThread] Task queue depth is 1
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,972 WARNI [waitress.queue][MainThread] Task queue depth is 2
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,972 WARNI [waitress.queue][MainThread] Task queue depth is 3
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,973 WARNI [waitress.queue][MainThread] Task queue depth is 4
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,973 WARNI [waitress.queue][MainThread] Task queue depth is 5
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,974 WARNI [waitress.queue][MainThread] Task queue depth is 5
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,978 WARNI [waitress.queue][MainThread] Task queue depth is 6
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,981 WARNI [waitress.queue][MainThread] Task queue depth is 7
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,983 WARNI [waitress.queue][MainThread] Task queue depth is 8
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,985 WARNI [waitress.queue][MainThread] Task queue depth is 7
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,987 WARNI [waitress.queue][MainThread] Task queue depth is 7
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,989 WARNI [waitress.queue][MainThread] Task queue depth is 8
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,989 WARNI [waitress.queue][MainThread] Task queue depth is 9
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,989 WARNI [waitress.queue][MainThread] Task queue depth is 10
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,990 WARNI [waitress.queue][MainThread] Task queue depth is 11
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,990 WARNI [waitress.queue][MainThread] Task queue depth is 12
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,990 WARNI [waitress.queue][MainThread] Task queue depth is 13
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,991 WARNI [waitress.queue][MainThread] Task queue depth is 14
As a user, it seems OK atm. I timed one of Viz's multithreaded jobs and it got 420 games in a minute with about 950 cores. I guesstimated it should get about 475, so that seems about right.
Edit: I forgot to say; currently 10,100 cores.
It's struggling now, 10k cores, but maybe not enough jobs to keep them busy
@xoto10 For your regression test, I think you forgot to update the bench value for master. Fishtest assumes the base branch is current master and defaults to that bench value...please resubmit when you have a chance.
@xoto10 regression test is using the wrong book...
I do get the following error message from time to time on the webpage:
Secure Connection Failed
An error occurred during a connection to tests.stockfishchess.org. PR_END_OF_FILE_ERROR
@glinscott could you post another screenshot of the server load to see if things have been fixed? Thanks!
The logs are nearly unreadable, 5 warnings/sec :(
@tomtor updated the wiki server setup script and PROD /root/setup_fishtest.sh with cpulimit
@ppigazzini @tomtor right now might be a good moment to gather some statistics on fishtest performance under a 40000-core load. It seems more or less OK, but suffering a bit.