official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests
281 stars 129 forks source link

fishtest at 40k cores #553

Closed vondele closed 2 years ago

vondele commented 4 years ago

@ppigazzini @tomtor right now might be a good moment to gather some statistics on the fishtest performance under 40000 cores load. It seems more or less OK, but suffering a bit.

ppigazzini commented 4 years ago
top - 21:54:13 up 40 days, 21:37,  1 user,  load average: 2.05, 2.04, 1.97
Tasks:  31 total,   2 running,  29 sleeping,   0 stopped,   0 zombie
%Cpu0  :  34.2/1.7    36[||||||||||||||||||||||||||||||||||||                                                                ]
%Cpu1  :  39.1/2.0    41[|||||||||||||||||||||||||||||||||||||||||                                                           ]
%Cpu2  :  23.0/1.3    24[||||||||||||||||||||||||                                                                            ]
%Cpu3  :  33.2/1.3    35[||||||||||||||||||||||||||||||||||                                                                  ]
GiB Mem : 67.1/5.000    [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                                 ]
GiB Swap:  0.0/0.000    [                                                                                                    ]

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20      225020   3120   1584 S        0.1   0:20.24 init -z
    2 root      20                           S                       `- [kthreadd/1988]
    3 root      20                           S                           `- [khelper]
   57 root      20      346764 189008 178700 S   1.0  3.6   4:12.98  `- /lib/systemd/systemd-journald
  167 root      20       42092    472    424 S        0.0   0:03.27  `- /lib/systemd/systemd-udevd
  172 systemd+  20       71848    404    336 S        0.0   0:03.79  `- /lib/systemd/systemd-networkd
  293 root      20       28344    536    436 S        0.0   0:10.87  `- /usr/sbin/cron -f
  307 message+  20       47748   1060    628 S        0.0   0:01.70  `- /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
  310 root      20       62144    792    616 S        0.0   0:04.77  `- /lib/systemd/systemd-logind
  315 syslog    20      189016   1624    444 S   0.3  0.0   0:26.64  `- /usr/sbin/rsyslogd -n
  318 root      20       72288    616    504 S        0.0   0:00.04  `- /usr/sbin/sshd -D
28685 root      20       97180   4092   3148 S        0.1   0:00.01      `- sshd: fishtest [priv]
28709 fishtest  20       97180   1872    928 S        0.0                    `- sshd: fishtest@pts/0
28710 fishtest  20       18632   2128   1580 S        0.0                        `- -bash
28726 fishtest  20       36752   1936   1380 R        0.0   0:00.13                  `- top
  334 root      20      184296   1576   1576 S        0.0   0:00.10  `- /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
  356 root      20       24176    300    296 S        0.0            `- /usr/sbin/xinetd -pidfile /run/xinetd.pid -stayalive -inetd_compat -inetd_ipv6
  573 root      20      100972    240    236 S        0.0            `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
  574 root      20      100972      4        S        0.0                `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
  583 Debian-+  20       59180    500    404 S        0.0   0:02.90  `- /usr/sbin/exim4 -bd -q30m
18349 root      20      142520   1528    944 S        0.0   0:00.01  `- nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
  520 www-data  20      147488   8224   2428 R   4.3  0.2 433:45.79      `- nginx: worker process
  521 www-data  20      144872   5172   2360 S        0.1  11:23.62      `- nginx: worker process
  522 www-data  20      143724   3576   2220 S        0.1   0:01.50      `- nginx: worker process
  523 www-data  20      143344   3204   2076 S        0.1   0:00.37      `- nginx: worker process
22838 uuidd     20       25192    188    188 S        0.0            `- /usr/sbin/uuidd --socket-activation
21164 mongodb   20     3385516 1.672g   6064 S  30.9 33.4 201:36.49  `- /usr/bin/mongod --config /etc/mongod.conf
21206 fishtest  20     2472712 1.092g   5172 S  99.3 21.8 911:46.03  `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6543
21207 fishtest  20     2035064 575920   4216 S   3.0 11.0  21:56.88  `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6544
24529 root      20       13008    840    712 S        0.0            `- /sbin/agetty -o -p -- \u --noclear tty2 linux
24531 root      20       13008    844    716 S        0.0            `- /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 linux
vondele commented 4 years ago

So this one is at 100%: 21206 fishtest 20 2472712 1.092g 5172 S 99.3 21.8 911:46.03 `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6543

ppigazzini commented 4 years ago
upstream backend_tests {
    server 127.0.0.1:6543;
}

upstream backend_all {
    server 127.0.0.1:6544;
}

server {
    listen 80;
    listen [::]:80;

    server_name tests.stockfishchess.org;

    location / {
        return 301 https://$host$request_uri;
    }

    location /api {
        proxy_pass http://backend_tests;

        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Port $server_port;

        client_max_body_size        10m;
        client_body_buffer_size     128k;
        proxy_connect_timeout       60s;
        proxy_send_timeout          90s;
        proxy_read_timeout          90s;
        proxy_buffering             off;
        proxy_temp_file_write_size  64k;
        proxy_redirect              off;

        location ~ ^/api/(active_runs|download_pgn|download_pgn_100|request_version|upload_pgn) {
            proxy_pass http://backend_all;
        }
    }
}

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2 ipv6only=on;
    ssl_certificate /etc/letsencrypt/live/tests.stockfishchess.org/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/tests.stockfishchess.org/privkey.pem;
    include /etc/letsencrypt/options-ssl-nginx.conf;
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem;

    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;

    server_name tests.stockfishchess.org;

    location ~ ^/(css|html|img|js|favicon.ico|robots.txt) {
        root        /home/fishtest/fishtest/fishtest/fishtest/static;
        expires     1y;
        add_header  Cache-Control public;
        access_log  off;
    }
    location / {
        proxy_pass http://backend_all;

        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Port $server_port;

        client_max_body_size        10m;
        client_body_buffer_size     128k;
        proxy_connect_timeout       60s;
        proxy_send_timeout          90s;
        proxy_read_timeout          90s;
        proxy_buffering             off;
        proxy_temp_file_write_size  64k;
        proxy_redirect              off;

        location ~ ^/api/(active_runs|download_pgn|download_pgn_100|request_version|upload_pgn) {
            proxy_pass http://backend_all;
        }

        location /api/ {
            proxy_pass http://backend_tests;
        }

        location /tests/user/ {
            proxy_pass http://backend_all;
        }

        location /tests {
            proxy_pass http://backend_tests;
        }
    }
}
ppigazzini commented 4 years ago

@vondele api served by backend_test are stateful, so we cannot add (easily) other backend processes.

vondele commented 4 years ago

guess we have just to get some passes for these STC patches and turn them into LTC, and things will ease on the server :-)

having said that, that's thus the single process bottleneck. I guess there is not much of low hanging fruit there.

noobpwnftw commented 4 years ago

Is it possible to load balance by test? Should not need global lock for different tests.

ppigazzini commented 4 years ago

Reading https://github.com/glinscott/fishtest/issues/133#issuecomment-526931028 perhaps we could try to use nginx hash to load balancing the id_tests on some processes http://nginx.org/en/docs/http/ngx_http_upstream_module.html#hash

noobpwnftw commented 4 years ago

After turning off 2000 new workers, it seems more stable now. Currently there are 1748 new workers running, except for the regular ones. New workers have 4 cores each.

vondele commented 4 years ago

@ppigazzini can you post another top ?

ppigazzini commented 4 years ago
top - 22:39:08 up 40 days, 22:22,  1 user,  load average: 2.09, 1.83, 1.86
Tasks:  31 total,   2 running,  29 sleeping,   0 stopped,   0 zombie
%Cpu0  :  41.3/3.3    45[||||||||||||||||||||||||||||||||||||||||||||                                                        ]
%Cpu1  :  10.3/1.0    11[|||||||||||                                                                                         ]
%Cpu2  :  48.7/2.0    51[|||||||||||||||||||||||||||||||||||||||||||||||||||                                                 ]
%Cpu3  :   7.0/0.7     8[||||||||                                                                                            ]
GiB Mem : 69.6/5.000    [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                              ]
GiB Swap:  0.0/0.000    [                                                                                                    ]

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20      225020   3176   1640 S        0.1   0:20.26 init -z
    2 root      20                           S                       `- [kthreadd/1988]
    3 root      20                           S                           `- [khelper]
   57 root      20      203012  80348  70032 S   0.3  1.5   4:53.95  `- /lib/systemd/systemd-journald
  167 root      20       42092    416    368 S        0.0   0:03.29  `- /lib/systemd/systemd-udevd
  172 systemd+  20       71848    312    244 S        0.0   0:03.80  `- /lib/systemd/systemd-networkd
  293 root      20       28344    536    436 S        0.0   0:10.88  `- /usr/sbin/cron -f
  307 message+  20       47748   1084    652 S        0.0   0:01.70  `- /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
  310 root      20       62144    896    720 S        0.0   0:04.78  `- /lib/systemd/systemd-logind
  315 syslog    20      189016   1784    596 S        0.0   0:41.38  `- /usr/sbin/rsyslogd -n
  318 root      20       72288    544    432 S        0.0   0:00.04  `- /usr/sbin/sshd -D
28685 root      20       97180   1624    680 S        0.0   0:00.01      `- sshd: fishtest [priv]
28709 fishtest  20       97180   1376    432 S        0.0   0:00.18          `- sshd: fishtest@pts/0
28710 fishtest  20       18632   1380    800 S        0.0                        `- -bash
 1697 fishtest  20       36620   1824   1280 R   0.3  0.0   0:00.04                  `- top
  334 root      20      184296   1576   1576 S        0.0   0:00.10  `- /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
  356 root      20       24176    300    296 S        0.0            `- /usr/sbin/xinetd -pidfile /run/xinetd.pid -stayalive -inetd_compat -inetd_ipv6
  573 root      20      100972    240    236 S        0.0            `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
  574 root      20      100972      4        S        0.0                `- /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
  583 Debian-+  20       59180    500    404 S        0.0   0:02.90  `- /usr/sbin/exim4 -bd -q30m
18349 root      20      142520   1524    940 S        0.0   0:00.01  `- nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
  520 www-data  20      147788   8420   2460 S   7.6  0.2 436:27.96      `- nginx: worker process
  521 www-data  20      144876   5320   2392 S   0.3  0.1  11:53.64      `- nginx: worker process
  522 www-data  20      143724   3564   2208 S        0.1   0:01.72      `- nginx: worker process
  523 www-data  20      143344   3196   2068 S        0.1   0:00.41      `- nginx: worker process
22838 uuidd     20       25192    188    188 S        0.0            `- /usr/sbin/uuidd --socket-activation
21164 mongodb   20     3385516 1.670g   5816 S   6.0 33.4 215:12.52  `- /usr/bin/mongod --config /etc/mongod.conf
21206 fishtest  20     2583040 1.201g   5140 R 101.7 24.0 956:00.69  `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6543
21207 fishtest  20     2035320 593664   4064 S   1.3 11.3  23:34.29  `- /home/fishtest/fishtest/fishtest/env/bin/python3 /home/fishtest/fishtest/fishtest/env/bin/pserve production.ini http_port=6544
31562 root      20       13008    844    712 S        0.0            `- /sbin/agetty -o -p -- \u --noclear tty2 linux
31563 root      20       13008    848    716 S        0.0            `- /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 linux
crossbr commented 4 years ago

Before noob turned off the 2000, my worker was producing the following:

Worker started in C:\fishtest-master\worker ...

Memory: 34287968256 Worker version 72 connecting to https://tests.stockfishchess.org:443 Fetch task... Exception accessing host: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

crossbr commented 4 years ago

And the fishtest page gives the following: Fishtest_error

noobpwnftw commented 4 years ago

There is a tipping point for the current server, past that the workers will timeout and the effect cascades. Looks like somewhere around 18K cores.

noobpwnftw commented 4 years ago

Now it seems down, I turned off all new workers, but it does not recover.

Vizvezdenec commented 4 years ago

Yeah server is dead now.

ppigazzini commented 4 years ago

The VPS seems down, I wrote to @glinscott asking to check, finger crossed.

tomtor commented 4 years ago

The VPS seems down, I wrote to @glinscott asking to check, finger crossed.

Probably reached a network limit to guard against DOS-attacks

Reading #133 (comment) perhaps we could try to use nginx hash to load balancing the id_tests on some processes http://nginx.org/en/docs/http/ngx_http_upstream_module.html#hash

See https://stackoverflow.com/questions/31994395/how-to-use-url-pathname-as-upstream-hash-in-nginx

We would have to add the test_id as additional URI parameter (...?5e8909804411759d9d099a78) in the update call from the worker and also to live_elo scripts and other query pages. That would allow distributing worker updates/queries over pserv instances.

Problem remains handing out new jobs to worker requests. If a test terminates then the single central work allocator has to be notified of this. It could more frequently poll MongoDb (which is updated when a test finishes) as a simple implementation.

xoto10 commented 4 years ago

Is fishtest able to tell when the server is very busy?

It would be nice if we had some default unimportant (or just null) LTC job that we could give to workers if the server was overloaded, that would maybe ease the problems before it goes too far.

I've done this manually before, and could have done last night, but it didn't occur to me until too late.

glinscott commented 4 years ago

It looks like we might have exceeded the CPU load.

image

glinscott commented 4 years ago

I've rebooted the server, it should be coming back up now.

ppigazzini commented 4 years ago

Logs

$ sudo journalctl -u fishtest@6543 -n 1000

Apr 05 02:05:55 tests.stockfishchess.org pserve[21206]: 2020-04-05 02:05:55,838 WARNI [waitress.queue][MainThread] Task queue depth is 427
Apr 05 02:05:55 tests.stockfishchess.org pserve[21206]: 2020-04-05 02:05:55,839 WARNI [waitress.queue][MainThread] Task queue depth is 428
Apr 05 02:05:55 tests.stockfishchess.org pserve[21206]: 2020-04-05 02:05:55,843 WARNI [waitress.queue][MainThread] Task queue depth is 428
Apr 05 02:05:55 tests.stockfishchess.org pserve[21206]: 2020-04-05 02:05:55,846 WARNI [waitress.queue][MainThread] Task queue depth is 429
Apr 05 02:05:55 tests.stockfishchess.org pserve[21206]: 2020-04-05 02:05:55,847 WARNI [waitress.queue][MainThread] Task queue depth is 430
Apr 05 02:05:55 tests.stockfishchess.org pserve[21206]: 2020-04-05 02:05:55,849 WARNI [waitress.queue][MainThread] Task queue depth is 431
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Starting server in PID 21206.
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Serving on http://localhost:6543
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: JojoM
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: tvijlbrief
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: ChessDBCN
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: 0x3C33
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: ChessDBCN
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: drabel
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: JojoM
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: yurikvelo
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: dew
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: Update_task: Non matching username: ChessDBCN
Apr 05 02:05:56 tests.stockfishchess.org pserve[21206]: flush
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: .......Traceback (most recent call last):
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/bin/pserve", line 8, in <module>
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     sys.exit(main())
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pyramid/scripts/pserve.py", line 34, in main
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     return command.run()
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pyramid/scripts/pserve.py", line 285, in run
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     server(app)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/paste/deploy/loadwsgi.py", line 195, in server_wrapper
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     **context.local_conf)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/paste/deploy/util.py", line 55, in fix_call
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     val = callable(*args, **kw)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/waitress/__init__.py", line 22, in serve_paste
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     serve(app, **kw)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/waitress/__init__.py", line 18, in serve
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     server.run()
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/waitress/server.py", line 332, in run
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     use_poll=self.adj.asyncore_use_poll,
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/waitress/wasyncore.py", line 247, in loop
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     poll_fun(timeout, map)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/waitress/wasyncore.py", line 174, in poll
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     r, w, e = select.select(r, w, e, timeout)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/fishtest/rundb.py", line 163, in exit_run
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     last_rundb.flush_all()
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/fishtest/rundb.py", line 220, in flush_all
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     self.runs.save(self.run_cache[r_id]['run'])
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/collection.py", line 3163, in save
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     collation=collation)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/collection.py", line 856, in _update_retryable
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     _update, session)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/mongo_client.py", line 1491, in _retryable_write
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     return self._retry_with_session(retryable, func, s, None)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/mongo_client.py", line 1384, in _retry_with_session
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     return func(session, sock_info, retryable)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/collection.py", line 852, in _update
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     retryable_write=retryable_write)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/collection.py", line 822, in _update
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     retryable_write=retryable_write).copy()
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/pool.py", line 618, in command
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     self._raise_connection_failure(error)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/pool.py", line 613, in command
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     user_fields=user_fields)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/network.py", line 167, in command
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     parse_write_concern_error=parse_write_concern_error)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:   File "/home/fishtest/fishtest/fishtest/env/lib/python3.6/site-packages/pymongo/helpers.py", line 136, in _check_command_response
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]:     raise NotMasterError(errmsg, response)
Apr 05 02:05:58 tests.stockfishchess.org pserve[21206]: pymongo.errors.NotMasterError: interrupted at shutdown
$ sudo journalctl -u fishtest@6544 -n 1000

Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,038 WARNI [waitress.queue][MainThread] Task queue depth is 5
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,038 WARNI [waitress.queue][MainThread] Task queue depth is 6
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,038 WARNI [waitress.queue][MainThread] Task queue depth is 7
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,039 WARNI [waitress.queue][MainThread] Task queue depth is 8
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,039 WARNI [waitress.queue][MainThread] Task queue depth is 9
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,040 WARNI [waitress.queue][MainThread] Task queue depth is 10
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,040 WARNI [waitress.queue][MainThread] Task queue depth is 11
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,041 WARNI [waitress.queue][MainThread] Task queue depth is 12
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,041 WARNI [waitress.queue][MainThread] Task queue depth is 13
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,042 WARNI [waitress.queue][MainThread] Task queue depth is 14
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,042 WARNI [waitress.queue][MainThread] Task queue depth is 15
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,043 WARNI [waitress.queue][MainThread] Task queue depth is 16
Apr 04 23:52:06 tests.stockfishchess.org pserve[21207]: 2020-04-04 23:52:06,046 WARNI [waitress.queue][MainThread] Task queue depth is 17
Apr 05 02:05:55 tests.stockfishchess.org systemd[1]: Stopping Fishtest Server port 6544...
Apr 05 02:05:55 tests.stockfishchess.org pserve[21207]: Invalid login: "KatyaTal" "69#dzbHot2J"
Apr 05 02:05:55 tests.stockfishchess.org pserve[21207]: Invalid login: "Craigbob" "k9A1%r1vpdU"
Apr 05 02:05:55 tests.stockfishchess.org pserve[21207]: Starting server in PID 21207.
Apr 05 02:05:55 tests.stockfishchess.org pserve[21207]: Serving on http://localhost:6544
Apr 05 02:05:55 tests.stockfishchess.org pserve[21207]: ['missing-input-response']
Apr 05 02:05:55 tests.stockfishchess.org pserve[21207]: flush
$ sudo journalctl -u mongod -n 100

Mar 02 17:56:54 tests.stockfishchess.org systemd[1]: Started MongoDB Database Server.
Apr 01 03:01:00 tests.stockfishchess.org systemd[1]: Stopping MongoDB Database Server...
Apr 01 03:01:11 tests.stockfishchess.org systemd[1]: Stopped MongoDB Database Server.
Apr 01 03:01:11 tests.stockfishchess.org systemd[1]: Started MongoDB Database Server.
Apr 05 02:05:55 tests.stockfishchess.org systemd[1]: Stopping MongoDB Database Server...
Apr 05 02:06:36 tests.stockfishchess.org systemd[1]: Stopped MongoDB Database Server.
Apr 05 16:33:36 tests.stockfishchess.org systemd[1]: Started MongoDB Database Server.
vondele commented 4 years ago

CPU load seems to get very high every hour or so. @ppigazzini could that be a regular backup (coincides with outgoing traffic) or so, possibly even multithreaded? If so, we could try to reduce the number of threads for that process and/or renice/cpulimit ?

ppigazzini commented 4 years ago

It's the hourly mongodb backup.

On DEV:

Tried:

vondele commented 4 years ago

so that seems one way of keeping this under control. Maybe even -l 50. Eventually we have to make sure this backup is finished before the next one starts... don't know how we have this scheduled.

tomtor commented 4 years ago

A backup every 3 hours combined with "-l 50" would be sufficient?

vondele commented 4 years ago

I think that would be OK.... BTW, I don't know mongodb, is the db locked during this dumping time?

tomtor commented 4 years ago

I think that would be OK.... BTW, I don't know mongodb, is the db locked during this dumping time?

No it remains writable/queryable

vondele commented 4 years ago

so I think we should go to cpulimit -l 50 -- mongodump ... and do it every 3h.

ppigazzini commented 4 years ago

I'm playing a bit with the systemd config for cron.service https://askubuntu.com/questions/71211/how-to-use-cpulimit-on-all-cron-jobs This is not able to limit the CPU usage when the cores are unloaded, though.

noobpwnftw commented 4 years ago

cron tasks, in theory, will not execute if the previous one is still in progress. VPS went down after a while operating normally with reduced workers, so it should be hitting other limits.

How about locking the mongo database with an array of locks mod task id, instead of one?

tomtor commented 4 years ago

How about locking the mongo database with an array of locks mod task id, instead of one?

@noobpwnftw I do not think we have a MongoDb bottleneck.

noobpwnftw commented 4 years ago

@noobpwnftw I do not think we have a MongoDb bottleneck.

No we don't, but the Python server code takes one global lock for updating all tasks. I was suggesting if we can use a more granular locking scheme there.

tomtor commented 4 years ago

@noobpwnftw I do not think we have a MongoDb bottleneck.

No we don't, but the Python server code takes one global lock for updating all tasks. I was suggesting if we can use a more granular locking scheme there.

I am pretty sure we have a lock per test stored in a map indexed on id.

Edit: note that this will not increase throughput because only one Python thread is active at a time. We would have to rewrite the code after these locks in Rust or C and release the GIL to make use of the multiple locks.

ppigazzini commented 4 years ago

On DEV I tested these 2 solutions (both take 11 minutes to finish), let me know your preference:

  1. cpulimit in crontab: this should affect also the backup speed on s3 (commented out on DEV, not tested)
    3 * * * * /usr/bin/cpulimit -l 50 -f -m -- sh ${UPATH}/backup.sh
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20      159240   3560   2268 S        0.1   0:01.86 init -z
    2 root      20                           S                       `- [kthreadd/13518]
    3 root      20                           S                           `- [khelper]
    59 root      20      111784   6352   5368 S        0.1   0:02.56  `- /lib/systemd/systemd-journald
    166 root      20       42092    992    940 S        0.0   0:00.07  `- /lib/systemd/systemd-udevd
    169 systemd+  20       71848   1132   1064 S        0.0   0:00.09  `- /lib/systemd/systemd-networkd
    181 message+  20       47620   1272    960 S        0.0   0:00.12  `- /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
    190 root      20       28344   1132    980 S        0.0   0:00.19  `- /usr/sbin/cron -f
    24762 root      20       45140   1332    988 S        0.0                `- /usr/sbin/CRON -f
    24763 usr00     20        4616    656    568 S        0.0                    `- /bin/sh -c /usr/bin/cpulimit -l 50 -f -m -- sh /home/usr00/__test/test_backup.sh
    24764 usr00     20       82508    676    564 S   0.3  0.0   0:00.02              `- /usr/bin/cpulimit -l 50 -f -m -- sh /home/usr00/__test/test_backup.sh
    24765 usr00     20        4616    664    572 S        0.0                            `- sh /home/usr00/__test/test_backup.sh
    24767 usr00     20     1289292 206324   4996 T  63.7  3.3   0:07.72                      `- mongodump --db=fishtest_new --excludeCollection=pgns --gzip
    24787 usr00     20       82508    704    592 S   0.3  0.0   0:00.02                  `- /usr/bin/cpulimit -l 50 -p 24767 -z -m
  2. cpulimit in backup.sh: here is possible to backup on s3 at max speed
    
    for db_name in "fishtest_new" "admin" "fishtest" "fishtest_testing"; do
    cpulimit -l 50 -m -f -- mongodump --db=${db_name} --excludeCollection="pgns" --gzip
    done
    tar -cv dump | cpulimit -l 50 -m -f -- gzip -1 > dump.tar.gz
    rm -rf dump

date_utc=$(date +%Y%m%d --utc) ${VENV}/bin/aws s3 cp dump.tar.gz s3://fishtest/backup/archive/${date_utc}/dump.tar.gz

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1 root 20 159240 3556 2264 S 0.1 0:01.85 init -z 2 root 20 S - [kthreadd/13518] 3 root 20 S- [khelper] 59 root 20 111784 14360 13388 S 0.2 0:02.52 - /lib/systemd/systemd-journald 166 root 20 42092 992 940 S 0.0 0:00.07- /lib/systemd/systemd-udevd 169 systemd+ 20 71848 1132 1064 S 0.0 0:00.09 - /lib/systemd/systemd-networkd 181 message+ 20 47620 1272 960 S 0.0 0:00.12- /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only 190 root 20 28344 1132 980 S 0.0 0:00.18 - /usr/sbin/cron -f 24600 root 20 45140 1332 988 S 0.0- /usr/sbin/CRON -f 24601 usr00 20 4616 660 572 S 0.0 - /bin/sh -c sh /home/usr00/__test/test_backup.sh 24602 usr00 20 4616 660 572 S 0.0- sh /home/usr00/__test/test_backup.sh 24603 usr00 20 82508 680 564 S 0.3 0.0 0:00.10 - cpulimit -l 50 -m -f -- mongodump --db=fishtest_new --excludeCollection=pgns --gzip 24604 usr00 20 1289292 207572 5144 R 75.7 3.3 0:24.70- mongodump --db=fishtest_new --excludeCollection=pgns --gzip

vondele commented 4 years ago

my 2 cents... option 1 is simpler and robust. I'm actually not sure an IO bound task will be much influenced by cpulimit, so transfer rates might be still high. And even if the rates are a bit lower, that's actually not such a bad thing if the outgoing bandwidth is spiky as well.

ppigazzini commented 4 years ago

@vondele I tried with a download script and the transfer rate seems unaffected. Here is how cpulimit works "Basically, the target process, which you can specify by pid, name, or command line, is continuosly paused and resumed by sending it SIGSTOP and SIGCONT signals."

ppigazzini commented 4 years ago

PROD updated with option 1

vondele commented 4 years ago

so we would be ready for some testing by @noobpwnftw ? Maybe gentle to start with ;-)

noobpwnftw commented 4 years ago

Sure, gradually increasing workers.

ppigazzini commented 4 years ago

Backup using option 1 successfully finished in 14 minutes.

noobpwnftw commented 4 years ago

It now shows a few 500 internal server error when I refresh.

ppigazzini commented 4 years ago

From sudo journalctl -u fishtets@6543 -f "Task queue depth" continually swing between 1 and 60-110.

Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,543 WARNI [waitress.queue][MainThread] Task queue depth is 42
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,543 WARNI [waitress.queue][MainThread] Task queue depth is 43
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,543 WARNI [waitress.queue][MainThread] Task queue depth is 44
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,547 WARNI [waitress.queue][MainThread] Task queue depth is 44
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,556 WARNI [waitress.queue][MainThread] Task queue depth is 43
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,557 WARNI [waitress.queue][MainThread] Task queue depth is 44
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,557 WARNI [waitress.queue][MainThread] Task queue depth is 45
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,558 WARNI [waitress.queue][MainThread] Task queue depth is 46
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,558 WARNI [waitress.queue][MainThread] Task queue depth is 47
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,559 WARNI [waitress.queue][MainThread] Task queue depth is 48
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,563 WARNI [waitress.queue][MainThread] Task queue depth is 49
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,564 WARNI [waitress.queue][MainThread] Task queue depth is 50
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,613 WARNI [waitress.queue][MainThread] Task queue depth is 47
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,703 WARNI [waitress.queue][MainThread] Task queue depth is 28
Apr 07 00:38:43 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:43,781 WARNI [waitress.queue][MainThread] Task queue depth is 3
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,971 WARNI [waitress.queue][MainThread] Task queue depth is 1
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,972 WARNI [waitress.queue][MainThread] Task queue depth is 2
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,972 WARNI [waitress.queue][MainThread] Task queue depth is 3
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,973 WARNI [waitress.queue][MainThread] Task queue depth is 4
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,973 WARNI [waitress.queue][MainThread] Task queue depth is 5
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,974 WARNI [waitress.queue][MainThread] Task queue depth is 5
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,978 WARNI [waitress.queue][MainThread] Task queue depth is 6
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,981 WARNI [waitress.queue][MainThread] Task queue depth is 7
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,983 WARNI [waitress.queue][MainThread] Task queue depth is 8
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,985 WARNI [waitress.queue][MainThread] Task queue depth is 7
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,987 WARNI [waitress.queue][MainThread] Task queue depth is 7
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,989 WARNI [waitress.queue][MainThread] Task queue depth is 8
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,989 WARNI [waitress.queue][MainThread] Task queue depth is 9
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,989 WARNI [waitress.queue][MainThread] Task queue depth is 10
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,990 WARNI [waitress.queue][MainThread] Task queue depth is 11
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,990 WARNI [waitress.queue][MainThread] Task queue depth is 12
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,990 WARNI [waitress.queue][MainThread] Task queue depth is 13
Apr 07 00:38:53 tests.stockfishchess.org pserve[23325]: 2020-04-07 00:38:53,991 WARNI [waitress.queue][MainThread] Task queue depth is 14
xoto10 commented 4 years ago

As a user, it seems ok atm. I timed one of Viz multithread jobs and it got 420 games in a minute with about 950 cores. I guesstimated it should get about 475 so that seems about right.

Edit: I forgot to say; currently 10,100 cores

xoto10 commented 4 years ago

It's struggling now, 10k cores, but maybe not enough jobs to keep them busy

31m059 commented 4 years ago

@xoto10 For your regression test, I think you forgot to update the bench value for master. Fishtest assumes the base branch is current master and defaults to that bench value...please resubmit when you have a chance.

vondele commented 4 years ago

@xoto10 regression test is using the wrong book...

vondele commented 4 years ago

I do get the following error message from time to time on the webpage:

Secure Connection Failed

An error occurred during a connection to tests.stockfishchess.org. PR_END_OF_FILE_ERROR
vondele commented 4 years ago

@glinscott could you post another screenshot of the server load to see if things have been fixed? Thanks!

ppigazzini commented 4 years ago

The logs is nearly unreadable, 5 warning/sec :(

ppigazzini commented 4 years ago

@tomtor updated wiki server setup script with cpulimit and PROD /root/setup_fishtest.sh