official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests

Use persistent connections from nginx to backend #1585

Closed. noobpwnftw closed this issue 1 year ago.

noobpwnftw commented 1 year ago

Looking at the deploy script https://github.com/glinscott/fishtest/wiki/Fishtest-server-setup I think you may need keepalive from nginx to python so that you don't create a new connection for every request.

http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive

ppigazzini commented 1 year ago

Testing this on PROD

upstream backend_tests {
    server 127.0.0.1:6543;
    keepalive 64;
}

upstream backend_all {
    server 127.0.0.1:6544;
    #server 127.0.0.1:6545;
    keepalive 64;
}

    location /api {
        proxy_pass http://backend_tests;

        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
ppigazzini commented 1 year ago

@noobpwnftw reverted on PROD, to be tested on DEV, problems with SPSA, see

https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1680901335.382071&run_id=

Started game 1 of 12 (New-5880628d16 vs Base-5880628d16)
Started game 3 of 12 (New-5880628d16 vs Base-5880628d16)
Started game 2 of 12 (Base-5880628d16 vs New-5880628d16)
Exception in requests.post():
HTTPSConnectionPool(host='tests.stockfishchess.org', port=443): Max retries exceeded with url: /api/beat (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1002)')))
Exception calling heartbeat:
Post request to https://tests.stockfishchess.org:443/api/beat failed
  Send heartbeat for fd84021c-d8c3-4a7c-b69d-6880a8e56f6e ... Finished game 1 (New-5880628d16 vs Base-5880628d16): 1/2-1/2 {Draw by adjudication}
Score of New-5880628d16 vs Base-5880628d16: 0 - 0 - 1  [0.500] 1
noobpwnftw commented 1 year ago

Probably unrelated, server is under load.

vondele commented 1 year ago

Look at the mongodb monitoring: it seems that since 7am today we have 10MB/s of network output and quite a high CPU load.

ppigazzini commented 1 year ago

12 SPSA running.

vondele commented 1 year ago

But they have not been running only since that point in time; they have been running for a while. Maybe there is an exceptional one that started around that time?

ppigazzini commented 1 year ago

The CPU load was high even this morning. Yesterday afternoon I updated PROD, DEV and NET with python 3.11.3 (from 3.11.2), so I'm monitoring the servers (top, journalctl, mutt, mongodb monitoring).

maximmasiutin commented 1 year ago

BTW, we have an nginx option:

        gzip_static  always;

Do we have a .gz version of every static file on the drive? Otherwise, gzip_static will not work (it only affects web browsers, since the workers are unlikely to request static assets).
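If we want to keep that option, the .gz files have to be generated next to the originals; a minimal sketch (the static directory path is an assumption, and gzip -k keeps the original files):

    find /home/fishtest/fishtest/server/fishtest/static -type f \
        \( -name '*.js' -o -name '*.css' -o -name '*.html' \) \
        -exec gzip -kf9 {} +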

vondele commented 1 year ago

The CPU load was high even this morning. Yesterday afternoon I updated PROD, DEV and NET with python 3.11.3 (from 3.11.2), so I'm monitoring the servers (top, journalctl, mutt, mongodb monitoring).

do you have an explanation for the jump in both CPU and network out at around 7:17 AM ?

I can set all SPSA to prio -1 if that seems worthwhile as a test?

ppigazzini commented 1 year ago

The CPU load was high even this morning. Yesterday afternoon I updated PROD, DEV and NET with python 3.11.3 (from 3.11.2), so I'm monitoring the servers (top, journalctl, mutt, mongodb monitoring).

do you have an explanation for the jump in both CPU and network out at around 7:17 AM ?

I can set all SPSA to prio -1 if that seems worthwhile as a test?

This afternoon the CPU load pattern was ----____----. I supposed that it was due to the high number of SPSA tests; the journalctl output this afternoon was clean (only two asserts for a run with cpu<0). Perhaps we could try to pause some SPSA tests.

@maximmasiutin I see that you have many workers with a key error https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1680894022.894242&run_id=

vondele commented 1 year ago

I'm setting all SPSA to prio -1.

ppigazzini commented 1 year ago

Python itself should tell Nginx that the connection is keep-alive by sending the respective header; otherwise Nginx will close the connections and this option will have no effect.

I'm not an expert, so I need to learn how to check the connections and to test on DEV whether the new configuration works. I will check gzip_static too: in the past we served some static content (html pages and at least the highlight.js package), but perhaps now that option can be dropped.

noobpwnftw commented 1 year ago

The rule of thumb is to turn it on on the nginx side and check the backends later; I think all proper HTTP server frameworks handle it automatically.

maximmasiutin commented 1 year ago

@ppigazzini - did you check that the upstreams honor the keep-alive request header by keeping the connection open while waiting for a new request on the same connection? It may be that the upstreams close a connection after one request regardless of the "Connection: Keep-Alive" header.
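One quick way to check is to hit the backend twice in a single curl invocation and look for connection reuse in the verbose output (a sketch; the port and the /api/active_runs endpoint are assumptions):

    curl -sv --http1.1 -o /dev/null -o /dev/null \
        http://127.0.0.1:6543/api/active_runs http://127.0.0.1:6543/api/active_runs \
        2>&1 | grep -iE 're-using|< connection'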

maximmasiutin commented 1 year ago

@maximmasiutin I see that you have many workers with a key error https://tests.stockfishchess.org/actions?max_actions=1&action=&user=&text=&before=1680894022.894242&run_id=

I fixed that already a few minutes ago, sorry!

ppigazzini commented 1 year ago

@ppigazzini - did you check that the upstreams honor the keep-alive request header by keeping the connection open while waiting for a new request on the same connection? It may be that the upstreams close a connection after one request regardless of the "Connection: Keep-Alive" header.

I searched for "waitress keepalive" and found only this: https://github.com/Pylons/waitress/issues/104

Configure it how? An HTTP 1.0 client may / may not pass the header:
ISTM that waitress does the Right Thing(TM) based on its presence.
noobpwnftw commented 1 year ago

Worry not, it does. Proof: https://github.com/Pylons/waitress/blob/455f2a5fd38cafab95893ca170c7504e4239d985/src/waitress/task.py#L250

maximmasiutin commented 1 year ago

You have to define

 proxy_http_version 1.1;

in Nginx to have keep-alive by default. Unless you define this directive, the HTTP version will be 1.0, where no implicit keep-alive exists.

Default: proxy_http_version 1.0;

maximmasiutin commented 1 year ago

@noobpwnftw - you sent a link to code that handles HTTP 1.1, whilst nginx by default uses HTTP version 1.0 for proxied connections.

ppigazzini commented 1 year ago

@noobpwnftw - you sent a link to code that handles HTTP 1.1, whilst nginx by default uses HTTP version 1.0 for proxied connections.

I wrote the right config, see https://github.com/glinscott/fishtest/issues/1585#issuecomment-1500629013

ppigazzini commented 1 year ago

BTW I reloaded the keepalive config; in top I don't see a change in CPU/Mem load for the pserve and nginx processes.

noobpwnftw commented 1 year ago

There shouldn't be much difference. However, you can check with netstat -anp | grep pserve | grep ESTAB | wc -l to count how many active connections there are. Other states like TIME_WAIT and SYN we want to keep low.
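For a quick breakdown by connection state, something along these lines also helps (a sketch):

    netstat -ant | awk 'NR > 2 {print $6}' | sort | uniq -c | sort -rn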

ppigazzini commented 1 year ago

There shouldn't be much difference. However, you can check with netstat -anp | grep pserve | grep ESTAB | wc -l to count how many active connections there are. Other states like TIME_WAIT and SYN we want to keep low.

$ netstat -anp | grep python3 | grep ESTAB | wc -l
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
138
noobpwnftw commented 1 year ago

This works, something else broke, probably irrelevant here.

ppigazzini commented 1 year ago

With the original config I got 80-100 connections; with keepalive, 120-140 connections.

noobpwnftw commented 1 year ago

This is expected: the old config gives you short-lived connections; with keepalive they are not closed but reused.

ppigazzini commented 1 year ago

With a 60s timeout.

noobpwnftw commented 1 year ago

Of inactivity, correct; those are just standard protocol-layer things we shouldn't care too much about. Nothing breaks if connections are closed or fail, at least as far as nginx is concerned: it will simply create more connections, as before, if all the existing ones are busy. From the backend's POV you can explicitly send headers to close some of them if that's desired, as described above.

ppigazzini commented 1 year ago

keepalive 64; seems to be a good number: the connection count drops with 32 and doesn't rise with 128.

maximmasiutin commented 1 year ago

I'm monitoring the servers (top, journalctl, mutt, mongodb monitoring).

Have you been able to identify where it slows down? If there is no high CPU or disk use, maybe we are at the limit of the nginx "worker_connections" parameter?

Do you have a status page somewhere, like the following:

location /nginx_status {
    stub_status;
}

?
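If such a location is enabled (ideally restricted to localhost), it can be queried with curl; the counters look roughly like this (illustrative numbers, assuming the location is exposed on the default local port):

    $ curl -s http://127.0.0.1/nginx_status
    Active connections: 3
    server accepts handled requests
     1200 1200 4500
    Reading: 0 Writing: 1 Waiting: 2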

ppigazzini commented 1 year ago

The VPS runs on OpenVZ in a shared way, so the load on the 4 cores makes no sense (python has the GIL, so it should load only 1 core at ~97%). When noob opened this issue the load was lower (between 15-60%) and I was monitoring the effect of the new nginx config.

top - 22:41:32 up 79 days, 20:45,  2 users,  load average: 1.53, 1.53, 1.47
Tasks:  35 total,   1 running,  34 sleeping,   0 stopped,   0 zombie
%Cpu0  :  30.3/2.0    32[|||||||||||||||||||||||||||||                                                             ]
%Cpu1  :  24.9/1.7    27[|||||||||||||||||||||||                                                                   ]
%Cpu2  :  21.9/1.0    23[|||||||||||||||||||||                                                                     ]
%Cpu3  :  29.6/1.7    31[||||||||||||||||||||||||||||                                                              ]
KiB Mem : 38.7/5242880  [|||||||||||||||||||||||||||||||||||                                                       ]
KiB Swap:  0.0/0        [                                                                                          ]

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20      224924   5052   3604 S        0.1   0:08.06 /lib/systemd/systemd --system --deserialize 15
    2 root      20                           S                       `- [kthreadd/1988]
    3 root      20                           S              0:00.01      `- [khelper]
  201 syslog    20      189028   1588    744 S        0.0   0:11.58  `- /usr/sbin/rsyslogd -n
  205 root      20       30020    880    716 S        0.0   0:04.55  `- /usr/sbin/cron -f
  206 message+  20       47860    600    176 S        0.0   0:00.37  `- /usr/bin/dbus-daemon --system --address=sys+
  226 root      20       70496   2420   2132 S        0.0   0:02.60  `- /lib/systemd/systemd-logind
  320 root      20      186468      4        S        0.0   0:00.02  `- /usr/bin/python3 /usr/share/unattended-upgr+
  326 root      20       14660      4        S        0.0            `- /sbin/agetty -o -p -- \u --noclear --keep-b+
  332 root      20       13016      4        S        0.0            `- /sbin/agetty -o -p -- \u --noclear tty2 lin+
  361 root      20       72292    536    420 S        0.0   0:00.03  `- /usr/sbin/sshd -D
13828 root      20      101548   6972   6008 S        0.1   0:00.01      `- sshd: fishtest [priv]
13855 fishtest  20      101548   4532   3548 S        0.1   0:02.16          `- sshd: fishtest@pts/0
13856 fishtest  20       21348   5184   3600 S        0.1   0:00.12              `- -bash
20634 fishtest  20       38408   3636   3068 R        0.1   0:00.79                  `- top
15293 root      20      101548   6888   5924 S        0.1   0:00.02      `- sshd: fishtest [priv]
15312 fishtest  20      101656   4280   3264 S        0.1   0:00.95          `- sshd: fishtest@pts/1
15313 fishtest  20       21348   4928   3336 S        0.1   0:00.40              `- -bash
  371 root      20      100980      4        S        0.0   0:00.02  `- /usr/sbin/saslauthd -a pam -c -m /var/run/s+
  377 root      20      100980      4        S        0.0   0:00.01      `- /usr/sbin/saslauthd -a pam -c -m /var/r+
  415 root      20       24180      4        S        0.0            `- /usr/sbin/xinetd -pidfile /run/xinetd.pid -+
  584 Debian-+  20       59456    988    884 S        0.0   0:01.36  `- /usr/sbin/exim4 -bd -q30m
23163 root      20       42100     44        S        0.0   0:01.54  `- /lib/systemd/systemd-udevd
23241 systemd+  20       71716    112     48 S        0.0   0:01.87  `- /lib/systemd/systemd-networkd
23251 root      20      205176  55056  47208 S        1.1   1:44.13  `- /lib/systemd/systemd-journald
13833 fishtest  20       76392   7336   6488 S        0.1            `- /lib/systemd/systemd --user
13834 fishtest  20      254548   2792   1060 S        0.1                `- (sd-pam)
16152 root      20      143192   9304   7304 S        0.2   0:00.03  `- nginx: master process /usr/sbin/nginx -g da+
20047 www-data  20      143376   3904   1764 S        0.1                `- nginx: worker process
20048 www-data  20      147564  11812   5912 S   1.0  0.2   0:06.47      `- nginx: worker process
20049 www-data  20      149612  14268   6096 S   5.0  0.3   0:46.80      `- nginx: worker process
20050 www-data  20      143636   7100   4868 S        0.1   0:00.06      `- nginx: worker process
20532 mongodb   20     3264244 601228  58816 S   3.7 11.5   0:40.31  `- /usr/bin/mongod --config /etc/mongod.conf
20580 fishtest  20     2209704 1.154g  34200 S  97.3 23.1   8:29.37  `- /home/fishtest/fishtest/server/env/bin/pyth+
20581 fishtest  20     1160196 267856  32892 S   6.0  5.1   1:07.11  `- /home/fishtest/fishtest/server/env/bin/pyth+
maximmasiutin commented 1 year ago

If the problem is not in the database queries and the queries execute fast, then we may consider implementing a client proxy configuration, so that a person running multiple machines can configure them all to use a single proxy with keep-alive, relieving the server from being connected to by each machine individually. For example, if I have 10 machines, I may configure them all to use a single proxy with keep-alive.

noobpwnftw commented 1 year ago

So after all we are GIL-blocked on that single core. Anyway, we'll try to make it do as little work as possible; there is nothing else to fix if there is simply too much work.

maximmasiutin commented 1 year ago

Oops, this is the problem:

20580 fishtest 20 2209704 1.154g 34200 S 97.3 23.1 8:29.37 `- /home/fishtest/fishtest/server/env/bin/pyth+

We may need a profiler to understand where it spends its CPU time, what it does, and where it is not optimal.

Please consider using a profiler to find the slow code; maybe we can rewrite it.

https://stackoverflow.com/questions/582336/how-do-i-profile-a-python-script
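For a long-running pserve process a sampling profiler is probably the least disruptive option, e.g. py-spy (not part of the current setup, so just a suggestion); attaching to the hot process from the top output above:

    pip install py-spy
    sudo py-spy top --pid 20580                                    # live view of the hottest functions
    sudo py-spy record -o profile.svg --pid 20580 --duration 60    # flame graph over 60 seconds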

noobpwnftw commented 1 year ago

@ppigazzini The fleet now has 1 worker, and we are back to a regular number of cores with all SPSA paused. Why is the load not reduced? This is something different from before.

ppigazzini commented 1 year ago

In the last 2 hours the CPU load has risen linearly, and we have only played with the nginx config...

image

noobpwnftw commented 1 year ago

If you look at the network load, what has it been sending all this time, except for the dip when it actually dropped all workers for a short while?

ppigazzini commented 1 year ago

I don't understand the linear rise in the last 2 hours.

And in the afternoon the network and the CPU loads had a big jump.

I wonder if it is the OpenVZ host that is a bit overloaded.

image

maximmasiutin commented 1 year ago

If the python app is single-threaded, you can run multiple python process copies listening on different ports, e.g. one listening on 6543, another on 6542, another on 6541, another on 6540, etc., and configure an nginx connection pool:

upstream backend_tests {
    server 127.0.0.1:6543;
    server 127.0.0.1:6542;
    server 127.0.0.1:6541;
    server 127.0.0.1:6540;
    keepalive 64;
}

That way there will be load balancing and you will be able to utilize multiple cores: one core serves port 6543, another port 6542, and so on.

noobpwnftw commented 1 year ago

The linear increase can be explained: shortly after you did the nginx config changes I launched 4k workers, which gradually started doing their build and run work. As a default fallback when things fail, they do some other work, but it was observed that about 2.5k of the 4k workers went there, which quickly drained that work, so they had nothing else to do than retry fishtest. This is why the increase was linear until the point where everything failed; after that I killed 3,999 of the workers, so only one is left now. What is mystifying is that yesterday we had more load than today, and it did almost fine. And even as of now we have unusually high CPU and network load.

ppigazzini commented 1 year ago

If the python app is single-threaded, you can run multiple python process copies listening on different ports, e.g. one listening on 6543, another on 6542, another on 6541, another on 6540, etc., and configure an nginx connection pool:

https://github.com/glinscott/fishtest/blob/be4894b010d7bdea2037cd54391dd70512bf44ac/server/fishtest/api.py#L18-L28

To make a long story short, we can load balance only the web endpoints (gaining nothing); the config is already done:

upstream backend_tests {
    server 127.0.0.1:6543;
    keepalive 64;
}

upstream backend_all {
    server 127.0.0.1:6544;
    #server 127.0.0.1:6545;
    keepalive 64;
}
noobpwnftw commented 1 year ago

Is it possible that the host server was infected with a crypto miner? We almost certainly did not do anything to create the jump at 7am.

ppigazzini commented 1 year ago

Is it possible that the host server was infected with a crypto miner? We almost certainly did not do anything to create the jump at 7am.

I suspect the host. The CPU and network graphs in the afternoon showed 2 jumps: -----_____-----

Anyway, I'm now rebuilding python 3.11.2 (I deleted it just minutes before you opened the issue, after monitoring PROD for 24h) and will try to switch back. I lowered the MongoDB cache.

ppigazzini commented 1 year ago

The load has dropped! I stopped the build of the old python and restored the usual MongoDB cache size.

top - 23:22:36 up 79 days, 21:26,  2 users,  load average: 0.64, 0.76, 0.97
Tasks:  36 total,   1 running,  35 sleeping,   0 stopped,   0 zombie
%Cpu0  :   3.0/0.0     3[|||                                                                                       ]
%Cpu1  :   6.8/0.0     7[||||||                                                                                    ]
%Cpu2  :   1.5/0.0     2[|                                                                                         ]
%Cpu3  :   3.0/0.8     4[||||                                                                                      ]
KiB Mem : 37.5/5242880  [||||||||||||||||||||||||||||||||||                                                        ]
KiB Swap:  0.0/0        [                                                                                          ]

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20      224924   5052   3604 S        0.1   0:08.08 /lib/systemd/systemd --system --deserialize 15
    2 root      20                           S                       `- [kthreadd/1988]
    3 root      20                           S              0:00.01      `- [khelper]
  201 syslog    20      189028   1588    744 S        0.0   0:11.67  `- /usr/sbin/rsyslogd -n
  205 root      20       30020    880    716 S        0.0   0:04.56  `- /usr/sbin/cron -f
  206 message+  20       47860    600    176 S        0.0   0:00.37  `- /usr/bin/dbus-daemon --system --address=sys+
  226 root      20       70496   2420   2132 S        0.0   0:02.60  `- /lib/systemd/systemd-logind
  320 root      20      186468      4        S        0.0   0:00.02  `- /usr/bin/python3 /usr/share/unattended-upgr+
  326 root      20       14660      4        S        0.0            `- /sbin/agetty -o -p -- \u --noclear --keep-b+
  332 root      20       13016      4        S        0.0            `- /sbin/agetty -o -p -- \u --noclear tty2 lin+
  361 root      20       72292    536    420 S        0.0   0:00.03  `- /usr/sbin/sshd -D
13828 root      20      101548   6972   6008 S        0.1   0:00.01      `- sshd: fishtest [priv]
13855 fishtest  20      101548   4532   3548 S   0.8  0.1   0:02.44          `- sshd: fishtest@pts/0
13856 fishtest  20       21348   5184   3600 S        0.1   0:00.13              `- -bash
21616 fishtest  20       38432   3508   3044 R   1.5  0.1   0:02.32                  `- top
15293 root      20      101548   6888   5924 S        0.1   0:00.03      `- sshd: fishtest [priv]
15312 fishtest  20      101656   4280   3264 S        0.1   0:01.00          `- sshd: fishtest@pts/1
15313 fishtest  20       21348   4960   3336 S        0.1   0:00.50              `- -bash
22882 fishtest  20        8552   2428   2216 S        0.0                            `- less /etc/nginx/sites-enabl+
  371 root      20      100980      4        S        0.0   0:00.02  `- /usr/sbin/saslauthd -a pam -c -m /var/run/s+
  377 root      20      100980      4        S        0.0   0:00.01      `- /usr/sbin/saslauthd -a pam -c -m /var/r+
  415 root      20       24180      4        S        0.0            `- /usr/sbin/xinetd -pidfile /run/xinetd.pid -+
  584 Debian-+  20       59456    988    884 S        0.0   0:01.36  `- /usr/sbin/exim4 -bd -q30m
23163 root      20       42100     44        S        0.0   0:01.54  `- /lib/systemd/systemd-udevd
23241 systemd+  20       71716    112     48 S        0.0   0:01.87  `- /lib/systemd/systemd-networkd
23251 root      20      205176  56548  48692 S        1.1   1:44.59  `- /lib/systemd/systemd-journald
13833 fishtest  20       76392   7336   6488 S        0.1            `- /lib/systemd/systemd --user
13834 fishtest  20      254548   2792   1060 S        0.1                `- (sd-pam)
16152 root      20      143192   9304   7304 S        0.2   0:00.03  `- nginx: master process /usr/sbin/nginx -g da+
20047 www-data  20      143376   3904   1764 S        0.1                `- nginx: worker process
20048 www-data  20      147564  12332   6320 S        0.2   0:15.31      `- nginx: worker process
20049 www-data  20      150352  15188   6456 S   3.8  0.3   2:20.07      `- nginx: worker process
20050 www-data  20      143636   7100   4868 S        0.1   0:00.10      `- nginx: worker process
22454 mongodb   20     3229620 550792  60800 S   0.8 10.5   0:40.26  `- /usr/bin/mongod --config /etc/mongod.conf
22502 fishtest  20     2169080 1.103g  33004 S  10.6 22.1   4:02.37  `- /home/fishtest/fishtest/server/env/bin/pyth+
22503 fishtest  20     1179660 311116  32940 S        5.9   0:43.96  `- /home/fishtest/fishtest/server/env/bin/pyth+
noobpwnftw commented 1 year ago

So we got hit by mixing python versions again due to pyobjcache and stuff?

ppigazzini commented 1 year ago

No, we never got mixed python versions: the script deletes the old folder before installing fishtest with the new python version.

cd fishtest
git config user.email 'you@example.com'
git config user.name 'your_name'


- wiki script (simplified with respect to the PROD one):
https://github.com/glinscott/fishtest/wiki/Fishtest-server-setup#test-a-prbranch
ppigazzini commented 1 year ago

And I changed nothing on the server, yet now the CPU and network load is low.

ppigazzini commented 1 year ago

The network load increase was only in bytes out.

image

maximmasiutin commented 1 year ago

We have a "run_cache" in rundb.py; that's why we must have a single instance.

    # Cache runs
    run_cache = {}
    run_cache_lock = threading.Lock()
    run_cache_write_lock = threading.Lock()

And pserve has a single instance serving the requests sequentially. We can configure uWSGI to serve requests in parallel to utilize multiple threads; however, in this case, we will have to use a distributed cache API such as the uWSGI caching framework or memcached. I can implement pymemcache and test the server locally if you wish; you would then install it to run via uWSGI / gunicorn behind Nginx instead of from pserve. What do you think?
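For illustration only, a minimal sketch of what a shared run-cache lookup could look like with pymemcache (the memcached address, key scheme, TTL and helper names are all assumptions, and real run documents would need proper serialization):

    # Sketch: shared cache in front of MongoDB, assuming memcached on 127.0.0.1:11211
    # and JSON-serializable run documents (real runs contain ObjectId/datetime fields).
    import json
    from pymemcache.client.base import Client

    client = Client(("127.0.0.1", 11211))

    def get_run_cached(run_id, fetch_from_db):
        key = "run:" + str(run_id)
        cached = client.get(key)
        if cached is not None:
            return json.loads(cached)
        run = fetch_from_db(run_id)                   # fall back to MongoDB on a miss
        client.set(key, json.dumps(run), expire=60)   # short TTL, tune as needed
        return run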

ppigazzini commented 1 year ago

We can test on DEV. But we have another external limit: the VPS is hosted on an oversold OpenVZ host, and if we use too much CPU the ISP shuts down the VPS. During the MongoDB backup we use cpulimit, dump the collections sequentially, and upload them to AWS sequentially.

fishtest@tests:~$ crontab -l
VENV=/home/fishtest/fishtest/server/env
UPATH=/home/fishtest/fishtest/server/utils

# Backup mongodb database every 6 hours and upload to s3
3 */6 * * * /usr/bin/cpulimit -l 50 -f -m -- sh ${UPATH}/backup.sh

https://github.com/glinscott/fishtest/blob/be4894b010d7bdea2037cd54391dd70512bf44ac/server/utils/backup.sh#L12-L20