uhd-urz / elAPI

An extensible API client for eLabFTW
GNU Affero General Public License v3.0

Stress and performance test #8

Open alexander-haller opened 9 months ago

alexander-haller commented 9 months ago

In GitLab by @mhxion on Dec 1, 2023, 24:02

I did some digging into why the responses are slower all of a sudden.

  1. The timeout issue can be solved by just increasing the timeout to 180s (previously it was 60s). In itself, this isn't a big issue. But the following might be related to it.
  2. The responses are slower than before. Querying 2500 users from dev-002 takes 6-7 minutes where just a week ago it was taking 4-5 minutes. Querying 600+ users from production also takes 5-6 minutes where it should take at most 2 minutes. These timings remain the same even when I run bill-teams on the servers through SSH!

Monitoring server load

Since no. 2 wasn't a latency issue, I wanted to look into the server load. This can also count as a minimal stress test.

When I run bill-teams targeting dev-002 on my own machine and monitor htop stats on dev-002 through SSH, I get the following results. Worth noting, the slow responses mainly come from when we make asynchronous requests to get every user's data.

i. Making 100 concurrent connections (so that's 100 concurrent GET requests to the users endpoint) gives us the following htop change. image The thread count of 188 makes sense (before starting bill-teams it was about 86 threads). That many concurrent connections gives us 100% CPU usage. This is elAPI's default number of concurrent connections, which gave us really satisfying performance on dev-002 before.

ii. Making 5 concurrent connections gives us the following htop. image

That also causes an almost 100% CPU usage spike. In fact, having 100 concurrent connections doesn't improve the overall timing that much over 5. I should have done this test long ago so I would have more data to compare :/ I am not sure how bad this is.
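For context, the kind of concurrent fetch being measured here boils down to something like the sketch below (a rough approximation with asyncio and httpx, not elAPI's actual code; the host, token, and user IDs are placeholders):

```python
# Rough, standalone approximation of the concurrent per-user GET requests
# that bill-teams performs. Host, token, and the concurrency cap are placeholders.
import asyncio
import httpx

BASE_URL = "https://dev-002.example.org/api/v2"  # placeholder host
API_TOKEN = "..."                                # placeholder token
MAX_CONNECTIONS = 100                            # the default number of concurrent connections mentioned above

async def fetch_user(client: httpx.AsyncClient, sem: asyncio.Semaphore, user_id: int) -> dict:
    # The semaphore caps how many GET requests are in flight at once.
    async with sem:
        response = await client.get(f"/users/{user_id}")
        response.raise_for_status()
        return response.json()

async def fetch_all_users(user_ids: list[int]) -> list[dict]:
    sem = asyncio.Semaphore(MAX_CONNECTIONS)
    async with httpx.AsyncClient(
        base_url=BASE_URL,
        headers={"Authorization": API_TOKEN},
        timeout=180,  # the increased timeout from point 1 above
    ) as client:
        return await asyncio.gather(*(fetch_user(client, sem, uid) for uid in user_ids))

# Example: fetch ~2500 users concurrently, as in the dev-002 test.
# asyncio.run(fetch_all_users(list(range(1, 2501))))
```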

alexander-haller commented 9 months ago

In GitLab by @mhxion on Dec 1, 2023, 02:29

Querying 600+ users from production also takes 5-6 minutes where it should take at most 2 minutes.

Actually, thinking it over, it makes sense. In asynchronous terms, if 10 concurrent tasks take 5 seconds in total, i.e., 0.5 second per task on average, that doesn't mean 5 tasks would take 5 x 0.5 second, or 2.5 seconds. In fact, 5 tasks could still take 5 seconds. The tasks run concurrently, so the total wall time is governed by how long the responses take, not by how many tasks there are: while you wait on any one response, the other tasks keep making progress, so adding or removing tasks barely changes the overall wait.

This is common knowledge of course :D It just slipped my mind. This, however, means that querying more users will not mean slower performance. elAPI should still maintain the 5-6 minute waiting mark for 600-2500 users.
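A toy illustration of that point: for I/O-bound tasks run concurrently, the wall time is set by the round-trip time rather than by the task count, so 5 and 10 tasks finish in roughly the same time (asyncio.sleep stands in for a request here):

```python
import asyncio
import time

async def fake_request() -> None:
    # Stand-in for one GET request that takes ~1 second of server time.
    await asyncio.sleep(1)

async def measure(task_count: int) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_request() for _ in range(task_count)))
    return time.perf_counter() - start

async def main() -> None:
    for n in (5, 10):
        elapsed = await measure(n)
        # Both print roughly 1.00s: wall time does not scale with the task count.
        print(f"{n} concurrent tasks took {elapsed:.2f}s")

asyncio.run(main())
```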

The 100% CPU usage is still an issue though.

alexander-haller commented 9 months ago

In GitLab by @mhxion on Dec 1, 2023, 15:04

The 100% CPU usage is still an issue though.

Having multiple cores fixes the issue πŸ˜€ image

alexander-haller commented 9 months ago

In GitLab by @mhxion on Dec 7, 2023, 02:30

Concurrent request analysis

While trying to debug the issue with slow performance, I ran some tests on how much time each request takes when requests to the users endpoint are made concurrently. All times are measured in seconds. The green line shows the mean time. We care about the following two things to gauge performance:

  1. The green line. The lower the green line the better.
  2. The X-axis. The shorter the time span on the X-axis for a fixed number of requests, the better.

image

We notice the first request always takes the longest, which makes sense as this is when elAPI establishes the HTTP connection(s) for the first time. Adding 4 cores to dev-002 also cuts the wait period almost in half! This suggests that Python asyncio is distributing the work well across all the virtual cores (though only two threads are running, as shown by htop).

The following shows a similar analysis for the first 10 concurrent requests. image
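For reproducibility, per-request timings like the ones plotted above can be collected with a small wrapper along these lines (a sketch, not the actual measurement code; the URL and token are placeholders):

```python
# Sketch: time each concurrent GET request individually and report the mean,
# which corresponds to the green line in the plots above.
import asyncio
import statistics
import time

import httpx

async def timed_get(client: httpx.AsyncClient, url: str, timings: list[float]) -> None:
    start = time.perf_counter()
    await client.get(url)
    timings.append(time.perf_counter() - start)

async def run(n_requests: int) -> None:
    timings: list[float] = []
    async with httpx.AsyncClient(headers={"Authorization": "..."}) as client:  # placeholder token
        await asyncio.gather(
            *(timed_get(client, "https://dev-002.example.org/api/v2/users/1", timings)
              for _ in range(n_requests))
        )
    print(f"mean: {statistics.mean(timings):.3f}s, max: {max(timings):.3f}s")

asyncio.run(run(10))
```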

alexander-haller commented 9 months ago

In GitLab by @mhxion on Dec 7, 2023, 19:43

We ran a simple stress test on the production server and it produced similarly satisfying results to those seen on dev-002. The stress test mainly involved monitoring load distribution across the 4 virtual cores in htop.

alexander-haller commented 9 months ago

In GitLab by @mhxion on Dec 8, 2023, 24:13

elAPI bill-teams no longer slows down when used with read-only API tokens, which was observed before. Now, requests with read-only tokens perform just as well as with read-write tokens. It's possible the culprit was this eLabFTW bug, which has been fixed in eLabFTW v4.9.

alexander-haller commented 4 months ago

Long-term idea and very low prio: a stresstest or loadtest command (sketched below). Takes either a number of users and/or a number of experiments, and/or auto-increments both until a response threshold is reached. (Maybe) cleans up the created stuff afterwards.

Unsure if this is really worth it, tbh; we probably will not run into load issues for the foreseeable future. It would be nice, though, to give the community a way to test lower limits for HW requirements.

Could be a task for new members.
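If someone does pick this up, the core loop of such a command could look roughly like the following. This is a sketch only; the endpoint paths, the payload, the Location-header cleanup, and the threshold are assumptions, not a worked-out design:

```python
# Rough sketch of the proposed stresstest/loadtest loop: keep creating users
# until the mean response time crosses a threshold, then (maybe) clean up.
import statistics
import time

import httpx

BASE_URL = "https://elabftw.example.org/api/v2"  # placeholder host
THRESHOLD_SECONDS = 2.0                          # placeholder response threshold

def probe(client: httpx.Client, sample_size: int = 10) -> float:
    # Measure the mean response time of the users endpoint.
    timings = []
    for _ in range(sample_size):
        start = time.perf_counter()
        client.get("/users")
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)

def stresstest() -> None:
    created = []
    with httpx.Client(base_url=BASE_URL, headers={"Authorization": "..."}) as client:
        while probe(client) < THRESHOLD_SECONDS:
            # Auto-increment: create one more user, then probe again.
            # Payload is a placeholder; assumes the API returns the new
            # resource's URL in the Location header.
            response = client.post("/users", json={"email": f"load{len(created)}@example.org"})
            created.append(response.headers.get("location"))
        # (Maybe) clean up the created users afterwards.
        for location in created:
            if location:
                client.delete(location)

stresstest()
```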

mhxion commented 1 week ago

Update from last night:

So far, we've tried:

  1. Increasing MAX_PHP_MEMORY=4G (default 2G)
  2. Increasing NGINX KEEPALIVE_TIMEOUT=180s (default 10s)
  3. Increasing PHP_MAX_CHILDREN=500 (default 200)
  4. Increasing PHP_MAX_EXECUTION_TIME=3000 (default 600)
  5. Increasing keepalive_requests 1000; in /etc/nginx/nginx.conf from inside the container

But none of them lets us send more than 500 (and sometimes only 100) concurrent requests per second. Last night, uvloop helped expose the following error message while testing with bill-teams:

SQLSTATE[HY000] [1040] Too many connections

Googling about it led to a couple of SO answers. It turns out MySQL by default allows only 151 concurrent connections. So the easy fix would be to increase this number, which I did with the SQL query SET GLOBAL max_connections = 1700;. And lo and behold, for the first time I was able to make 1000 concurrent requests/second to dev-002 without the server breaking down. With max_connections = 1700 I managed to send up to 1500 requests/s. There was no performance improvement, which is okay, as the goal of the test was to find out why we could not send 500+ async requests/s. This DB change, of course, is not persistent across container restarts. I was finally convinced that the solution to our async networking bottleneck is MySQL's max_connections.
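To check whether that limit is actually being hit during a run, something like the following can be used (a sketch assuming PyMySQL and direct access to the eLabFTW database from inside the container network; host and credentials are placeholders):

```python
# Quick check of MySQL's connection limit and current usage while a test runs.
# Assumes PyMySQL and direct DB access; host and credentials are placeholders.
import pymysql

connection = pymysql.connect(host="127.0.0.1", user="root", password="...", database="elabftw")
try:
    with connection.cursor() as cursor:
        cursor.execute("SHOW VARIABLES LIKE 'max_connections'")
        print(cursor.fetchone())  # e.g. ('max_connections', '151') by default
        cursor.execute("SHOW STATUS LIKE 'Threads_connected'")
        print(cursor.fetchone())  # how many connections are open right now
        # Non-persistent bump, as mentioned above; resets on container restart:
        # cursor.execute("SET GLOBAL max_connections = 1700")
finally:
    connection.close()
```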

Update from today:

  1. Increasing MAX_PHP_MEMORY=16G from yesterday's 4G (default 2G)

That was it. For some reason, I can no longer reproduce the results from last night, i.e., the server breaks again if I try to send 1000 requests/s.

❯ elapi bill-teams teams-info --export ~
Getting users data: ━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  25% 0:00:09
WARNING  Request for 'users' data was received by the server but request was not successful. Response status: 502.       get_information.py:107
         Exception details: 'JSONDecodeError('Expecting value: line 1 column 1 (char 0)')'. Response: '<!DOCTYPE html>
         <html lang="en">
         <head>
             <!-- Simple HttpErrorPages | MIT X11 License | https://github.com/AndiDittrich/HttpErrorPages -->

             <meta charset="utf-8" />
             <meta http-equiv="X-UA-Compatible" content="IE=edge" />
             <meta name="viewport" content="width=device-width, initial-scale=1" />

             <title>We've got some trouble | 502 - Webservice currently unavailable</title>

             <style type="text/css">/* inlined normalize.css v5.0.0 and Simple HttpErrorPages styles omitted for brevity */</style>
         </head>

         <body>
             <div class="cover">
                 <h1>Webservice currently unavailable <small>Error 502</small></h1>
                 <p class="lead">We've got some trouble with our backend upstream cluster.<br />
         Our service team has been dispatched to bring it back online.</p>
             </div>

             </body>
         </html>
         '
INFO     elapi will try again.                                                                                                       cli.py:120

Weird. The same changes from last night no longer work today! Β―\_(ツ)_/Β― Will try again later.

Resources: