Open alexander-haller opened 11 months ago
In GitLab by @mhxion on Dec 1, 2023, 02:29
Querying 600+ users from production also takes 5-6 minutes, when it should take at most 2 minutes.
Actually, thinking it over, it makes sense. In asynchronous terms, if 10 concurrent tasks take 5 seconds in total, i.e. 0.5 seconds per task on average, that doesn't mean 5 tasks would take 5 × 0.5 s = 2.5 seconds. In fact, 5 tasks could still take 5 seconds: with concurrency, the total wall time is dominated by the slowest task, not by the number of tasks. With 10 tasks running concurrently, more tasks simply get finished during the same wait; with 5 tasks, less work overlaps that wait, so the total time doesn't shrink proportionally.
This is common knowledge of course :D It just slipped my mind. This, however, means that querying more users will not necessarily mean slower performance. elAPI should still maintain the 5-6 minute mark for 600-2500 users.
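A minimal sketch (not elAPI code) that illustrates the point: with `asyncio`, the wall time of a batch of concurrent tasks is bounded by the slowest task, not by the number of tasks.

```python
# Minimal illustration: each simulated request "takes" 1 second, yet both
# batches finish in roughly 1 second of wall time because the waits overlap.
import asyncio
import time


async def fake_request() -> None:
    await asyncio.sleep(1)  # stand-in for one request to the users endpoint


async def timed_batch(n_tasks: int) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_request() for _ in range(n_tasks)))
    return time.perf_counter() - start


print(asyncio.run(timed_batch(5)))   # ~1.0 s
print(asyncio.run(timed_batch(10)))  # still ~1.0 s, not 2x
```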
The 100% CPU usage is still an issue though.
In GitLab by @mhxion on Dec 1, 2023, 15:04
> The 100% CPU usage is still an issue though.
Having multiple cores fixes the issue.
In GitLab by @mhxion on Dec 7, 2023, 02:30
While trying to debug the slow-performance issue, I ran some tests on how much time each request takes when requests to the users endpoint are made concurrently. All times are measured in seconds. The green line shows the mean time. We care about the following two things to gauge performance:

1. We notice the first request always takes the longest, which makes sense as this is when elAPI establishes the HTTP connection(s) for the first time.
2. Adding 4 cores to `dev-002` also nearly halves the wait period! This suggests that Python's asyncio is distributing the tasks well across all the virtual cores (though only two threads are running, as shown by `htop`).
The following shows a similar analysis for the first 10 concurrent requests.
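For reference, a hedged sketch of the kind of per-request timing described above. This is not elAPI's actual code; the host and authorization header are placeholders.

```python
# Time each concurrent GET to the users endpoint and report the first and
# mean request times. URL and header are placeholders, not real elAPI config.
import asyncio
import statistics
import time

import httpx

USERS_URL = "https://dev-002.example.org/api/v2/users"  # placeholder host
HEADERS = {"Authorization": "<api-key>"}                 # placeholder token


async def timed_get(client: httpx.AsyncClient, user_id: int) -> float:
    start = time.perf_counter()
    await client.get(f"{USERS_URL}/{user_id}", headers=HEADERS)
    return time.perf_counter() - start


async def main() -> None:
    async with httpx.AsyncClient(timeout=30) as client:
        times = await asyncio.gather(*(timed_get(client, uid) for uid in range(1, 101)))
    # The first request usually takes the longest: it pays for establishing
    # the HTTP connection(s); later requests reuse the pooled connections.
    print("first:", round(times[0], 2), "mean:", round(statistics.mean(times), 2))


asyncio.run(main())
```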
In GitLab by @mhxion on Dec 7, 2023, 19:43
We ran a simple stress test on the production server and it produced similarly satisfying results to those seen on `dev-002`. The stress test mainly involved monitoring the load distribution across the 4 virtual cores in `htop`.
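As an aside, the per-core load we watch in `htop` can also be sampled programmatically; a tiny sketch using `psutil` (an assumption for illustration, not something elAPI uses):

```python
# Sample per-core CPU utilisation over a 1-second window, e.g. while a
# bill-teams run is in progress; prints something like [97.0, 95.5, 12.3, 8.1].
import psutil

print(psutil.cpu_percent(interval=1, percpu=True))
```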
In GitLab by @mhxion on Dec 8, 2023, 24:13
`elAPI bill-teams` no longer slows down when used with read-only API tokens, which was observed before. Now, requests with read-only tokens perform just as well as with read-write tokens. It's possible the culprit was this eLabFTW bug, which has been fixed in eLabFTW v4.9.
Long-term idea and very low prio: a `stresstest` or `loadtest` command. It takes either a number of users and/or a number of experiments, and/or auto-increments both until a response-time threshold is reached. (Maybe) cleans up the created data afterwards.
Unsure if this is really worth it, tbh; we probably will not run into load issues in the foreseeable future. But it would be nice to give the community a way to probe the lower limits of the hardware requirements.
Could be a task for new members.
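A purely hypothetical sketch of what the core loop of such a `stresstest`/`loadtest` command could look like; the endpoint, token, and threshold are placeholders, and none of this exists in elAPI.

```python
# Keep doubling the number of concurrent requests until the mean response
# time crosses a threshold. Everything here is illustrative only.
import asyncio
import statistics
import time

import httpx

API_URL = "https://elabftw.example.org/api/v2/users"  # placeholder endpoint
HEADERS = {"Authorization": "<api-key>"}               # placeholder token
THRESHOLD_S = 2.0                                      # stop once the mean exceeds this


async def burst(client: httpx.AsyncClient, n: int) -> float:
    async def one() -> float:
        start = time.perf_counter()
        await client.get(API_URL, headers=HEADERS)
        return time.perf_counter() - start

    return statistics.mean(await asyncio.gather(*(one() for _ in range(n))))


async def stresstest() -> None:
    n = 10
    async with httpx.AsyncClient(timeout=60) as client:
        while (mean := await burst(client, n)) < THRESHOLD_S:
            print(f"{n} concurrent requests -> mean {mean:.2f}s")
            n *= 2  # auto-increment the load
    print(f"Threshold reached at {n} concurrent requests (mean {mean:.2f}s)")


asyncio.run(stresstest())
```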
So far, we've tried:

- `MAX_PHP_MEMORY=4G` (default 2G)
- `KEEPALIVE_TIMEOUT=180s` (default 10)
- `PHP_MAX_CHILDREN=500` (default 200)
- `PHP_MAX_EXECUTION_TIME=3000` (default 600)
- `keepalive_requests 1000;` in `/etc/nginx/nginx.conf` from inside the container

But none of them allowed sending more than 500 (and sometimes even just 100) concurrent requests per second. `uvloop`, last night, helped expose the following error message while testing with `bill-teams`:
SQLSTATE[HY000] [1040] Too many connections
Googling about it led to a couple of SO answers. It turns out MySQL by default allows only 151 concurrent connections. So the easy fix would be increasing this number, which I did with a `SET GLOBAL max_connections = 1700;` SQL query. And lo and behold, for the first time I was able to make 1000 concurrent requests/second to `dev-002` without the server breaking down. With `max_connections = 1700` I managed to send up to 1500 requests/s. There was no performance improvement, which is okay, as the goal of the test was to find out why we could not send 500+ async requests/s. This DB change, of course, is not persistent across container reboots. I was finally convinced that the solution to our async networking bottleneck is MySQL's `max_connections`.
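For reference, a sketch of checking and raising `max_connections` from Python, assuming a `pymysql` connection and placeholder credentials; as noted, `SET GLOBAL` does not survive a container restart, so persisting it would need a `my.cnf` entry.

```python
# Inspect and raise MySQL's max_connections at runtime (non-persistent).
# Host, user and password are placeholders.
import pymysql

conn = pymysql.connect(host="localhost", user="root", password="<password>")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW VARIABLES LIKE 'max_connections'")
        print(cur.fetchone())            # e.g. ('max_connections', '151')
        cur.execute("SET GLOBAL max_connections = 1700")
        cur.execute("SHOW VARIABLES LIKE 'max_connections'")
        print(cur.fetchone())            # ('max_connections', '1700')
finally:
    conn.close()
```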
- `MAX_PHP_MEMORY=16G`, up from yesterday's `4G` (default 2G)

That was it. For some reason, I can no longer reproduce the results from last night. I.e., the server breaks again if I try to send 1000 requests/s.
❯ elapi bill-teams teams-info --export ~
Getting users data: ━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25% 0:00:09
WARNING Request for 'users' data was received by the server but request was not successful. Response status: 502. get_information.py:107
Exception details: 'JSONDecodeError('Expecting value: line 1 column 1 (char 0)')'. Response: '<!DOCTYPE html>
<html lang="en">
<head>
<!-- Simple HttpErrorPages | MIT X11 License | https://github.com/AndiDittrich/HttpErrorPages -->
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>We've got some trouble | 502 - Webservice currently unavailable</title>
</head>
<body>
<div class="cover">
<h1>Webservice currently unavailable <small>Error 502</small></h1>
<p class="lead">We've got some trouble with our backend upstream cluster.<br />
Our service team has been dispatched to bring it back online.</p>
</div>
</body>
</html>
'
INFO elapi will try again. cli.py:120
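A hedged sketch of the failure mode shown above: the 502 comes back as an HTML error page, so calling `.json()` on it raises `JSONDecodeError` unless the status code is checked first. The retry logic below is illustrative and not elAPI's actual implementation.

```python
# Retry a GET when the reverse proxy answers with a non-200 (e.g. the 502
# HTML page above), instead of trying to JSON-decode the HTML body.
import asyncio

import httpx


async def get_json_with_retry(client: httpx.AsyncClient, url: str, retries: int = 3):
    for attempt in range(1, retries + 1):
        response = await client.get(url)
        if response.status_code == 200:
            return response.json()
        await asyncio.sleep(2 ** attempt)  # back off before trying again
    response.raise_for_status()  # give up: surface the last HTTP error
```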
Weird. The same changes from last night no longer work today! ¯\_(ツ)_/¯ Will try again later.
In GitLab by @mhxion on Dec 1, 2023, 24:02
I did some digging into why the responses are slower all of a sudden.
Monitoring server load
Since no. 2 wasn't a latency issue, I wanted to look into the server load. This can also count as a minimal stress test.
When I run `bill-teams` targeting `dev-002` on my own machine and monitor the `htop` stats on `dev-002` through SSH, I get the following results. Worth noting, the slow responses mainly come from when we make asynchronous requests to get every user's data.

i. Making 100 concurrent connections (so that's 100 concurrent `GET` requests to the user endpoint) gives us the following `htop` change. The number of 188 threads makes sense (before starting `bill-teams` it was about 86 threads). That many concurrent connections give us 100% CPU usage. This is the default number of concurrent connections of elAPI, which gave us really satisfying performance before on `dev-002`.

ii. Making 5 concurrent connections gives us the following `htop`. That also causes an almost 100% CPU usage spike. In fact, having 100 concurrent connections doesn't improve the overall timing by that much. I should have done this test long ago so I would have some more data to compare :/ I am not sure how bad this is.
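For context, a hedged sketch of how a concurrency cap like the "100 concurrent connections" above can be enforced; this is not elAPI's actual implementation, and the URL and header are placeholders.

```python
# Cap the number of in-flight requests with a semaphore and cap the httpx
# connection pool to match. max_concurrent=100 mirrors the default mentioned above.
import asyncio

import httpx

USERS_URL = "https://dev-002.example.org/api/v2/users"  # placeholder host
HEADERS = {"Authorization": "<api-key>"}                 # placeholder token


async def fetch_all_users(user_ids: list[int], max_concurrent: int = 100) -> list[dict]:
    limits = httpx.Limits(max_connections=max_concurrent)
    semaphore = asyncio.Semaphore(max_concurrent)

    async with httpx.AsyncClient(limits=limits, timeout=30) as client:

        async def fetch_one(uid: int) -> dict:
            async with semaphore:  # at most `max_concurrent` requests in flight
                response = await client.get(f"{USERS_URL}/{uid}", headers=HEADERS)
                return response.json()

        return await asyncio.gather(*(fetch_one(uid) for uid in user_ids))


# usage: asyncio.run(fetch_all_users(list(range(1, 101))))
```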