@dianabarsan I'm having trouble reproducing this locally. This is my equivalent screenshot running master, also on a Ryzen 9 5900X. This is similar to when I run 4.1.x.
The differences that stand out comparing mine vs yours:
- CPU: 10% vs 100%
- MEM USAGE: 122 MiB vs 2 MiB
- NET I/O and BLOCK I/O: yours is basically non-existent
Firstly, let's make sure we're doing the same thing. Are you just running npm run wdio-local?
Was that screenshot taken during startup, or while running the tests?
Can you reproduce it and check the haproxy logs?
Any other ideas?
I've installed latest master on https://march1-archv3-scalability.dev.medicmobile.org and I'm running the scalability test against it, and I'm seeing these numbers:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
8d3c55f16902 compose_nginx_1 0.55% 189.5MiB / 14.64GiB 1.26% 34.6MB / 32.5MB 3.26MB / 16.4kB 9
91cb5aa3c250 compose_api_1 121.96% 550.3MiB / 14.64GiB 3.67% 197MB / 61.7MB 0B / 16.4kB 12
3b154e400283 compose_sentinel_1 0.00% 46.57MiB / 14.64GiB 0.31% 1.12MB / 84.4kB 1.78MB / 0B 12
73cbfbf7a697 compose_haproxy_1 521.87% 5.884GiB / 14.64GiB 40.19% 228MB / 227MB 0B / 24.6kB 13
bf878b915513 compose_healthcheck_1 0.23% 23.91MiB / 14.64GiB 0.16% 1.69MB / 875kB 28.3MB / 12.6MB 3
I added a script that will query docker stats every second and write to a file. This was the last entry before the whole AWS instance became unresponsive.
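For reference, a minimal sketch of what such a polling script could look like (the output file name and the use of --no-stream are my assumptions, not necessarily the actual script):

```sh
#!/bin/sh
# Append a one-shot snapshot of container stats to a log file every second.
OUT=docker-stats.log   # assumed output file name
while true; do
  # --no-stream prints a single snapshot instead of streaming continuously
  docker stats --no-stream >> "$OUT"
  sleep 1
done
```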
Was there anything in the haproxy logs? 0B written to BLOCK I/O just seems wrong?
Theory... lua is locking up on this line:
| 0x557059f82403 [48 85 d0 75 20 48 89 f0]: lua_take_global_lock+0x23/0x4c
In 2.4 haproxy introduced multithreading in lua: https://www.haproxy.com/blog/announcing-haproxy-2-4/

They also introduced a new directive, lua-load-per-thread, which runs the lua script once per thread; that is what we want for parsing each request. Replacing lua-load with lua-load-per-thread seems to work, but because I didn't manage to reproduce this the first time, I'm not sure whether it's better or the same. This thread-lock explanation would also help explain why it shows up so badly in the scalability suite.
More info in this haproxy issue: https://github.com/haproxy/haproxy/issues/1625
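For illustration, the change amounts to something like this in the haproxy config (the lua script path here is an assumption, not the actual CHT path):

```
global
    # Before: lua-load creates a single lua state shared by all threads,
    # so every request that runs the script contends on the global lua lock
    # lua-load /usr/local/etc/haproxy/parse_request.lua

    # After (haproxy 2.4+): one lua state per thread, so request parsing
    # no longer contends on the global lock
    lua-load-per-thread /usr/local/etc/haproxy/parse_request.lua
```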
@dianabarsan Can you try this and see if it's any better?
I've made the change in the haproxy config, and I'm running the scalability test locally:
So far, these are the stats while Sentinel is processing all pushed docs:
I ran the scalability suite against a local instance and recorded docker stats for the duration.
This is the CPU % usage for the haproxy container:
I'll run the same test against the test instance (which is now unresponsive).
@mrjones-plip This change does not affect any released version of CHT as it was introduced with the not-yet-released haproxy upgrade.
Thanks for catching this @garethbowen !
I ran the scalability test against this branch https://github.com/medic/cht-core/tree/8130-haproxy-lua-check and this is the graph of haproxy CPU usage with 100 concurrent users.
The instance didn't crash and the test finished successfully, though the CPU usage is still very high.
I've run the scalability suite on the same machine as above, but with 4.1.0 deployed. This is the CPU usage for haproxy:
They are strikingly similar, so I believe that we've resolved the regression with the lua script change.
It's still worrying that so much CPU time is used for haproxy.
Thanks @dianabarsan - that's great testing.
Let's get this issue resolved before 4.2.0, and re-evaluate our auditing solution later?
Ready for AT in 8130-haproxy-lua-lock
This is a performance improvement which should not change functionality at all. The AT steps are the same as discussed in multiple comments: https://github.com/medic/cht-core/issues/8076#issuecomment-1439060210
@dianabarsan and I have agreed that we've ATed this, so no further testing is needed.
Describe the bug
During e2e tests, I have frequently noticed the haproxy container reaching absurd CPU numbers on my machine (I have an AMD Ryzen 9 5900X). Very frequently, haproxy would take up to 100% of one CPU core.
Then, during a stress test on an AWS hosted distributed setup, every time I loaded the instance with more than 100 users, the suite failed because haproxy had stopped responding. The container was not killed or restarted, but the instance was not reachable and API reported being unable to connect to haproxy.
Upon inspecting haproxy logs, at first I noticed:
On a subsequent retry, I saw:
Right before this happened, this is a snapshot of docker stats on the AWS instance: haproxy is using 282% CPU and 4.349 GiB of memory.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Haproxy shouldn't take up this many resources. Also, the container should somehow restart automatically instead of just hanging when this issue happens.
Environment