dianabarsan closed this issue 1 year ago
This just happened again:
<149>Apr 4 14:58:20 haproxy[25]: Server couchdb-servers/couchdb-1.local is UP, reason: Layer7 check passed, code: 0, info: "via agent : up", check duration: 3ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] (25) : Server couchdb-servers/couchdb-1.local is UP, reason: Layer7 check passed, code: 0, info: "via agent : up", check duration: 3ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[NOTICE] (1) : haproxy version is 2.6.12-f588462
[NOTICE] (1) : path to executable is /usr/local/sbin/haproxy
[ALERT] (1) : Current worker (25) exited with code 137 (Killed)
[ALERT] (1) : exit-on-failure: killing every processes with SIGTERM
This user seems to report the same issue: https://github.com/haproxy/haproxy/issues/1984 , and it seems to affect the same haproxy version that we are using, 2.6.x.
The suggestion is to upgrade to 2.7 and see if this occurs again. I'll try that.
Even with 2.7, I'm still seeing very high memory usage:
This has been reported multiple times. This seems very relevant to us: https://github.com/docker-library/haproxy/issues/194
CPU usage spikes, memory seems constant.
The more I look into it, the more this seems to be a memory leak. I'll update the reproduction steps and issue description once I'm certain.
It looks like memory usage plateaus at 17.56GB. This was tested locally, as the scalability test instance only has 15GB of RAM, and haproxy exceeding that value is probably what causes the instance to become unresponsive. (Monitoring metrics pending.)
This issue appears very relevant: https://github.com/docker-library/haproxy/issues/194
Quoting https://github.com/docker-library/haproxy/issues/194#issuecomment-1452758721:
I think the root cause is HAProxy allocating resources for each connection, up to the maximum, and deriving that maximum (maxconn) from the (very high) kernel default file descriptor limit, which is the effective limit when the container runtime file limit is infinity.
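If that were the mechanism here, the mitigation implied by the linked comment would be to cap the container's file descriptor limit so haproxy derives a bounded maxconn. A minimal sketch of that in compose terms, with illustrative nofile values that are my assumption, not tested settings (as it turned out, see below, our maxconn was actually set explicitly in the config):
haproxy:
  # Hypothetical mitigation from the linked issue: cap the open-file limit so
  # haproxy cannot derive a huge maxconn from an effectively unlimited
  # container default. The 65536 values are illustrative only.
  ulimits:
    nofile:
      soft: 65536
      hard: 65536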
I've traced the high memory usage to config settings that were added at some point in the archv3 work: https://github.com/medic/cht-core/blob/master/haproxy/default_frontend.cfg#L13-L14
Without these settings, memory usage plateaus at ~2GB. I'm testing this by running the e2e tests and the scalability suite against it.
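For reference, this is roughly what those directives look like in a haproxy config, using the values we were shipping (a sketch; I'm assuming they sit in the global section, the exact placement in default_frontend.cfg may differ):
global
    maxconn 150000        # upper bound on concurrent connections haproxy will accept
    tune.bufsize 1638400  # ~1.6MB per request/response buffer, versus haproxy's 16KB default
Back of the envelope, 150,000 connections times a ~1.6MB buffer each is hundreds of GB in the worst case, so even a fraction of that in flight comfortably explains a 17GB plateau.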
The solution to this was to:
[NOTICE] (1) : haproxy version is 2.6.13-234aa6d
[NOTICE] (1) : path to executable is /usr/local/sbin/haproxy
[WARNING] (1) : Exiting Master process...
[WARNING] (20) : Proxy http-in stopped (cumulated conns: FE: 74, BE: 0).
[WARNING] (20) : Proxy couchdb-servers stopped (cumulated conns: FE: 0, BE: 74).
[WARNING] (20) : Proxy <HTTPCLIENT> stopped (cumulated conns: FE: 0, BE: 0).
rsyslogd: pidfile '/run/rsyslogd.pid' and pid 14 already exist.
If you want to run multiple instances of rsyslog, you need to specify
different pid files for them (-i option).
rsyslogd: run failed with error -3000 (see rsyslog.h or try https://www.rsyslog.com/e/3000 to learn what that number means)
[alert] 122/114035 (1) : parseBasic loaded
[alert] 122/114035 (1) : parseCookie loaded
[alert] 122/114035 (1) : replacePassword loaded
[alert] 122/114035 (1) : parseBasic loaded
[alert] 122/114035 (1) : parseCookie loaded
[alert] 122/114035 (1) : replacePassword loaded
[NOTICE] (1) : New worker (20) forked
[NOTICE] (1) : Loading success.
[NOTICE] (1) : haproxy version is 2.6.13-234aa6d
[NOTICE] (1) : path to executable is /usr/local/sbin/haproxy
[WARNING] (1) : Exiting Master process...
[WARNING] (20) : Proxy http-in stopped (cumulated conns: FE: 114, BE: 0).
[WARNING] (20) : Proxy couchdb-servers stopped (cumulated conns: FE: 0, BE: 114).
[WARNING] (20) : Proxy <HTTPCLIENT> stopped (cumulated conns: FE: 0, BE: 0).
- change tune.bufsize from 1638400 to 32768
- change maxconn from 150000 to 60000
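In config terms the change amounts to something like this (a sketch; again assuming the directives live in the global section):
global
    maxconn 60000       # was 150000
    tune.bufsize 32768  # was 1638400, i.e. back to 2x haproxy's 16KB default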
Moved to "in progress" as we've decided to remove the "AT" statuses. @dianabarsan Please work with teammates until you're confident to merge this.
@dianabarsan - I see @garethbowen approved the PR last week, but there's a QA ("assistance"!! ; ) request in the prior comment. Lemme know if you need a hand with that :hand:
hey @mrjones-plip Thanks for jumping in! I'm very close to confident about merging this already. Do you have any ideas on something specific to test?
Go ahead!
Awesome! I'll work on this today and hopefully have something by the time you come back online tomorrow :crossed_fingers:
tl;dr - I think the improvements are great! We're now much more resilient against HAProxy going down and serving 500s.
I had trouble reliably reproducing this with an unconstrained container on a 32GB bare-metal machine. HA proxy would float up to about 16GB of RAM use.
When I constrained the container to 6GB RAM, it was no problem to reproduce. Here are the haproxy changes I made in the compose file:
haproxy:
  deploy:
    resources:
      limits:
        memory: 6G
In testing I always used the same command:
ab -s 300 -k -c 600 -n 800 https://INSTANCE_URL/api/v2/monitoring
While having a concurrency of 600 across 800 requests to our monitoring API is totally unreasonable, we should never irrecoverably crash.
Additionally, I used lazydocker and followed the logs of the haproxy instance with docker logs -f HA_PROXY_CONTAINER.
Every single time I ran this command against my 4.1.0 instance it crashed in under 10 seconds. Every single time, all day every day:
https://github.com/medic/cht-core/assets/8253488/bacb4baa-766e-4fe1-837a-53aec148872e
While hitting the API would result in an HTTP 200, the stats were all -1 values. Further, all requests to /medic and the like were 500. The only fix was to restart the HA Proxy service.
Using 4.1.0-8166-new-haproxy, I did see failure rates as high as 85%, and I also saw them as low as 10%. Most importantly, HA Proxy was rock solid and NEVER crashed:
https://github.com/medic/cht-core/assets/8253488/18ea218c-e65d-4ae3-a953-4c388da829be
Despite the errors, both the API and the web GUI (including /medic) were responsive before and after the test calls.
Merged to master.
Describe the bug
I started a scalability test with 100 concurrent users, and haproxy crashed while the test was running. The behavior was that:
To Reproduce
Steps to reproduce the behavior:
master

Expected behavior
Haproxy should not crash.
Logs
Beginning of haproxy log, to check that we're indeed on the latest config:
API Error:
Haproxy end of log:
Output of docker ps:

Environment
Additional context
Manually restarting the haproxy container fixes the issue.
I don't have any insight into what the underlying issue is; I'll keep running tests to see how frequently this occurs. In the worst case scenario, we should find a way to restart the container when the haproxy process gets killed.
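One way to do that would be a Docker restart policy on the haproxy service; a minimal sketch in compose terms (assuming we don't already set one), which only papers over the crash rather than fixing the memory usage:
haproxy:
  # Hypothetical stopgap: have Docker bring the container back automatically
  # when the haproxy master exits after its worker is killed (exit code 137).
  restart: on-failure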
It looks like this is being caused by Haproxy allocating resources for each opened connection; those resources do not get freed, which ends up bloating haproxy memory usage to 17GB, more than the test instance has allocated in total.