medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0

Haproxy crashes instance during scalability test due to high memory usage #8166

Closed dianabarsan closed 1 year ago

dianabarsan commented 1 year ago

Describe the bug
I started a scalability test with 100 concurrent users, and haproxy crashed while the test was running.

To Reproduce
Steps to reproduce the behavior:

  1. Start scalability test against current master
  2. Watch as scalability test ends with errors

Expected behavior
Haproxy should not crash.

Logs
Beginning of haproxy log - to check that we're indeed on the latest config:

Starting enhanced syslogd: rsyslogd.
global
  maxconn 150000
  spread-checks 5
  lua-load-per-thread /usr/local/etc/haproxy/parse_basic.lua
  lua-load-per-thread /usr/local/etc/haproxy/parse_cookie.lua
  lua-load-per-thread /usr/local/etc/haproxy/replace_password.lua
  log stdout len 65535 local2 debug
  tune.bufsize 1638400
  tune.http.maxhdr 1010

API Error:

REQ 20d65ffb-3c6a-4290-af76-5bd28d9a5cf3 35.178.234.114 ac2 GET /medic/_changes?style=all_docs&heartbeat=10000&since=0&limit=100 HTTP/1.0
2023-04-04 14:18:54 ERROR: Server error: RequestError: Error: connect ECONNREFUSED 10.0.1.11:5984
    at new RequestError (/api/node_modules/request-promise-core/lib/errors.js:14:15)
    at Request.plumbing.callback (/api/node_modules/request-promise-core/lib/plumbing.js:87:29)
    at Request.RP$callback [as _callback] (/api/node_modules/request-promise-core/lib/plumbing.js:46:31)
    at self.callback (/api/node_modules/request/request.js:185:22)
    at Request.emit (node:events:513:28)
    at Request.onRequestError (/api/node_modules/request/request.js:877:8)
    at ClientRequest.emit (node:events:513:28)
    at Socket.socketErrorListener (node:_http_client:494:9)
    at Socket.emit (node:events:513:28)
    at emitErrorNT (node:internal/streams/destroy:157:8) {
  name: 'RequestError',
  message: 'Error: connect ECONNREFUSED 10.0.1.11:5984',
  cause: Error: connect ECONNREFUSED 10.0.1.11:5984
      at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16) {
    [stack]: 'Error: connect ECONNREFUSED 10.0.1.11:5984\n' +
      '    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16)',
    [message]: 'connect ECONNREFUSED 10.0.1.11:5984',
    errno: -111,
    code: 'ECONNREFUSED',
    syscall: 'connect',
    address: '10.0.1.11',
    port: 5984
  },
  error: Error: connect ECONNREFUSED 10.0.1.11:5984
      at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16) {
    [stack]: 'Error: connect ECONNREFUSED 10.0.1.11:5984\n' +
      '    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16)',
    [message]: 'connect ECONNREFUSED 10.0.1.11:5984',
    errno: -111,
    code: 'ECONNREFUSED',
    syscall: 'connect',
    address: '10.0.1.11',
    port: 5984
  },
} 
RES 20d65ffb-3c6a-4290-af76-5bd28d9a5cf3 35.178.234.114 ac2 GET /medic/_changes?style=all_docs&heartbeat=10000&since=0&limit=100 HTTP/1.0 500 - 1.629 ms

Haproxy end of log:

<150>Apr  4 13:30:20 haproxy[25]: 10.0.1.3,couchdb-3.local,200,1743,6680,0,GET,/medic/org.couchdb.user%3Aac2?,api,medic,'-',495,1546,232,'node-fetch/1.0 (+https://github.com/bitinn/node-fetch)'
<150>Apr  4 13:30:21 haproxy[25]: 10.0.1.3,couchdb-2.local,200,2081,7009,0,GET,/_users/org.couchdb.user%3Aac2?,api,medic,'-',576,2080,313,'node-fetch/1.0 (+https://github.com/bitinn/node-fetch)'
[NOTICE]   (1) : haproxy version is 2.6.12-f588462
[NOTICE]   (1) : path to executable is /usr/local/sbin/haproxy
[ALERT]    (1) : Current worker (25) exited with code 137 (Killed)
[ALERT]    (1) : exit-on-failure: killing every processes with SIGTERM

Output of docker ps:

CONTAINER ID   IMAGE                                                         COMMAND                   CREATED       STATUS             PORTS                                                                      NAMES
0b2e118f87cc   public.ecr.aws/s5s3h4s7/cht-nginx:4.1.0-alpha                 "/docker-entrypoint.…"    5 hours ago   Up About an hour   0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp   compose_nginx_1
2f166344d02a   public.ecr.aws/s5s3h4s7/cht-api:4.1.0-alpha                   "/bin/bash /api/dock…"    5 hours ago   Up About an hour   5988/tcp                                                                   compose_api_1
bd47151e7926   public.ecr.aws/s5s3h4s7/cht-sentinel:4.1.0-alpha              "/bin/bash /sentinel…"    5 hours ago   Up About an hour                                                                              compose_sentinel_1
27959cce073d   public.ecr.aws/medic/cht-haproxy:4.1.0-alpha                  "/entrypoint.sh hapr…"    5 hours ago   Up About an hour   5984/tcp                                                                   compose_haproxy_1
53f9ec2c028a   public.ecr.aws/s5s3h4s7/cht-haproxy-healthcheck:4.1.0-alpha   "/bin/sh -c \"/app/ch…"   5 hours ago   Up About an hour                                                                              compose_healthcheck_1

Environment

Additional context
Manually restarting the haproxy container fixes the issue.

I don't have any insight into the underlying issue yet; I'll keep running tests to see how frequently this occurs. In the worst case, we should find a way to signal the container to restart when the process is terminated.

It looks like this is caused by Haproxy allocating resources for each opened connection and never freeing them, which bloats haproxy memory usage to 17GB, more than the test instance has in total.
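As a back-of-the-envelope check on the scale involved (a rough estimate, not a measurement): taking maxconn and tune.bufsize from the global section logged above, and assuming haproxy can hold on the order of two bufsize buffers per established connection, the theoretical ceiling is far beyond anything the instance has:

# rough worst-case estimate; the two-buffers-per-connection factor is an assumption
maxconn = 150_000              # from the logged config
bufsize = 1_638_400            # tune.bufsize in bytes, from the logged config
print(f"~{maxconn * bufsize * 2 / 1024**3:.0f} GiB theoretical ceiling")  # prints ~458 GiB

So even a small fraction of connections actually allocating buffers is enough to blow past what the test instance has in total.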

dianabarsan commented 1 year ago

This just happened again:

<149>Apr  4 14:58:20 haproxy[25]: Server couchdb-servers/couchdb-1.local is UP, reason: Layer7 check passed, code: 0, info: "via agent : up", check duration: 3ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING]  (25) : Server couchdb-servers/couchdb-1.local is UP, reason: Layer7 check passed, code: 0, info: "via agent : up", check duration: 3ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[NOTICE]   (1) : haproxy version is 2.6.12-f588462
[NOTICE]   (1) : path to executable is /usr/local/sbin/haproxy
[ALERT]    (1) : Current worker (25) exited with code 137 (Killed)
[ALERT]    (1) : exit-on-failure: killing every processes with SIGTERM
dianabarsan commented 1 year ago

This user seems to report the same issue: https://github.com/haproxy/haproxy/issues/1984, and it seems to affect the same haproxy version that we are using (2.6.x). The suggestion is to upgrade to 2.7 and see if this occurs again. I'll try that.

dianabarsan commented 1 year ago

Even with 2.7, I'm still seeing very high memory usage: image

This has been reported multiple times. This seems very relevant to us: https://github.com/docker-library/haproxy/issues/194

CPU usage spikes, memory seems constant.

dianabarsan commented 1 year ago

The more I look into it, the more this seems to be a memory leak. I'll update the reproduction steps and issue description once I'm certain.

dianabarsan commented 1 year ago

It looks like memory usage plateaus at 17.56GB. This was tested locally, since the scalability test instance only has 15GB of RAM; haproxy exceeding that is probably what makes the instance unresponsive (monitoring metrics pending).

image

This issue appears very relevant: https://github.com/docker-library/haproxy/issues/194

Quoting https://github.com/docker-library/haproxy/issues/194#issuecomment-1452758721:

I think the root cause is HAProxy allocating resources for each connection, up to the maximum, and deriving that maximum (maxconn) from the (very high) kernel default file descriptor limit, which is the effective limit when the container runtime file limit is infinity.
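For illustration, one way to bound this at the container level (not necessarily the route we'd take here) is a compose-level ulimit, so the container doesn't inherit an effectively unlimited nofile for haproxy to derive its automatic maxconn from. The service name and numbers below are placeholders:

  haproxy:
    ulimits:
      nofile:
        soft: 65536     # placeholder values; caps the fd limit the container inherits
        hard: 65536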

dianabarsan commented 1 year ago

I've traced the high memory usage to a config option that was added at some point in the archv3 work: https://github.com/medic/cht-core/blob/master/haproxy/default_frontend.cfg#L13-L14

Without these settings, memory usage plateaus at ~2GB. I'm verifying this by running the e2e tests and the scalability suite against the change.
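For context, the settings in question appear to be the two tune lines visible in the global section logged at the top of this issue (mapping the linked L13-L14 onto exactly these lines is my assumption):

  tune.bufsize 1638400     # as shown in the logged global section above
  tune.http.maxhdr 1010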

dianabarsan commented 1 year ago

The solution to this was:

  1. Removed rsyslog, because:
    • it prevented container restart when the process was killed by OOM
    • on a forceful restart, it would crash the container with:
      [NOTICE]   (1) : haproxy version is 2.6.13-234aa6d
      [NOTICE]   (1) : path to executable is /usr/local/sbin/haproxy
      [WARNING]  (1) : Exiting Master process...
      [WARNING]  (20) : Proxy http-in stopped (cumulated conns: FE: 74, BE: 0).
      [WARNING]  (20) : Proxy couchdb-servers stopped (cumulated conns: FE: 0, BE: 74).
      [WARNING]  (20) : Proxy <HTTPCLIENT> stopped (cumulated conns: FE: 0, BE: 0).
      rsyslogd: pidfile '/run/rsyslogd.pid' and pid 14 already exist.
      If you want to run multiple instances of rsyslog, you need to specify
      different pid files for them (-i option).
      rsyslogd: run failed with error -3000 (see rsyslog.h or try https://www.rsyslog.com/e/3000 to learn what that number means)
      [alert] 122/114035 (1) : parseBasic loaded
      [alert] 122/114035 (1) : parseCookie loaded
      [alert] 122/114035 (1) : replacePassword loaded
      [alert] 122/114035 (1) : parseBasic loaded
      [alert] 122/114035 (1) : parseCookie loaded
      [alert] 122/114035 (1) : replacePassword loaded
      [NOTICE]   (1) : New worker (20) forked
      [NOTICE]   (1) : Loading success.
      [NOTICE]   (1) : haproxy version is 2.6.13-234aa6d
      [NOTICE]   (1) : path to executable is /usr/local/sbin/haproxy
      [WARNING]  (1) : Exiting Master process...
      [WARNING]  (20) : Proxy http-in stopped (cumulated conns: FE: 114, BE: 0).
      [WARNING]  (20) : Proxy couchdb-servers stopped (cumulated conns: FE: 0, BE: 114).
      [WARNING]  (20) : Proxy <HTTPCLIENT> stopped (cumulated conns: FE: 0, BE: 0).
  2. Lowered tune.bufsize from 1638400 to 32768
  3. Reduced maxconn from 150000 to 60000
  4. Added a docker limit of 1GB of memory per process (see the sketch after this list).
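Roughly, the adjusted settings look like this (a sketch; the exact file layout in the repo may differ, and the memory cap is shown in the compose deploy form):

global
  maxconn 60000          # down from 150000
  tune.bufsize 32768     # down from 1638400

and in the compose file:

  haproxy:
    deploy:
      resources:
        limits:
          memory: 1G     # 1GB memory cap per item 4 above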
garethbowen commented 1 year ago

Moved to "in progress" as we've decided to remove the "AT" statuses. @dianabarsan Please work with teammates until you're confident to merge this.

mrjones-plip commented 1 year ago

@dianabarsan - I see @garethbowen approved the PR last week, but there's a QA ("assistance"!! ; ) request in the prior comment. Lemme know if you need a hand with that :hand:

dianabarsan commented 1 year ago

hey @mrjones-plip, thanks for jumping in! I'm already very close to confident about merging this. Do you have any ideas for something specific to test?

mrjones-plip commented 1 year ago

If you're good, I'm good!

Otherwise, my idea was nothing specific: I was gonna take a more black-box approach and throw ab or wrk at it before and after the fix to see how it behaves. I was also thinking of doing my own exploratory reading about tune.bufsize and maxconn.

dianabarsan commented 1 year ago

Go ahead!

mrjones-plip commented 1 year ago

Awesome! I'll work on this today and hopefully have something by the time you come back online tomorrow :crossed_fingers:

mrjones-plip commented 1 year ago

tl;dr - I think the improvements are great! We're now much more resilient against HA going down and serving 500s.


I had trouble reliably reproducing this with an unconstrained container on a 32GB bare-metal machine. HA proxy would float up to about 16GB of RAM use.

When I constrained the container to 6GB of RAM, it was no problem to reproduce. Here are the haproxy changes I made in the compose file:

  haproxy:
    deploy:
      resources:
        limits:
          memory: 6G

In testing I always used the same command:

ab -s 300 -k -c 600 -n 800 https://INSTANCE_URL/api/v2/monitoring
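(For reference, the ab flags: -s 300 is the per-request timeout in seconds, -k enables HTTP keep-alive, -c 600 is the concurrency level, and -n 800 is the total number of requests.)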

While having a concurrency of 600 across 800 connections to our monitoring API is totally unreasonable, we should never irrecoverably crash.

Additionally, I used lazydocker and then followed the logs of the haproxy container with docker logs -f HA_PROXY_CONTAINER.

Every single time I ran this command against my 4.1.0 instance, it crashed in under 10 seconds. Every single time, all day every day:

https://github.com/medic/cht-core/assets/8253488/bacb4baa-766e-4fe1-837a-53aec148872e

While hitting the API would return an HTTP 200, the stats were all -1 values. Further, all requests to /medic and the like returned 500s. The only fix was to restart the HA Proxy service.

Using 4.1.0-8166-new-haproxy, I did see failure rates as high as 85% and as low as 10%. Most importantly, HA Proxy was rock solid and NEVER crashed:

https://github.com/medic/cht-core/assets/8253488/18ea218c-e65d-4ae3-a953-4c388da829be

Despite the errors, both the API and the web GUI (including /medic) were responsive before and after the test calls.

dianabarsan commented 1 year ago

Merged to master.