stevenwaterman opened 7 months ago
Any chance to update to the latest version (http://www.haproxy.org/bugs/bugs-2.4.22.html) or to the 2.8 version and see if the problem still exists?
Yes, we actually already started working on updating to see if it would help. We'll update and re-release and let you know what happens. Probably on Monday. Thanks!
We defined additional servers to reduce the number of active sessions per server. This looked promising but once we rolled out the new WS again, CPU usage went up.
Commented out our `stick-table` and all associated lines of config. This immediately resolved the problem, dropping CPU down to around 30%, and it stayed there. We're not sure why that had any impact, since the new websocket isn't included in the rate limiting.
We then started to get errors after hitting 100k active connections on one backend, which we assume has nothing to do with haproxy and is instead some missing tweak on the application server, e.g. file descriptors.
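As an aside, a quick way to confirm whether file descriptors are the limit on the application host is something like the sketch below. The process name (assuming the backend is the Erlang VM, given the Phoenix websockets mentioned later in the report), the systemd unit name, and the limit value are all assumptions, not details from this setup.

```
# Number of file descriptors currently open by the app process
ls /proc/$(pgrep -o beam.smp)/fd | wc -l

# The per-process limit it is actually running with
grep "open files" /proc/$(pgrep -o beam.smp)/limits

# Raising the limit for a systemd-managed service (illustrative unit and value):
# /etc/systemd/system/myapp.service.d/limits.conf
#   [Service]
#   LimitNOFILE=200000
```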
But yeah, I'd be interested if you have any ideas why the stick-table lines in the config could cause such a huge increase in CPU usage from relatively few additional connections.
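For reference, rate limiting with a stick-table usually looks something like this minimal sketch; the frontend name, table size, window, and threshold are illustrative assumptions, not the config from this report:

```
frontend fe_main
    bind :443 ssl crt /etc/haproxy/certs/example.pem

    # One entry per client IP, counting the HTTP request rate over a 10s window
    stick-table type ip size 1m expire 10m store http_req_rate(10s)
    http-request track-sc0 src

    # Reject clients above an illustrative threshold
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 100 }
```

Each request matched by the `track-sc0` rule costs a table lookup/update, and entries are synchronised between nodes when a peers section is configured; nothing in this thread pins down whether that is actually what was burning the CPU here.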
Just confirming, I see this too with pfSense/haproxy (since around 2.4-ish; I'm on 24.03 on multiple servers, and it gets crazy after a while), which is why I have to spread the load over multiple boxes. I also had to stop the HA clustering, since that just makes it worse.
I will try the stick-table changes. Just wondering before I do that: are there any other findings from later on that might be helpful?
Guys, do you have any new input on this issue? For me it is not clear whether the issue is with the websockets or the stick-tables. It seems to be related to the stick-tables, but I'm not sure.
Detailed Description of the Problem
We recently updated our app to add an extra websocket connection to our backend per client. Over the next hour, the CPU usage on our haproxy server increased from around 30% to 85%. After a few hours, it increased even more to 100%, causing an outage.
CPU Usage graph (rolled out around 14:20)
Here's a graph of the rate of incoming WS connections. I've cut off the Y axis at 20k/min because it goes way higher after restarts/deployments.
Data transferred during that time
Active sessions
Expected Behavior
We expected the CPU usage to increase only a small amount, e.g. from 30% to 40%. Even that seems excessive, since not all our traffic is websockets, and the new websockets don't actually do anything.
Steps to Reproduce the Behavior
It won't be easy for you to reproduce. We can reproduce it quite reliably by releasing this code and having clients start sending the requests, but we can't just try random things, because haproxy is actively serving our customers.
Do you have any idea what may have caused this?
A few things we've considered:
- `PING`/`PONG` traffic that wouldn't show up in logs. But from what we can see when testing it ourselves, the pings are sent every 25s as intended. (Pings are sent by the server.)
- Adjusting `maxconn`, but this didn't do anything.
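For anyone following along, `maxconn` exists at several levels; a minimal sketch with made-up values:

```
global
    maxconn 200000        # process-wide connection ceiling

defaults
    maxconn 150000        # default per-frontend limit

backend be_app
    server app1 10.0.0.11:4000 maxconn 50000   # per-server limit; excess requests queue
```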
Do you have an idea how to solve the issue?
No response
What is your configuration?
Output of `haproxy -vv`
Last Outputs and Backtraces
No response
Additional Information
The new connection is a WebSocket that sends 1 message on startup ("here is my token, please authenticate") and gets one response. The server sends `PING`s every 25s and the client responds with `PONG`. No other traffic happens.
The average connection time reported by haproxy also increases every time we try to roll it out.
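One bit of context on the 25s pings: once haproxy sees the Upgrade to a WebSocket, the connection is governed by `timeout tunnel` rather than the usual client/server timeouts, so the keepalive interval just has to stay below whichever timeout applies. Illustrative values only:

```
defaults
    timeout client  30s
    timeout server  30s
    timeout tunnel  1h    # used once the connection has been upgraded (WebSocket/CONNECT)
```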
We ran the haproxy profiler twice. After rollout, with high CPU:
After revert, with lower CPU. CPU takes some time to go back down because existing clients will continue to open the WS connection. This was a few hours after reverting, with CPU back down around 30%:
We noticed the huge difference in `process_stream` and especially `other`, but can't find any documentation to help us track down what exactly is slow.
After reverting, the CPU usage doesn't decrease smoothly. It's blocky, e.g. dropping from 55% to 35% within a minute, hours after the revert.
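For reference, the profiler output discussed above can be collected over haproxy's runtime API along these lines; the stats socket path is an assumption:

```
# Enable per-task CPU profiling (can also be set permanently with
# "profiling.tasks on" in the global section)
echo "set profiling tasks on" | socat stdio /run/haproxy/admin.sock

# Per-function accounting, including process_stream and the "other" bucket
echo "show profiling" | socat stdio /run/haproxy/admin.sock

# Tasks currently running, aggregated by function (available since 2.4)
echo "show tasks" | socat stdio /run/haproxy/admin.sock
```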
And for proof that these websockets are causing the high CPU, we:
Reading all this back, it really does just look like a websocket that is extremely computationally heavy, transferring a ton of data, etc. But I feel the need to stress: it's doing almost nothing. The client opens a connection, sends one message like `[0,"POST","/session/renew",{"token":"eyJhbGciOiJIUzI1NiIsInR5cCI6...` and gets one response like `[0,200,{"user":{"id":"e21b106b-fd2b-459a-a5b0-3886b5f880fa","name":"Visitor...`. And then the WS just sits there, open. In that same time, the two Phoenix websockets have sent 40 messages.
You can see this for yourself if you go to www.talkjs.com and look at the dev tools network tab for websockets. The one with the URL starting `wss://api.talkjs.com/v1/efdOVFl/realtime` is the new one.