stevenwaterman opened 7 months ago
Any chance to update to the latest version (http://www.haproxy.org/bugs/bugs-2.4.22.html) or to the 2.8 version and see if the problem still exists?
Yes, we actually already started working on updating to see if it would help. We'll update and re-release and let you know what happens. Probably on Monday. Thanks!
We defined additional servers to reduce the number of active sessions per server. This looked promising but once we rolled out the new WS again, CPU usage went up.
Commented out our `stick-table` and all associated lines of config. This immediately resolved the problem, dropping CPU down to around 30%, and it stayed there. We're not sure why that had any impact, since the new websocket isn't included in the rate limiting.
We then started to get errors after hitting 100k active connections on one backend, which we assume has nothing to do with haproxy and is instead some missing tweak on the application server, e.g. file descriptors.
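As an aside, a quick way to confirm whether file descriptors are the limit on the application host is something like the sketch below. The process name (assuming the backend is the Erlang VM, given the Phoenix websockets mentioned later in the report), the systemd unit name, and the limit value are all assumptions, not details from this setup.

```
# Number of file descriptors currently open by the app process
ls /proc/$(pgrep -o beam.smp)/fd | wc -l

# The per-process limit it is actually running with
grep "open files" /proc/$(pgrep -o beam.smp)/limits

# Raising the limit for a systemd-managed service (illustrative unit and value):
# /etc/systemd/system/myapp.service.d/limits.conf
#   [Service]
#   LimitNOFILE=200000
```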
But yeah, I'd be interested if you have any ideas why the stick-table lines in the config could cause such a huge increase in CPU usage from relatively few additional connections.
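For reference, rate limiting with a stick-table usually looks something like this minimal sketch; the frontend name, table size, window, and threshold are illustrative assumptions, not the config from this report:

```
frontend fe_main
    bind :443 ssl crt /etc/haproxy/certs/example.pem

    # One entry per client IP, counting the HTTP request rate over a 10s window
    stick-table type ip size 1m expire 10m store http_req_rate(10s)
    http-request track-sc0 src

    # Reject clients above an illustrative threshold
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 100 }
```

Each request matched by the `track-sc0` rule costs a table lookup/update, and entries are synchronised between nodes when a peers section is configured; nothing in this thread pins down whether that is actually what was burning the CPU here.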
Just confirming, I see this too with pfSense/haproxy (since around 2.4-ish; I'm on 24.03 on multiple servers, and it gets crazy after a while), which is why I have to spread the load over multiple boxes. I also had to stop the HA clustering, since that just makes it worse.
I will try the stick-table changes. Just wondering before I do that: are there any other findings from later on that might be helpful?
Guys, do you have any new input on this issue? For me it is not clear whether the issue is with the websockets or the stick-tables. It seems to be related to the stick-tables, but I'm not sure.
Detailed Description of the Problem
We recently updated our app to add an extra websocket connection to our backend per client. Over the next hour, the CPU usage on our haproxy server increased from around 30% to 85%. After a few hours, it increased even more to 100%, causing an outage.
CPU Usage graph (rolled out around 14:20)
Here's a graph of the rate of incoming WS connections. I've cut off the Y axis at 20k/min because it goes way higher after restarts/deployments.
Data transferred during that time
Active sessions
Expected Behavior
We expected the CPU usage to increase only a small amount, e.g. from 30% to 40%. Even that seems excessive, since not all our traffic is websockets, and the new websockets don't actually do anything.
Steps to Reproduce the Behavior
It won't be easy for you to reproduce. We can reproduce it quite reliably by releasing this code and having clients start sending the requests, but we can't just try random things, because haproxy is actively serving our customers.
Do you have any idea what may have caused this?
A few things we've considered:
- `PING`/`PONG` traffic that wouldn't show up in logs. But from what we can see when testing it ourselves, the pings are sent every 25s as intended. (Pings are sent by the server.)
- Adjusting `maxconn`, but this didn't do anything.
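For anyone following along, `maxconn` exists at several levels; a minimal sketch with made-up values:

```
global
    maxconn 200000        # process-wide connection ceiling

defaults
    maxconn 150000        # default per-frontend limit

backend be_app
    server app1 10.0.0.11:4000 maxconn 50000   # per-server limit; excess requests queue
```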
Do you have an idea how to solve the issue?
No response
What is your configuration?
Output of `haproxy -vv`
Last Outputs and Backtraces
No response
Additional Information
The new connection is a WebSocket that sends 1 message on startup ("here is my token, please authenticate") and gets one response. The server sends `PING`s every 25s and the client responds with `PONG`. No other traffic happens.
The average connection time reported by haproxy also increases every time we try to roll it out.
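One bit of context on the 25s pings: once haproxy sees the Upgrade to a WebSocket, the connection is governed by `timeout tunnel` rather than the usual client/server timeouts, so the keepalive interval just has to stay below whichever timeout applies. Illustrative values only:

```
defaults
    timeout client  30s
    timeout server  30s
    timeout tunnel  1h    # used once the connection has been upgraded (WebSocket/CONNECT)
```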
We ran the haproxy profiler twice. After rollout, with high CPU:
After revert, with lower CPU. CPU takes some time to go back down because existing clients will continue to open the WS connection. This was a few hours after reverting, with CPU back down around 30%:
We noticed the huge difference in `process_stream` and especially `other`, but can't find any documentation to help us track down what exactly is slow.
After reverting, the CPU usage doesn't decrease smoothly. It's blocky, e.g. dropping from 55% to 35% within a minute, hours after the revert.
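For reference, the profiler output discussed above can be collected over haproxy's runtime API along these lines; the stats socket path is an assumption:

```
# Enable per-task CPU profiling (can also be set permanently with
# "profiling.tasks on" in the global section)
echo "set profiling tasks on" | socat stdio /run/haproxy/admin.sock

# Per-function accounting, including process_stream and the "other" bucket
echo "show profiling" | socat stdio /run/haproxy/admin.sock

# Tasks currently running, aggregated by function (available since 2.4)
echo "show tasks" | socat stdio /run/haproxy/admin.sock
```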
And for proof that these websockets are causing the high CPU, we:
Reading all this back, it really does just look like a websocket that is extremely computationally heavy, transferring a ton of data, etc. But I feel the need to stress: it's doing almost nothing. The client opens a connection, sends one message like `[0,"POST","/session/renew",{"token":"eyJhbGciOiJIUzI1NiIsInR5cCI6...` and gets one response like `[0,200,{"user":{"id":"e21b106b-fd2b-459a-a5b0-3886b5f880fa","name":"Visitor...`. And then the WS just sits there, open. In that same time, the two Phoenix websockets have sent 40 messages.
You can see this for yourself if you go to www.talkjs.com and look at the dev tools network tab for websockets. The one with the URL starting `wss://api.talkjs.com/v1/efdOVFl/realtime` is the new one.