Closed grepory closed 6 years ago
As I was working on another issue today, I realized that the path from session to keepalives is much shorter:
Session -> MessageBus (TopicKeepalive) -> Keepalived.
So there's no need to look into eventd right now. Something is preventing messages from getting from the Session to keepalived, but I'm not sure what.
It's worth noting that in this trace, 1139 entities were registered. Once the following logs appear on the backend, entities stop registering. However, after the agent-bench processes are killed, entities do deregister themselves.
Mar 20 20:57:43 load-test-backend sensu-backend[20881]: {"component":"agentd","level":"warning","msg":"stopping session \u003c%s\u003e: recv error: 696d2136-04c7-4128-bc80-3a3b48fd38f4Connection error: websocket: close 1006 (abnormal closure): unexpected EOF","time":"2018-03-20T20:57:43Z"}
Mar 20 20:57:43 load-test-backend sensu-backend[20881]: {"component":"agentd","level":"info","msg":"session disconnected - stopping recvPump","time":"2018-03-20T20:57:43Z"}
Mar 20 20:57:43 load-test-backend sensu-backend[20881]: {"component":"agentd","level":"info","msg":"shutting down - stopping sendPump","time":"2018-03-20T20:57:43Z"}
Mar 20 20:57:43 load-test-backend sensu-backend[20881]: {"component":"agentd","level":"info","msg":"shutting down - stopping subPump","time":"2018-03-20T20:57:43Z"}
Mar 20 20:57:43 load-test-backend sensu-backend[20881]: {"component":"agentd","level":"debug","msg":"Unsubscribing from topic \"sensu:check:default:default:default\"","time":"2018-03-20T20:57:43Z"}
Mar 20 20:57:43 load-test-backend sensu-backend[20881]: {"component":"agentd","level":"debug","msg":"Unsubscribing from topic \"sensu:check:default:default:entity:6b5d1424-72ea-4b60-b823-9beeb3d27566\"","time":"2018-03-20T20:57:43Z"}
...
Mar 20 20:58:09 load-test-backend sensu-backend[20881]: {"component":"agentd","level":"error","msg":"transport error on websocket upgrade: websocket: client sent data before handshake is complete","time":"2018-03-20T20:58:09Z"}
Mar 20 20:58:09 load-test-backend sensu-backend[20881]: {"component":"etcd","level":"info","msg":"http: response.WriteHeader on hijacked connection\n","pkg":"","time":"2018-03-20T20:58:09Z"}
Mar 20 20:58:09 load-test-backend sensu-backend[20881]: {"component":"etcd","level":"info","msg":"http: response.Write on hijacked connection\n","pkg":"","time":"2018-03-20T20:58:09Z"}
Oh! This trace is incredibly helpful. I think what’s happening is that we take read locks in the message bus that then never go away. We just need a better mechanism for managing subscribers.
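To make the suspected failure mode concrete, here is a minimal Go sketch of one way a bus could manage subscribers so that a stalled or disconnected session cannot keep a read lock held. This is not the actual sensu-backend message bus: `Bus`, `NewBus`, `Subscribe`, `Unsubscribe`, `Publish`, and the topic string used below are all hypothetical names chosen for illustration. The idea is that `Publish` holds the read lock only for a non-blocking send, so a full subscriber buffer drops the message instead of blocking while the lock is held.

```go
package main

import "sync"

// Bus is a hypothetical, simplified topic-based message bus.
// It is an illustration only, not the sensu-backend implementation.
type Bus struct {
	mu   sync.RWMutex
	subs map[string]map[string]chan interface{} // topic -> subscriber ID -> channel
}

// NewBus returns an empty bus.
func NewBus() *Bus {
	return &Bus{subs: make(map[string]map[string]chan interface{})}
}

// Subscribe registers a buffered channel for subscriber id on topic.
func (b *Bus) Subscribe(topic, id string, buf int) <-chan interface{} {
	ch := make(chan interface{}, buf)
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.subs[topic] == nil {
		b.subs[topic] = make(map[string]chan interface{})
	}
	b.subs[topic][id] = ch
	return ch
}

// Unsubscribe removes subscriber id from topic and closes its channel.
// It takes the write lock, so it can never race with a send in Publish.
func (b *Bus) Unsubscribe(topic, id string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if ch, ok := b.subs[topic][id]; ok {
		delete(b.subs[topic], id)
		close(ch)
	}
}

// Publish delivers msg to every subscriber of topic without blocking.
// A subscriber whose buffer is full is skipped and counted in dropped,
// so the read lock is always released promptly.
func (b *Bus) Publish(topic string, msg interface{}) (dropped int) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, ch := range b.subs[topic] {
		select {
		case ch <- msg:
		default:
			dropped++
		}
	}
	return dropped
}
```

In this sketch a session would call `Subscribe` when it connects and `Unsubscribe` when its pumps stop. Dropping on a full buffer is a deliberate trade-off assumed here: it favors keeping the bus responsive over guaranteed delivery, which may or may not be acceptable for keepalives.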
Throwing this back a column after discussion with @grepory about potential refactor.
Oops, I closed this out as #1264 the other day. Fixed!
In load testing last week, I observed that after a certain level of load (1000-3000 connected agents on a 4 CPU, 4 GB VM), keepalived started deregistering agents and stopped registering new ones.
This could be any of the following interactions
One thing of note: there are no sensu-backend log messages for keepalived.