sensu / sensu-go

Simple. Scalable. Multi-cloud monitoring.
https://sensu.io
MIT License

Under load keepalived stops receiving keepalives #1182

Closed · grepory closed this issue 6 years ago

grepory commented 6 years ago

In load testing last week, I observed that past a certain level of load (1,000-3,000 connected agents on a 4-CPU, 4 GB VM), keepalived started deregistering agents and stopped registering new ones.

This could be caused by any of the interactions along the keepalive path.

One thing of note: there are no sensu-backend log messages for keepalived.
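(For context: keepalived treats each agent keepalive as a liveness signal and deregisters an entity whose keepalives stop arriving within a timeout, which is why losing keepalive delivery looks like mass deregistration. A minimal sketch of that idea; the names `keepaliveWatcher`, `lastSeen`, and `ttl` are illustrative assumptions, not sensu-go's actual keepalived code.)

```go
// Illustrative sketch only; type and field names are assumptions, not
// sensu-go's actual keepalived implementation.
package keepalived

import (
	"log"
	"sync"
	"time"
)

type keepaliveWatcher struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time // entity ID -> time of last keepalive
	ttl      time.Duration
}

// Keepalive records that an entity checked in.
func (w *keepaliveWatcher) Keepalive(entityID string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.lastSeen[entityID] = time.Now()
}

// sweep deregisters any entity that has been silent for longer than the TTL.
// If keepalive messages never reach this component, every entity eventually
// looks silent, even though its agent is still connected.
func (w *keepaliveWatcher) sweep() {
	w.mu.Lock()
	defer w.mu.Unlock()
	now := time.Now()
	for id, seen := range w.lastSeen {
		if now.Sub(seen) > w.ttl {
			delete(w.lastSeen, id)
			log.Printf("deregistering entity %s", id) // stand-in for real deregistration
		}
	}
}
```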

grepory commented 6 years ago

As I was working on another issue today, I realized that the path from session to keepalives is much shorter:

Session -> MessageBus (TopicKeepalive) -> Keepalived.

So there's no need to look into eventd right now. Something is preventing messages from reaching keepalived from the Session, though I'm not sure what it is yet.
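(To make that path concrete, here is a rough sketch of the hop. The bus interface, topic constant, and type names below are simplified assumptions for illustration, not the real sensu-go messaging API.)

```go
// Simplified sketch of Session -> MessageBus(TopicKeepalive) -> Keepalived.
// Interface and names are assumptions for illustration.
package agentd

const topicKeepalive = "sensu:keepalives"

// messageBus is a minimal stand-in for the backend's pub/sub bus.
type messageBus interface {
	Publish(topic string, msg interface{}) error
	Subscribe(topic, consumer string, ch chan interface{}) error
}

// session stands in for the per-agent websocket session on the backend.
type session struct {
	bus messageBus
}

// handleKeepalive forwards a keepalive received from the agent directly
// onto the keepalive topic; eventd is not on this path.
func (s *session) handleKeepalive(event interface{}) error {
	return s.bus.Publish(topicKeepalive, event)
}

// startKeepalived subscribes the keepalived consumer to the same topic.
// If publishes stall or messages are dropped inside the bus, keepalived
// simply stops seeing keepalives even though sessions are still alive.
func startKeepalived(bus messageBus) (chan interface{}, error) {
	ch := make(chan interface{}, 100)
	if err := bus.Subscribe(topicKeepalive, "keepalived", ch); err != nil {
		return nil, err
	}
	return ch, nil
}
```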

nikkictl commented 6 years ago

It's worth noting that in this trace, 1,139 entities were registered. Once the following logs appear on the backend, entities stop registering. However, after killing the agent-bench processes, entities do deregister themselves.

Mar 20 20:57:43 load-test-backend sensu-backend[20881]: {"component":"agentd","level":"warning","msg":"stopping session \u003c%s\u003e: recv error: 696d2136-04c7-4128-bc80-3a3b48fd38f4Connection error: websocket: close 1006 (abnormal closure): unexpected EOF","time":"2018-03-20T20:57:43Z"}
Mar 20 20:57:43 load-test-backend sensu-backend[20881]: {"component":"agentd","level":"info","msg":"session disconnected - stopping recvPump","time":"2018-03-20T20:57:43Z"}
Mar 20 20:57:43 load-test-backend sensu-backend[20881]: {"component":"agentd","level":"info","msg":"shutting down - stopping sendPump","time":"2018-03-20T20:57:43Z"}
Mar 20 20:57:43 load-test-backend sensu-backend[20881]: {"component":"agentd","level":"info","msg":"shutting down - stopping subPump","time":"2018-03-20T20:57:43Z"}
Mar 20 20:57:43 load-test-backend sensu-backend[20881]: {"component":"agentd","level":"debug","msg":"Unsubscribing from topic \"sensu:check:default:default:default\"","time":"2018-03-20T20:57:43Z"}
Mar 20 20:57:43 load-test-backend sensu-backend[20881]: {"component":"agentd","level":"debug","msg":"Unsubscribing from topic \"sensu:check:default:default:entity:6b5d1424-72ea-4b60-b823-9beeb3d27566\"","time":"2018-03-20T20:57:43Z"}

...

Mar 20 20:58:09 load-test-backend sensu-backend[20881]: {"component":"agentd","level":"error","msg":"transport error on websocket upgrade: websocket: client sent data before handshake is complete","time":"2018-03-20T20:58:09Z"}
Mar 20 20:58:09 load-test-backend sensu-backend[20881]: {"component":"etcd","level":"info","msg":"http: response.WriteHeader on hijacked connection\n","pkg":"","time":"2018-03-20T20:58:09Z"}
Mar 20 20:58:09 load-test-backend sensu-backend[20881]: {"component":"etcd","level":"info","msg":"http: response.Write on hijacked connection\n","pkg":"","time":"2018-03-20T20:58:09Z"}

[image attachment: profile001]

grepory commented 6 years ago

Oh! This trace is incredibly helpful. I think what's happening is that we take a read lock in the message bus and never release it. We just need a better mechanism for managing subscribers.
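(A contrived sketch of the suspected failure mode, under the assumption that the bus holds its read lock while doing blocking channel sends; this is not the actual sensu-go bus code.)

```go
// Contrived sketch of the suspected failure mode; not sensu-go's actual
// message bus implementation.
package messaging

import "sync"

type bus struct {
	mu          sync.RWMutex
	subscribers map[string][]chan interface{}
}

// Publish takes the read lock, then performs blocking sends to every
// subscriber channel. If one subscriber is stuck (full channel, dead
// consumer), the send blocks forever while the read lock is still held.
func (b *bus) Publish(topic string, msg interface{}) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, ch := range b.subscribers[topic] {
		ch <- msg // can block indefinitely under the read lock
	}
}

// Subscribe needs the write lock, so once a Publish above is wedged,
// new agents can no longer register their sessions with the bus.
func (b *bus) Subscribe(topic string, ch chan interface{}) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.subscribers[topic] = append(b.subscribers[topic], ch)
}
```

In that shape, one stuck subscriber (for example, a session that died without draining its channel) wedges every publisher on the read lock, and Subscribe/Unsubscribe, which need the write lock, queue up behind them. Better subscriber management would copy the subscriber list and send outside the lock, or use non-blocking sends with an explicit drop path.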

nikkictl commented 6 years ago

Throwing this back a column on the project board after discussing a potential refactor with @grepory.

echlebek commented 6 years ago

Oops, I closed this out as #1264 the other day. Fixed!