Closed by blufor 5 years ago
Thanks a ton for filing the issue. Looking at the goroutine analysis in the trace, it appears we have a smoking gun:
```
github.com/sensu/sensu-go/backend/etcd.(*BackendIDGetter).retryAcquireLease N=1 N=19718
```
20k goroutines trying to acquire a lease on the backend id. We'll get this slated for the next release. Thanks!
I misread the trace.
There's only 1 goroutine for `BackendIDGetter`. That second figure, N=19718, is the total number of goroutines.
```
github.com/sensu/sensu-go/vendor/github.com/coreos/etcd/clientv3.(*watchGrpcStream).serveSubstream N=19689
```
That's the culprit. And it looks like it's happening in eventd. Looking at that now.
@blufor -- do you happen to be using checks with TTLs?
That might explain it. Regardless, can you give us your configured checks?
btw, I've modified the config so that ttl >= interval + timeout.
Here's a snippet of the check set for one environment:
```json
{"type":"CheckConfig","spec":{"name":"filesystem","command":"check-fs","interval":60,"timeout":30,"ttl":120,"runtime_assets":["sensu-atc-assets"],"subscriptions":["system"],"publish":true,"organization":"atc","environment":"do"}}
{"type":"CheckConfig","spec":{"name":"loadavg","command":"check-load","interval":60,"timeout":30,"ttl":120,"runtime_assets":["sensu-atc-assets"],"subscriptions":["system"],"publish":true,"organization":"atc","environment":"do"}}
{"type":"CheckConfig","spec":{"name":"memory","command":"check-memory","interval":30,"timeout":30,"ttl":90,"runtime_assets":["sensu-atc-assets"],"subscriptions":["system"],"publish":true,"organization":"atc","environment":"do"}}
{"type":"CheckConfig","spec":{"name":"puppet-agent-service","command":"check-ps -f '/opt/puppetlabs/puppet/bin/ruby /opt/puppetlabs/puppet/bin/puppet agent'","interval":30,"timeout":30,"ttl":90,"runtime_assets":["sensu-atc-assets"],"subscriptions":["system"],"publish":true,"organization":"atc","environment":"do"}}
{"type":"CheckConfig","spec":{"name":"sshd-listen","command":"check-tcp -p 22","interval":30,"timeout":30,"ttl":90,"runtime_assets":["sensu-atc-assets"],"subscriptions":["system"],"publish":true,"organization":"atc","environment":"do"}}
{"type":"CheckConfig","spec":{"name":"ssh-service","command":"check-ps -f /usr/bin/sshd","interval":30,"timeout":30,"ttl":90,"runtime_assets":["sensu-atc-assets"],"subscriptions":["system"],"publish":true,"organization":"atc","environment":"do"}}
```
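The constraint applied above (ttl >= interval + timeout) can be captured in a small validation helper. This is a hypothetical sketch, not part of Sensu's codebase; `validateTTL` is an assumed name:

```go
package main

import "fmt"

// validateTTL enforces ttl >= interval + timeout (all in seconds), so a
// check's TTL cannot expire while a scheduled run is still in flight.
func validateTTL(interval, timeout, ttl int) error {
	if ttl < interval+timeout {
		return fmt.Errorf("ttl %d must be >= interval %d + timeout %d",
			ttl, interval, timeout)
	}
	return nil
}

func main() {
	fmt.Println(validateTTL(60, 30, 120)) // nil: 120 >= 90, matches the configs above
	fmt.Println(validateTTL(30, 30, 45))  // error: 45 < 60
}
```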
no change in behavior BTW (wasn't expecting any, just being thorough :wink:)
I've reproduced the issue in a testing environment. I'll be testing a patch I've made after identifying a source of goroutine leaks in the monitor.
The patch seems to have alleviated the goroutine leak but heap allocations are still growing unbounded. So it looks like the goroutine leak was unrelated to the heap allocations.
It turns out there were actually two different goroutine leaks in the check scheduler. I've fixed them, and on my test cluster (3 backends, 75 agents) I am observing a modest heap usage of 133 MB, with around 360 goroutines. These numbers have remained stable for several hours so far. I'll leave it running for 24 hours or so to verify.
Problem description
About a month after deploying the sensu-go beta, I discovered and verified with @grepory and @echlebek on Slack that gRPC-related goroutines are leaking. The environment has about 80 agents connected to a 3-node cluster (2 cores, 8 GB RAM each). Every few hours, however, the kernel kills the process because the node's memory is exhausted, as seen in this graph
Debugging resources
```
sensu-backend version 2.0.0-beta.7-7#0ad392d, build 0ad392d15948e7cc38fdb93a9ada56aa05f42970, built 2018-10-26T17:07:52+0000
```