sensu / sensu-go


Verified leak in Backend #2314

Closed. blufor closed this issue 5 years ago.

blufor commented 5 years ago

Problem description

After about a month of running the sensu-go beta, I've discovered, and verified with @grepory and @echlebek on Slack, that gRPC-related goroutines are leaking. The environment has about 80 agents connected to a 3-node cluster (2 cores, 8 GB RAM each). Every few hours, however, the kernel kills the process due to memory exhaustion on the node, as seen in this graph:

Last week of sensu01's available memory

Debugging resources

blufor commented 5 years ago

Related Slack thread

grepory commented 5 years ago

Thanks a ton for filing the issue. Looking at the goroutine analysis in the trace, it appears we have a smoking gun:

github.com/sensu/sensu-go/backend/etcd.(*BackendIDGetter).retryAcquireLease N=1 N=19718

20k goroutines trying to acquire a lease on the backend id. We'll get this slated for the next release. Thanks!
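(For readers following along: per-stack goroutine counts like the N= figures above can also be pulled from a live backend via Go's standard pprof endpoint. The sketch below is illustrative only and assumes a process that imports net/http/pprof; the port is arbitrary and not part of sensu-go's real configuration.)

```go
// Illustrative sketch: exposing Go's pprof handlers so per-stack goroutine
// counts can be inspected on a running process. The listen address is an
// assumption, not sensu-go's actual debug setup.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// With this running, a grouped dump is available via:
	//   curl http://localhost:6060/debug/pprof/goroutine?debug=1
	// Each unique stack is printed with the number of goroutines sharing it,
	// the same kind of tally the trace's goroutine analysis reports as N=....
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```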

grepory commented 5 years ago

I misread the trace.

There's only 1 goroutine for BackendIDGetter. That second figure, N=19718, is the total number of goroutines.

github.com/sensu/sensu-go/vendor/github.com/coreos/etcd/clientv3.(*watchGrpcStream).serveSubstream N=19689

That's the culprit. And it looks like it's happening in eventd. Looking at that now.
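(As an aside for readers: each clientv3 Watch call spawns a serveSubstream goroutine that lives until its context is cancelled or the client is closed, so repeatedly opening watches without cancelling them makes these pile up. The sketch below illustrates that general pattern with assumed key names and loop structure; it is not the actual eventd code.)

```go
// Illustrative only: a watch-per-iteration pattern that leaks etcd
// serveSubstream goroutines, and a variant that cleans up after itself.
// The key and loop structure are assumptions, not sensu-go's eventd logic.
package main

import (
	"context"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func leaky(cli *clientv3.Client) {
	for {
		// BUG: a new watch is opened every second and its context is never
		// cancelled, so each iteration leaves a serveSubstream goroutine behind.
		_ = cli.Watch(context.Background(), "/sensu/example")
		time.Sleep(time.Second)
	}
}

func fixed(cli *clientv3.Client, stop <-chan struct{}) {
	// Open the watch once; cancelling the context tears the substream down.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	wch := cli.Watch(ctx, "/sensu/example")
	for {
		select {
		case <-stop:
			return
		case resp, ok := <-wch:
			if !ok {
				return
			}
			_ = resp // handle watch events here
		}
	}
}
```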

grepory commented 5 years ago

@blufor -- do you happen to be using checks with TTLs?

That might explain it. Regardless, can you give us your configured checks?

blufor commented 5 years ago

@grepory https://gist.github.com/blufor/db4544de3fa21946711509d9a6d6dd9b#file-sensuctl-create

blufor commented 5 years ago

BTW, I've modified the config so that ttl >= interval + timeout; here's a snippet of the check set for one environment:

{"type":"CheckConfig","spec":{"name":"filesystem","command":"check-fs","interval":60,"timeout":30,"ttl":120,"runtime_assets":["sensu-atc-assets"],"subscriptions":["system"],"publish":true,"organization":"atc","environment":"do"}}
{"type":"CheckConfig","spec":{"name":"loadavg","command":"check-load","interval":60,"timeout":30,"ttl":120,"runtime_assets":["sensu-atc-assets"],"subscriptions":["system"],"publish":true,"organization":"atc","environment":"do"}}
{"type":"CheckConfig","spec":{"name":"memory","command":"check-memory","interval":30,"timeout":30,"ttl":90,"runtime_assets":["sensu-atc-assets"],"subscriptions":["system"],"publish":true,"organization":"atc","environment":"do"}}
{"type":"CheckConfig","spec":{"name":"puppet-agent-service","command":"check-ps -f '/opt/puppetlabs/puppet/bin/ruby /opt/puppetlabs/puppet/bin/puppet agent'","interval":30,"timeout":30,"ttl":90,"runtime_assets":["sensu-atc-assets"],"subscriptions":["system"],"publish":true,"organization":"atc","environment":"do"}}
{"type":"CheckConfig","spec":{"name":"sshd-listen","command":"check-tcp -p 22","interval":30,"timeout":30,"ttl":90,"runtime_assets":["sensu-atc-assets"],"subscriptions":["system"],"publish":true,"organization":"atc","environment":"do"}}
{"type":"CheckConfig","spec":{"name":"ssh-service","command":"check-ps -f /usr/bin/sshd","interval":30,"timeout":30,"ttl":90,"runtime_assets":["sensu-atc-assets"],"subscriptions":["system"],"publish":true,"organization":"atc","environment":"do"}}

No change in behavior, BTW (wasn't expecting any, just being thorough :wink:).

echlebek commented 5 years ago

I've reproduced the issue in a testing environment. After identifying a source of goroutine leaks in the monitor, I've made a patch that I'll now be testing.
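(To illustrate the general shape of such a leak, with assumed type and method names rather than the real monitor code: a TTL-style monitor that starts a new timer goroutine on every reset, without ever stopping the previous one, accumulates a goroutine per reset.)

```go
// Illustrative sketch of a monitor-style goroutine leak; the names here are
// assumptions, not the actual sensu-go monitor implementation.
package main

import (
	"sync"
	"time"
)

type ttlMonitor struct {
	mu   sync.Mutex
	stop chan struct{}
}

// leakyReset starts a fresh timer goroutine on every call and never signals
// the previous one, so frequent resets with long TTLs pile up goroutines
// (and stale timers still fire their callbacks).
func (m *ttlMonitor) leakyReset(ttl time.Duration, onTimeout func()) {
	go func() {
		<-time.After(ttl)
		onTimeout()
	}()
}

// fixedReset stops the previous watcher before arming a new one.
func (m *ttlMonitor) fixedReset(ttl time.Duration, onTimeout func()) {
	m.mu.Lock()
	if m.stop != nil {
		close(m.stop)
	}
	stop := make(chan struct{})
	m.stop = stop
	m.mu.Unlock()

	go func() {
		select {
		case <-time.After(ttl):
			onTimeout()
		case <-stop:
			// reset arrived in time; exit without firing
		}
	}()
}
```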

echlebek commented 5 years ago

The patch seems to have alleviated the goroutine leak, but heap allocations are still growing without bound, so it looks like the goroutine leak was unrelated to the heap growth.
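(For readers tracking this at home: heap growth can be confirmed independently of goroutine counts by diffing heap profiles taken some hours apart. A minimal sketch, with an arbitrary output path:)

```go
// Minimal sketch of snapshotting the heap for later comparison; the output
// path is arbitrary. Two snapshots can be diffed with
//   go tool pprof -base heap-old.pb.gz heap-new.pb.gz
// to see which call sites keep accumulating live allocations.
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func dumpHeap(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // flush pending frees so the profile reflects live memory
	return pprof.WriteHeapProfile(f)
}

func main() {
	if err := dumpHeap("heap-new.pb.gz"); err != nil {
		log.Fatal(err)
	}
}
```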

echlebek commented 5 years ago

It turns out there were actually two different goroutine leaks in the check scheduler. I've fixed them, and on my test cluster (3 backends, 75 agents) I am observing a modest heap usage of 133 MB, with around 360 goroutines. These numbers have remained stable for several hours so far. I'll leave it running for 24 hours or so to verify.
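(As a closing aside, soak-test numbers like the ones quoted above can be tracked with a trivial periodic self-report; this is just a sketch, not how sensu-go instruments itself, and the interval is arbitrary.)

```go
// Sketch of a periodic self-report used to watch goroutine count and heap
// usage stay flat during a soak test.
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	for range time.Tick(time.Minute) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		log.Printf("goroutines=%d heap_inuse=%d MiB",
			runtime.NumGoroutine(), ms.HeapInuse/1024/1024)
	}
}
```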