pshima / consul-snapshot

consul-snapshot is a backup and restore utility for Consul (https://www.consul.io). It differs from some other utilities in that it runs as a daemon for backups and ships them to S3. It also has integrated monitoring and backup health checks.
Apache License 2.0

Consul-snapshot doesn't close connections when agent is running with ui option #17

Closed marcoamorales closed 7 years ago

marcoamorales commented 7 years ago

Hello,

I've found a problem when running consul-snapshot on the same server as a consul agent started with the -ui option.

From what I can tell, if I'm using the -ui option, consul-snapshot doesn't close the connections it creates with the local agent. The connection count will keep growing until the server is no longer able to create more connections.

Shortened output:

# netstat -plant | grep 8500 | grep consul-snap
tcp        0      0 127.0.0.1:39110         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39690         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:38900         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:40111         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:38768         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39883         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39157         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39722         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39998         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:38577         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:38911         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39594         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39055         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39449         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39033         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39627         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39868         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
...
tcp        0      0 127.0.0.1:38603         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:38669         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39982         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:38567         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:38646         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39579         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:40064         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:40192         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:38591         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:40030         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:38977         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39461         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:38490         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39787         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:40176         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39489         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39290         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
tcp        0      0 127.0.0.1:39819         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
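
To track whether the connection count keeps growing, the netstat output above can be counted rather than read by eye. The following is a minimal Go sketch, assuming the column layout shown above (protocol, queues, local address, foreign address, state, PID/program). The function name `countEstablished` and the sample text are illustrative only and are not part of consul-snapshot.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countEstablished returns the number of lines in netstat-style output that
// are in the ESTABLISHED state and whose foreign address ends with portSuffix
// (for example ":8500"). Hypothetical helper, not part of consul-snapshot.
func countEstablished(netstatOutput, portSuffix string) int {
	count := 0
	sc := bufio.NewScanner(strings.NewReader(netstatOutput))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		// Assumed columns: proto, recv-q, send-q, local, foreign, state, pid/program.
		if len(f) < 6 {
			continue // skips elided lines such as "..."
		}
		if strings.HasSuffix(f[4], portSuffix) && f[5] == "ESTABLISHED" {
			count++
		}
	}
	return count
}

func main() {
	// Two lines copied from the report, plus the elision marker.
	sample := `tcp        0      0 127.0.0.1:39110         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho
...
tcp        0      0 127.0.0.1:39819         127.0.0.1:8500          ESTABLISHED 4004/consul-snapsho`
	fmt.Println(countEstablished(sample, ":8500"))
}
```

Running this against periodic netstat captures would show whether the count levels off or climbs with each health check.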

pshima commented 7 years ago

Hi @marcoamorales, this sounds like a bug. Thanks for the report. I will try to reproduce it as I have time.

kiwivogel commented 7 years ago

I have a rather similar issue, only in this case consul-snapshot is not running on one of the master nodes. I see a massive number of connections and correspondingly high memory usage. We have several consul client nodes with a cluster of three masters. Consul-snapshot is running on one of the client hosts.

kiwivogel commented 7 years ago

I did some more digging. This seems to be health check related. We were polling the /health endpoint with marathon. Lowering the HC interval decreases the memory/connection growth. Looks like there's an issue with the connection recycling/closing then, I guess.

sebamontini commented 7 years ago

Same issue here. Any update on this?

pshima commented 7 years ago

I think this may be part of the key to what is happening: https://github.com/hashicorp/consul/blob/master/api/api.go#L261
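
For readers following along, here is a general illustration of how this kind of connection growth can arise in Go. It is a sketch only: it does not use the Consul API, it is not the change in the linked commit, and the names `pollHealth` and `connectionsOpened` are made up for this example. A local test server stands in for the agent. The idea is that each `http.Transport` keeps its own pool of idle keep-alive connections, so building a fresh transport on every health check leaves one connection behind per call, while sharing a single client lets requests reuse a connection.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// pollHealth issues one GET, then drains and closes the body so the
// connection can return to the transport's idle pool for reuse.
func pollHealth(c *http.Client, url string) error {
	resp, err := c.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

// connectionsOpened polls a local test server n times and reports how many
// TCP connections the server accepted. With shareClient=false, every poll
// builds a new client and transport, as a per-call client constructor would.
func connectionsOpened(n int, shareClient bool) (int64, error) {
	var opened int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { fmt.Fprintln(w, "ok") }))
	srv.Config.ConnState = func(_ net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&opened, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	shared := &http.Client{Transport: &http.Transport{}}
	defer shared.CloseIdleConnections()
	for i := 0; i < n; i++ {
		c := shared
		if !shareClient {
			// A fresh transport has an empty idle pool, so it must dial again.
			c = &http.Client{Transport: &http.Transport{}}
		}
		if err := pollHealth(c, srv.URL); err != nil {
			return 0, err
		}
	}
	return atomic.LoadInt64(&opened), nil
}

func main() {
	for _, share := range []bool{false, true} {
		opened, err := connectionsOpened(20, share)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("shared client: %v, connections opened: %d\n", share, opened)
	}
}
```

In the per-call case the server should see one new connection per poll, and those connections stay ESTABLISHED until something closes them. In the shared case the count should stay far lower. If a per-call client is unavoidable, calling `CloseIdleConnections` on it after use is one way to release its connections.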

pshima commented 7 years ago

https://github.com/pshima/consul-snapshot/commit/d0b42784764f63100c89b75ff1bae2d65ec23a83

pshima commented 7 years ago

Can you build from master and see if the connection issues go away? If you could help me test this change that would be great as I don't have a long running cluster anymore.

pshima commented 7 years ago

Dropped in the new 2.3.0 release. Going to close this, but if it's still happening please reopen.

kiwivogel commented 7 years ago

Rolling out 2.3 to production today, thanks for looking into it 👍