Resource utilization issues

bmonkman commented 8 years ago

On certain pools of servers I have started seeing issues with consul-template since upgrading from 0.10 to 0.14 where it will start using about 75% of the total cpu on a 2 core vm, then start ramping up memory usage until the box runs out of memory. I'm looking at a consul-template process right now that's using over 5.5GB of memory. I had to send it a HUP to turn on info level logging which seems to have calmed down the cpu usage a bit, but now it's in a state where it sits at next to no cpu usage for a while and then spikes up to 100% usage for a few seconds.

Some strangeness from the logs, after I switched to info logging, for about 10 minutes it spit about 100 log lines per second of this:

Apr 14 11:10:34 ### consul-template[11885]: (runner) quiescence maxTimer fired for /###/service-discovery-ingress-egress.conf
Apr 14 11:10:34 ### consul-template[11885]: (runner) quiescence maxTimer fired for /###/service-discovery-ingress-egress-backends.conf
Apr 14 11:10:34 ### consul-template[11885]: (runner) quiescence maxTimer fired for /###/service-discovery-ingress-egress-multi-dc-backends.conf

That has stopped and now it's just printing it's usual minTimer logging every 5 seconds as configured (though even that is more than I'd expect. It seems to write multiple logs about each configured template: (This is all within 1 second)

Apr 14 11:45:09 ### consul-template[11885]: (runner) quiescence minTimer fired for /###/service-discovery-ingress-egress-backends.conf
Apr 14 11:45:09 ### consul-template[11885]: (runner) received template "/###/service-discovery-ingress-egress-backends.conf" from quiescence
Apr 14 11:45:09 ### consul-template[11885]: (runner) running
Apr 14 11:45:09 ### consul-template[11885]: (runner) diffing and updating dependencies
Apr 14 11:45:09 ### consul-template[11885]: (runner) watching 3 dependencies
Apr 14 11:45:09 ### consul-template[11885]: (runner) quiescence minTimer fired for /###/service-discovery-ingress-egress-multi-dc-backends.conf
Apr 14 11:45:09 ### consul-template[11885]: (runner) received template "/###/service-discovery-ingress-egress-multi-dc-backends.conf" from quiescence
Apr 14 11:45:09 ### consul-template[11885]: (runner) running
Apr 14 11:45:09 ### consul-template[11885]: (runner) quiescence minTimer fired for /###/service-discovery-ingress-egress.conf
Apr 14 11:45:09 ### consul-template[11885]: (runner) diffing and updating dependencies
Apr 14 11:45:09 ### consul-template[11885]: (runner) watching 3 dependencies
Apr 14 11:45:09 ### consul-template[11885]: (runner) received template "/###/service-discovery-ingress-egress.conf" from quiescence
Apr 14 11:45:09 ### consul-template[11885]: (runner) running
Apr 14 11:45:09 ### consul-template[11885]: (runner) diffing and updating dependencies
Apr 14 11:45:09 ### consul-template[11885]: (runner) watching 3 dependencies
Apr 14 11:45:09 ### consul-template[11885]: (runner) quiescence minTimer fired for /###/service-discovery-ingress-egress-multi-dc-backends.conf
Apr 14 11:45:09 ### consul-template[11885]: (runner) received template "/###/service-discovery-ingress-egress-multi-dc-backends.conf" from quiescence
Apr 14 11:45:09 ### consul-template[11885]: (runner) running
Apr 14 11:45:09 ### consul-template[11885]: (runner) diffing and updating dependencies
Apr 14 11:45:09 ### consul-template[11885]: (runner) watching 3 dependencies
Apr 14 11:45:09 ### consul-template[11885]: (runner) quiescence minTimer fired for /###/service-discovery-ingress-egress-backends.conf
Apr 14 11:45:09 ### consul-template[11885]: (runner) received template "/###/service-discovery-ingress-egress-backends.conf" from quiescence
Apr 14 11:45:09 ### consul-template[11885]: (runner) running
Apr 14 11:45:09 ### consul-template[11885]: (runner) diffing and updating dependencies
Apr 14 11:45:09 ### consul-template[11885]: (runner) watching 3 dependencies

Here is the config:

consul = "localhost:###"
retry = "10s"
max_stale = "1m"
wait = "5s:600s"
log_level = "info"

syslog {
  enabled = true
  facility = "LOCAL7"
}

token = "###"

template {
  source = "/###/service-discovery-ingress-egress-backends.conf"
  destination = "/###/service-discovery-ingress-egress-backends.conf"
  command = "service nginx reload; touch /var/tmp/nginx_reload"
  backup = true
}

template {
  source = "/###/service-discovery-ingress-egress-multi-dc-backends.conf"
  destination = "/###/service-discovery-ingress-egress-multi-dc-backends.conf"
  command = "service nginx reload; touch /var/tmp/nginx_reload"
  backup = true
}

template {
  source = "/###/service-discovery-ingress-egress.conf"
  destination = "/###/service-discovery-ingress-egress.conf"
  command = "service nginx reload; touch /var/tmp/nginx_reload"
  backup = true
}

10 minutes max quiescence is pretty overkill, but I increased that at one point to try to keep the number of nginx reloads to a minimum.

The cluster is pretty quiet so I wouldn't expect any quiescence issues anway. Watching the consul catalog/services api index header shows it changing very infrequently, there aren't hosts flapping or anything. All health checks have pretty static, deterministic output so shouldn't be causing churn in the catalog.

Here's some more info about the memory usage:

$ ps -e o pid,command,pmem,rsz,vsz k +rsz
11885 /usr/bin/consul-template 66.0 5208788 5268888

$ sudo pmap 11885
11885:   /usr/bin/consul-template
0000000000400000   4060K r-x--  /usr/bin/consul-template
00000000007f7000   4268K r----  /usr/bin/consul-template
0000000000c22000    112K rw---  /usr/bin/consul-template
0000000000c3e000    144K rw---    [ anon ]
000000c000000000   4908K rw---    [ anon ]
000000c8166aa000 5181336K rw---    [ anon ]
00007f3b10374000  33952K rw---    [ anon ]
00007f3b124b6000   9732K rw---    [ anon ]
00007f3b12e49000   8960K rw---    [ anon ]
00007f3b1373f000   4612K rw---    [ anon ]
00007f3b13bfa000   2304K rw---    [ anon ]
00007f3b13e4b000   2048K rw---    [ anon ]
00007f3b14081000   1280K rw---    [ anon ]
00007f3b141e5000    768K rw---    [ anon ]
00007f3b142bd000   3712K rw---    [ anon ]
00007f3b1466d000    256K rw---    [ anon ]
00007fff2e6eb000     84K rw---    [ stack ]
00007fff2e7ff000      4K r-x--    [ anon ]
ffffffffff600000      4K r-x--    [ anon ]
 total          5262544K

$ sudo cat /proc/11885/status
Name:   consul-template
State:  S (sleeping)
Tgid:   11885
Pid:    11885
PPid:   1
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
FDSize: 64
Groups:
VmPeak:  5379036 kB
VmSize:  5379036 kB
VmLck:         0 kB
VmHWM:   5366404 kB
VmRSS:   5366404 kB
VmData:  5370508 kB
VmStk:        88 kB
VmExe:      4060 kB
VmLib:         0 kB
VmPTE:     10524 kB
Threads:    9
SigQ:   0/59432
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: fffffffe7bfa3a20
SigIgn: 0000000000000000
SigCgt: fffffffe7fc1feff
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed:   3
Cpus_allowed_list:  0-1
Mems_allowed:   1
Mems_allowed_list:  0
voluntary_ctxt_switches:    27
nonvoluntary_ctxt_switches: 4

Any thoughts, or any more info needed?

Thanks!

sethvargo commented 8 years ago

Hi @bmonkman

Can you dump the complete debug logs in a gist?

bmonkman commented 8 years ago

I had to restart consul-template on that box so we could use it again and right now there don't seem to be any in that degraded state. I can get some debug logs from one that's not yet in that state though if that would be helpful.

bmonkman commented 8 years ago

Here's one minute of debug logs from that server while not in a degraded state: https://gist.github.com/bmonkman/7cceb7c9e4cb7502ab2905665634e43c

wuub commented 8 years ago

@sethvargo I can confirm this issue, I've been fighting with it on Thursday last week and noticed this report as I started to write my own.

I want to stress that when this happens it can be quite serious, to the point of bringing a whole node down. In my case it created a very tight loop of disk reads and CPU burn, easily overwhelming smaller instances. To the point where they dropped from consul's memberlist [sic!] and I couldn't even use pre-established SSH connections to debug anything.

Some background: it happened in a small cluster in two separate DCs on Google Compute Engine:

main, 3 dedicated consul (0.6.4) servers, 30+ nodes, stable state (no flapping services, no writes to KV, almost no activity on raft AFAICT)
spoke DC, 1 server (a single node handling both consul server and app node role, bootstrap_expect=1), also extremely stable state

Consul-template was running with following options for a few weeks.

consul = "127.0.0.1:8500"
max_stale = "1m"
wait = "5s:60s"
token = "-------"
retry = "10s"

template { (...) }
(...)

But during that time it handled at most 1 template per machine. As soon as I put 3-5 more templates (rendering a single, rarely changing value from K/V each,) instances started to experience very high CPU load and high disk read rates. No other metric was affected (network, disk writes, free memory, ...). Unfortunately for us it does not trigger immediately, but after 15m-120m.

To investigate I capped consul-template (0.14.0, linux amd64, from releases.HC) to a reasonable CPU quota

[Service]
CPUAccounting=true
CPUQuota=30%

Once it triggered I was able to easily investigate, finding almost identical logs as @bmonkman's. Stracing consul-template revealed a flood of time related syscalls (CPU burn):

[pid 32228] clock_gettime(CLOCK_MONOTONIC, {..., ...}) = 0
[pid 32228] clock_gettime(CLOCK_REALTIME, {..., ...}) = 0

while at the same time received template from quiescence timer events caused consul-template to read each of the rendered files checking them for changes (very high IO read). Files did not need updating, so nothing was written to disk.

As a workaround I removed

wait = "5s:60s"

and it "fixed" the issue, suggesting that something in quiescence timers if firing all the time w/o any obvious external cause.

jippi commented 8 years ago

Do any of you use Vault as part of the cluster?

https://github.com/hashicorp/consul-template/commit/5b955a8999a29b01e301cc9b9ac7c4b7b714685f could be relevant in that case :)

bmonkman commented 8 years ago

Not I. (Not yet, anyway.)

On Apr 18, 2016, at 9:13 AM, Christian Winther notifications@github.com wrote:

Do any of you use Vault as part of the cluster?

5b955a8 could be relevant in that case :)

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub

wuub commented 8 years ago

@jippi nope, not in this cluster.

slackpad commented 8 years ago

I'll try to repro this with some additional instrumentation to see what's going on.

jcua commented 8 years ago

We are using 0.14 as well. This issue started this morning when we deployed another template on a set of 85 boxes.

Here is our config.

consul = "XXX:8500"
token = "YYY"

vault {
  address = "https://XXX:8200"
  renew   = true
  ssl {
    enabled = true
    verify  = true
    ca_cert = "XXX"
  }
}

retry = "10s"
max_stale = "10s"

ssl {
  enabled = true
  verify  = true
  ca_cert = "XXX"
}

log_level = "info"
syslog {
  enabled  = true
  facility = "local7"
}

deduplicate {
  enabled = true
  prefix  = "consul-template/dedup/"
}

Then suddenly we saw consul-template CPU went from 1% to 20%. We basically saw all these boxes doing a DDOS on our consul cluster. On the consul that is the leader of the template, we saw the RPC query jumped from 200 to 3000 queries and there was a corresponding jump in the CPU usage.

Here is a snippet of the log. Note I've narrowed this down, but all of these were happening in 1 second.

2016/04/19 11:21:10 [INFO] (dedup) loading 9 dependencies from 'consul-template/dedup/XXX/data'
2016/04/19 11:21:10 [INFO] (dedup) listing data for template hash XXX
2016/04/19 11:21:10 [INFO] (runner) watcher triggered by de-duplication manager
2016/04/19 11:21:10 [INFO] (runner) running
2016/04/19 11:21:10 [INFO] (runner) diffing and updating dependencies
2016/04/19 11:21:10 [INFO] (runner) watching 0 dependencies
2016/04/19 11:21:10 [INFO] (dedup) loading 15 dependencies from 'consul-template/dedup/YYY/data'
2016/04/19 11:21:10 [INFO] (dedup) listing data for template hash YYY
2016/04/19 11:21:10 [INFO] (runner) watcher triggered by de-duplication manager
2016/04/19 11:21:10 [INFO] (runner) running
2016/04/19 11:21:10 [INFO] (runner) diffing and updating dependencies
2016/04/19 11:21:10 [INFO] (runner) watching 0 dependencies
2016/04/19 11:21:10 [INFO] (dedup) loading 9 dependencies from 'consul-template/dedup/XXX/data'
2016/04/19 11:21:10 [INFO] (dedup) listing data for template hash XXX
2016/04/19 11:21:10 [INFO] (runner) watcher triggered by de-duplication manager
2016/04/19 11:21:10 [INFO] (runner) running
2016/04/19 11:21:10 [INFO] (runner) diffing and updating dependencies
2016/04/19 11:21:10 [INFO] (runner) watching 0 dependencies
2016/04/19 11:21:10 [INFO] (dedup) loading 15 dependencies from 'consul-template/dedup/YYY/data'
2016/04/19 11:21:10 [INFO] (dedup) listing data for template hash YYY
2016/04/19 11:21:10 [INFO] (runner) watcher triggered by de-duplication manager
2016/04/19 11:21:10 [INFO] (runner) running
2016/04/19 11:21:10 [INFO] (runner) diffing and updating dependencies
2016/04/19 11:21:10 [INFO] (runner) watching 0 dependencies
2016/04/19 11:21:10 [INFO] (dedup) loading 9 dependencies from 'consul-template/dedup/XXX/data'

Tests:

We disabled the "deduplicate" option and restarted consul-template on one of the boxes and the problem went away. But when we re-enabled "deduplicate" option on this same box, the same problem came back. We did this because we saw this line: watcher triggered by de-duplication manager thinking that something is not right with the dedup manager.
Delete one of the template key (this corresponds to the newer template that we deployed). The problem became worse. On the consul that is the leader of the template, the RPC query jumped to 6000 queries and CPU usage became twice.
We slowly restarted the boxes with "deduplicate" option off. After disabling this and restarting consul-template on 12 boxes, we noticed the problem went away and RPC query on the consul server went back to 200. Then we were thinking that maybe there was a limit that we reached or something... so we re-enabled "deduplicate" option on those 12 boxes and started restarting consul-template on all of them (one at a time)... and the problem never showed up anymore.

sethvargo commented 8 years ago

We might need to bring @armon in on the thread, since he's the primary author of dedupe. That being said, I opened a PR that fixes an issue with dedupe, and I'm curious if it will impact this. Could you apply https://github.com/hashicorp/consul-template/pull/611 and compile from source to see if the issue persists?

bmonkman commented 8 years ago

I certainly won't complain about bringing Armon in but I'm not using the dedupe option and I'm seeing this issue.

On Apr 19, 2016, at 9:42 PM, Seth Vargo notifications@github.com wrote:

We might need to bring @armon in on the thread, since he's the primary author of dedupe. That being said, I opened a PR that fixes an issue with dedupe, and I'm curious if it will impact this. Could you apply #611 and compile from source to see if the issue persists?

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub

jcua commented 8 years ago

So our problem is not there anymore. If it appears again, I will certainly apply #611. But it feels like there is some similarities with our issue and what the others have reported... high resource usage, endless loop. At the same time, they are not the same since the log messages and workarounds are different.

slackpad commented 8 years ago

Spent some time looking through the code and I think there are some possible race conditions around quiescence that could be causing this. The quiescence tick() uses the lock when it changes state, but start() does not do any locking. Also, ch is unbuffered, so start() could get blocked writing to that channel if the main runner loop is busy, causing it to get stuck in the select and miss the closing of the stopCh. Since this changes the member stop channel, and this waits on the stopCh member variable I think we can miss the close() and start waiting on the new channel assigned to the member, leaking a goroutine.

This probably only happens with multiple templates and is probably more likely when they depend on things that change at around the same time, making their timers fire together.

Not sure quite sure yet the best way to fix this. If start() took the minTimer, maxTimer, and stopCh as variables then I think it would not be affected when we re-assign the member variables in tick(). It might be simpler to just use a time.Timer() though and reset it, rather than tearing down the goroutine all the time.

wuub commented 8 years ago

more likely when they depend on things that change at around the same time, making their timers fire together.

I can confirm this assumption in our case. Two of the templates share key from consul's KV.

bmonkman commented 8 years ago

This probably only happens with multiple templates and is probably more likely when they depend on things that change at around the same time, making their timers fire together.

This is true in our case as well.

slackpad commented 8 years ago

Since I didn't have a good local repro I'll keep this open for now.

dcharbonnier commented 8 years ago

same problem for me, will try to get some logs

bmonkman commented 8 years ago

@slackpad I'm also seeing an issue where given two templates that would be re-rendered at the same time (for the same change in available services), one of them is rendered and the other churns in quiescence until reaching it's max wait time, even on a network that is very quiet and should not have any quiescence issues.) Could this be related?

sethvargo commented 8 years ago

Hey @bmonkman

Are you able to try the master branch and see if the issue reproduces? James has made a ton of improvements in this area, but we don't have a solid reproduction case. It seems like you can reproduce this pretty reliably, so it'd be helpful to see if you can reproduce it now that we've removed the race conditions for the timers.

bmonkman commented 8 years ago

Yeah I'll give it a shot today.

On May 3, 2016, at 7:49 AM, Seth Vargo notifications@github.com wrote:

Hey @bmonkman

Are you able to try the master branch and see if the issue reproduces? James has made a ton of improvements in this area, but we don't have a solid reproduction case. It seems like you can reproduce this pretty reliably, so it'd be helpful to see if you can reproduce it now that we've removed the race conditions for the timers.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub

bmonkman commented 8 years ago

Awesome. I did a side-by-side test with a few servers, one of them using the master version, and that one updated both templates at the same time correctly, where the others only rendered one. I'm not able to reliably repro the resource utilization issue, but in general the info-level output looks way more sane on the new version.. Thanks a lot @sethvargo @slackpad

sethvargo commented 8 years ago

Hey @bmonkman et all,

I just released 0.15.0. Could you please give it a try and see if the issue is still occurring? If so, can you post the debug output as a Gist? Thanks!

bmonkman commented 8 years ago

I verified the fix in #616 , have been running it for a month now with no problems. Might not have time to upgrade to 0.15.0 stable for a bit though. In any case, I consider this one fixed.

slackpad commented 8 years ago

Thanks for the feedback @bmonkman!

wuub commented 8 years ago

FWIW I upgraded our cluster to 0.15.0 and uncommented waits in configuration. If nothing comes up for serveral hours I'll also be able to confirm this as fixed :) Thanks all :)

ghost commented 8 years ago

I'm afraid we're still experiencing this issue with consul-template 0.15. The logs I've collected in debug mode haven't shed a light on the root cause, will keep at it!

ghost commented 8 years ago

This does however appear to be a recurring theme:

[pid 12939] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 12943] read(43, "", 512)           = 0
[pid 12943] close(43)                   = 0
[pid 12943] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 12939] <... select resumed> )      = 0 (Timeout)
[pid 12943] <... clock_gettime resumed> {1468945910, 58922679}) = 0
[pid 12939] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 12943] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 12939] <... clock_gettime resumed> {1535466, 432192345}) = 0
[pid 12943] <... clock_gettime resumed> {1468945910, 59005315}) = 0
[pid 12939] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 12943] futex(0xc8201ba108, FUTEX_WAKE, 1 <unfinished ...>
[pid 12939] <... clock_gettime resumed> {1468945910, 59108394}) = 0
[pid 12944] <... futex resumed> )       = 0
[pid 12944] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 12943] <... futex resumed> )       = 1
[pid 12939] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 12944] <... select resumed> )      = 0 (Timeout)
[pid 12944] futex(0xc8201ba108, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 12943] clock_gettime(CLOCK_REALTIME, {1468945910, 59911763}) = 0
[pid 12939] <... select resumed> )      = 0 (Timeout)
[pid 12943] futex(0xc8201ba108, FUTEX_WAKE, 1 <unfinished ...>
[pid 12939] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 12944] <... futex resumed> )       = 0
[pid 12944] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 12943] <... futex resumed> )       = 1
[pid 12939] <... clock_gettime resumed> {1535466, 433244196}) = 0
[pid 12943] write(20, "PUT /v1/kv/consul-template/dedup"..., 1598 <unfinished ...>
[pid 12939] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 12944] <... select resumed> )      = 0 (Timeout)
[pid 12944] futex(0xc8201ba108, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 12943] <... write resumed> )       = 1598
[pid 12939] <... clock_gettime resumed> {1468945910, 60390612}) = 0
[pid 12943] futex(0xc8201ba108, FUTEX_WAKE, 1 <unfinished ...>
[pid 12939] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 12944] <... futex resumed> )       = 0
[pid 12944] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 12943] <... futex resumed> )       = 1
[pid 12943] futex(0xc82001d908, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 12939] <... select resumed> )      = 0 (Timeout)
[pid 12939] clock_gettime(CLOCK_MONOTONIC, {1535466, 433873772}) = 0
[pid 12939] clock_gettime(CLOCK_REALTIME, {1468945910, 60710385}) = 0
[pid 12939] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 12944] <... select resumed> )      = 0 (Timeout)
[pid 12944] futex(0xc8201ba108, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 12939] <... select resumed> )      = 0 (Timeout)
[pid 12939] clock_gettime(CLOCK_MONOTONIC, {1535466, 434066141}) = 0
[pid 12939] futex(0xc41970, FUTEX_WAIT, 0, {60, 0}^CProcess 12937 detached

etc

slackpad commented 8 years ago

Hi @grobinson-blockchain this might be good to chase down in a separate issue linked to this one. Can you please open a new issue and share any config/repro info you can?

skyrocknroll commented 8 years ago

We are also facing the same issue. I have restarted consul -template. Next time when it occurs i will all the details. It does not happen immediately it happens after sometime restart.

hashicorp / consul-template

Resource utilization issues #608