Closed bmonkman closed 8 years ago
Hi @bmonkman
Can you dump the complete debug logs in a gist?
I had to restart consul-template on that box so we could use it again and right now there don't seem to be any in that degraded state. I can get some debug logs from one that's not yet in that state though if that would be helpful.
Here's one minute of debug logs from that server while not in a degraded state: https://gist.github.com/bmonkman/7cceb7c9e4cb7502ab2905665634e43c
@sethvargo I can confirm this issue, I've been fighting with it on Thursday last week and noticed this report as I started to write my own.
I want to stress that when this happens it can be quite serious, to the point of bringing a whole node down. In my case it created a very tight loop of disk reads and CPU burn, easily overwhelming smaller instances. To the point where they dropped from consul's memberlist [sic!] and I couldn't even use pre-established SSH connections to debug anything.
Some background: it happened in a small cluster in two separate DCs on Google Compute Engine:
Consul-template was running with following options for a few weeks.
consul = "127.0.0.1:8500"
max_stale = "1m"
wait = "5s:60s"
token = "-------"
retry = "10s"
template { (...) }
(...)
But during that time it handled at most 1 template per machine. As soon as I put 3-5 more templates (rendering a single, rarely changing value from K/V each,) instances started to experience very high CPU load and high disk read rates. No other metric was affected (network, disk writes, free memory, ...). Unfortunately for us it does not trigger immediately, but after 15m-120m.
To investigate I capped consul-template
(0.14.0, linux amd64, from releases.HC) to a reasonable CPU quota
[Service]
CPUAccounting=true
CPUQuota=30%
Once it triggered I was able to easily investigate, finding almost identical logs as @bmonkman's.
Stracing consul-template
revealed a flood of time related syscalls (CPU burn):
[pid 32228] clock_gettime(CLOCK_MONOTONIC, {..., ...}) = 0
[pid 32228] clock_gettime(CLOCK_REALTIME, {..., ...}) = 0
while at the same time received template
from quiescence timer
events caused consul-template to read each of the rendered files checking them for changes (very high IO read). Files did not need updating, so nothing was written to disk.
As a workaround I removed
wait = "5s:60s"
and it "fixed" the issue, suggesting that something in quiescence timers if firing all the time w/o any obvious external cause.
Do any of you use Vault as part of the cluster?
https://github.com/hashicorp/consul-template/commit/5b955a8999a29b01e301cc9b9ac7c4b7b714685f could be relevant in that case :)
Not I. (Not yet, anyway.)
On Apr 18, 2016, at 9:13 AM, Christian Winther notifications@github.com wrote:
Do any of you use Vault as part of the cluster?
5b955a8 could be relevant in that case :)
— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub
@jippi nope, not in this cluster.
I'll try to repro this with some additional instrumentation to see what's going on.
We are using 0.14 as well. This issue started this morning when we deployed another template on a set of 85 boxes.
Here is our config.
consul = "XXX:8500"
token = "YYY"
vault {
address = "https://XXX:8200"
renew = true
ssl {
enabled = true
verify = true
ca_cert = "XXX"
}
}
retry = "10s"
max_stale = "10s"
ssl {
enabled = true
verify = true
ca_cert = "XXX"
}
log_level = "info"
syslog {
enabled = true
facility = "local7"
}
deduplicate {
enabled = true
prefix = "consul-template/dedup/"
}
Then suddenly we saw consul-template CPU went from 1% to 20%. We basically saw all these boxes doing a DDOS on our consul cluster. On the consul that is the leader of the template, we saw the RPC query jumped from 200 to 3000 queries and there was a corresponding jump in the CPU usage.
Here is a snippet of the log. Note I've narrowed this down, but all of these were happening in 1 second.
2016/04/19 11:21:10 [INFO] (dedup) loading 9 dependencies from 'consul-template/dedup/XXX/data'
2016/04/19 11:21:10 [INFO] (dedup) listing data for template hash XXX
2016/04/19 11:21:10 [INFO] (runner) watcher triggered by de-duplication manager
2016/04/19 11:21:10 [INFO] (runner) running
2016/04/19 11:21:10 [INFO] (runner) diffing and updating dependencies
2016/04/19 11:21:10 [INFO] (runner) watching 0 dependencies
2016/04/19 11:21:10 [INFO] (dedup) loading 15 dependencies from 'consul-template/dedup/YYY/data'
2016/04/19 11:21:10 [INFO] (dedup) listing data for template hash YYY
2016/04/19 11:21:10 [INFO] (runner) watcher triggered by de-duplication manager
2016/04/19 11:21:10 [INFO] (runner) running
2016/04/19 11:21:10 [INFO] (runner) diffing and updating dependencies
2016/04/19 11:21:10 [INFO] (runner) watching 0 dependencies
2016/04/19 11:21:10 [INFO] (dedup) loading 9 dependencies from 'consul-template/dedup/XXX/data'
2016/04/19 11:21:10 [INFO] (dedup) listing data for template hash XXX
2016/04/19 11:21:10 [INFO] (runner) watcher triggered by de-duplication manager
2016/04/19 11:21:10 [INFO] (runner) running
2016/04/19 11:21:10 [INFO] (runner) diffing and updating dependencies
2016/04/19 11:21:10 [INFO] (runner) watching 0 dependencies
2016/04/19 11:21:10 [INFO] (dedup) loading 15 dependencies from 'consul-template/dedup/YYY/data'
2016/04/19 11:21:10 [INFO] (dedup) listing data for template hash YYY
2016/04/19 11:21:10 [INFO] (runner) watcher triggered by de-duplication manager
2016/04/19 11:21:10 [INFO] (runner) running
2016/04/19 11:21:10 [INFO] (runner) diffing and updating dependencies
2016/04/19 11:21:10 [INFO] (runner) watching 0 dependencies
2016/04/19 11:21:10 [INFO] (dedup) loading 9 dependencies from 'consul-template/dedup/XXX/data'
Tests:
watcher triggered by de-duplication manager
thinking that something is not right with the dedup manager.We might need to bring @armon in on the thread, since he's the primary author of dedupe. That being said, I opened a PR that fixes an issue with dedupe, and I'm curious if it will impact this. Could you apply https://github.com/hashicorp/consul-template/pull/611 and compile from source to see if the issue persists?
I certainly won't complain about bringing Armon in but I'm not using the dedupe option and I'm seeing this issue.
On Apr 19, 2016, at 9:42 PM, Seth Vargo notifications@github.com wrote:
We might need to bring @armon in on the thread, since he's the primary author of dedupe. That being said, I opened a PR that fixes an issue with dedupe, and I'm curious if it will impact this. Could you apply #611 and compile from source to see if the issue persists?
— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub
So our problem is not there anymore. If it appears again, I will certainly apply #611. But it feels like there is some similarities with our issue and what the others have reported... high resource usage, endless loop. At the same time, they are not the same since the log messages and workarounds are different.
Spent some time looking through the code and I think there are some possible race conditions around quiescence that could be causing this. The quiescence
tick()
uses the lock when it changes state, but start()
does not do any locking. Also, ch
is unbuffered, so start()
could get blocked writing to that channel if the main runner loop is busy, causing it to get stuck in the select
and miss the closing of the stopCh
. Since this changes the member stop channel, and this waits on the stopCh
member variable I think we can miss the close()
and start waiting on the new channel assigned to the member, leaking a goroutine.
This probably only happens with multiple templates and is probably more likely when they depend on things that change at around the same time, making their timers fire together.
Not sure quite sure yet the best way to fix this. If start()
took the minTimer
, maxTimer
, and stopCh
as variables then I think it would not be affected when we re-assign the member variables in tick()
. It might be simpler to just use a time.Timer()
though and reset it, rather than tearing down the goroutine all the time.
more likely when they depend on things that change at around the same time, making their timers fire together.
I can confirm this assumption in our case. Two of the templates share key from consul's KV.
This probably only happens with multiple templates and is probably more likely when they depend on things that change at around the same time, making their timers fire together.
This is true in our case as well.
Since I didn't have a good local repro I'll keep this open for now.
same problem for me, will try to get some logs
@slackpad I'm also seeing an issue where given two templates that would be re-rendered at the same time (for the same change in available services), one of them is rendered and the other churns in quiescence until reaching it's max wait time, even on a network that is very quiet and should not have any quiescence issues.) Could this be related?
Hey @bmonkman
Are you able to try the master
branch and see if the issue reproduces? James has made a ton of improvements in this area, but we don't have a solid reproduction case. It seems like you can reproduce this pretty reliably, so it'd be helpful to see if you can reproduce it now that we've removed the race conditions for the timers.
Yeah I'll give it a shot today.
On May 3, 2016, at 7:49 AM, Seth Vargo notifications@github.com wrote:
Hey @bmonkman
Are you able to try the master branch and see if the issue reproduces? James has made a ton of improvements in this area, but we don't have a solid reproduction case. It seems like you can reproduce this pretty reliably, so it'd be helpful to see if you can reproduce it now that we've removed the race conditions for the timers.
— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub
Awesome. I did a side-by-side test with a few servers, one of them using the master version, and that one updated both templates at the same time correctly, where the others only rendered one. I'm not able to reliably repro the resource utilization issue, but in general the info-level output looks way more sane on the new version.. Thanks a lot @sethvargo @slackpad
Hey @bmonkman et all,
I just released 0.15.0. Could you please give it a try and see if the issue is still occurring? If so, can you post the debug output as a Gist? Thanks!
I verified the fix in #616 , have been running it for a month now with no problems. Might not have time to upgrade to 0.15.0 stable for a bit though. In any case, I consider this one fixed.
Thanks for the feedback @bmonkman!
FWIW I upgraded our cluster to 0.15.0 and uncommented waits in configuration. If nothing comes up for serveral hours I'll also be able to confirm this as fixed :) Thanks all :)
I'm afraid we're still experiencing this issue with consul-template 0.15. The logs I've collected in debug mode haven't shed a light on the root cause, will keep at it!
This does however appear to be a recurring theme:
[pid 12939] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 12943] read(43, "", 512) = 0
[pid 12943] close(43) = 0
[pid 12943] clock_gettime(CLOCK_REALTIME, <unfinished ...>
[pid 12939] <... select resumed> ) = 0 (Timeout)
[pid 12943] <... clock_gettime resumed> {1468945910, 58922679}) = 0
[pid 12939] clock_gettime(CLOCK_MONOTONIC, <unfinished ...>
[pid 12943] clock_gettime(CLOCK_REALTIME, <unfinished ...>
[pid 12939] <... clock_gettime resumed> {1535466, 432192345}) = 0
[pid 12943] <... clock_gettime resumed> {1468945910, 59005315}) = 0
[pid 12939] clock_gettime(CLOCK_REALTIME, <unfinished ...>
[pid 12943] futex(0xc8201ba108, FUTEX_WAKE, 1 <unfinished ...>
[pid 12939] <... clock_gettime resumed> {1468945910, 59108394}) = 0
[pid 12944] <... futex resumed> ) = 0
[pid 12944] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 12943] <... futex resumed> ) = 1
[pid 12939] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 12944] <... select resumed> ) = 0 (Timeout)
[pid 12944] futex(0xc8201ba108, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 12943] clock_gettime(CLOCK_REALTIME, {1468945910, 59911763}) = 0
[pid 12939] <... select resumed> ) = 0 (Timeout)
[pid 12943] futex(0xc8201ba108, FUTEX_WAKE, 1 <unfinished ...>
[pid 12939] clock_gettime(CLOCK_MONOTONIC, <unfinished ...>
[pid 12944] <... futex resumed> ) = 0
[pid 12944] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 12943] <... futex resumed> ) = 1
[pid 12939] <... clock_gettime resumed> {1535466, 433244196}) = 0
[pid 12943] write(20, "PUT /v1/kv/consul-template/dedup"..., 1598 <unfinished ...>
[pid 12939] clock_gettime(CLOCK_REALTIME, <unfinished ...>
[pid 12944] <... select resumed> ) = 0 (Timeout)
[pid 12944] futex(0xc8201ba108, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 12943] <... write resumed> ) = 1598
[pid 12939] <... clock_gettime resumed> {1468945910, 60390612}) = 0
[pid 12943] futex(0xc8201ba108, FUTEX_WAKE, 1 <unfinished ...>
[pid 12939] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 12944] <... futex resumed> ) = 0
[pid 12944] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 12943] <... futex resumed> ) = 1
[pid 12943] futex(0xc82001d908, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 12939] <... select resumed> ) = 0 (Timeout)
[pid 12939] clock_gettime(CLOCK_MONOTONIC, {1535466, 433873772}) = 0
[pid 12939] clock_gettime(CLOCK_REALTIME, {1468945910, 60710385}) = 0
[pid 12939] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 12944] <... select resumed> ) = 0 (Timeout)
[pid 12944] futex(0xc8201ba108, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 12939] <... select resumed> ) = 0 (Timeout)
[pid 12939] clock_gettime(CLOCK_MONOTONIC, {1535466, 434066141}) = 0
[pid 12939] futex(0xc41970, FUTEX_WAIT, 0, {60, 0}^CProcess 12937 detached
etc
Hi @grobinson-blockchain this might be good to chase down in a separate issue linked to this one. Can you please open a new issue and share any config/repro info you can?
We are also facing the same issue. I have restarted consul -template. Next time when it occurs i will all the details. It does not happen immediately it happens after sometime restart.
On certain pools of servers I have started seeing issues with consul-template since upgrading from 0.10 to 0.14 where it will start using about 75% of the total cpu on a 2 core vm, then start ramping up memory usage until the box runs out of memory. I'm looking at a consul-template process right now that's using over 5.5GB of memory. I had to send it a HUP to turn on info level logging which seems to have calmed down the cpu usage a bit, but now it's in a state where it sits at next to no cpu usage for a while and then spikes up to 100% usage for a few seconds.
Some strangeness from the logs, after I switched to info logging, for about 10 minutes it spit about 100 log lines per second of this:
That has stopped and now it's just printing it's usual minTimer logging every 5 seconds as configured (though even that is more than I'd expect. It seems to write multiple logs about each configured template: (This is all within 1 second)
Here is the config:
10 minutes max quiescence is pretty overkill, but I increased that at one point to try to keep the number of nginx reloads to a minimum.
The cluster is pretty quiet so I wouldn't expect any quiescence issues anway. Watching the consul catalog/services api index header shows it changing very infrequently, there aren't hosts flapping or anything. All health checks have pretty static, deterministic output so shouldn't be causing churn in the catalog.
Here's some more info about the memory usage:
Any thoughts, or any more info needed?
Thanks!