hashicorp / consul-template

Template rendering, notifier, and supervisor for @HashiCorp Consul and Vault data.
https://www.hashicorp.com/
Mozilla Public License 2.0

One template renderer stops working while others continue normally #1203

Open jinksunk opened 5 years ago

jinksunk commented 5 years ago

Consul Template version

$ ./consul-template -v
consul-template v0.19.3 (ebf2d3d)

Config

$ cat /etc/consul_template.d/prometheus.consul_template.conf
template {
  source          = "/opt/app/prometheus/var/prometheus-rules.yml"
  destination     = "/opt/app/prometheus/var/prometheus-rules.yml.consul"
  command         = "svc -h /opt/service/prometheus"
  left_delimiter  = "[["
  right_delimiter = "]]"
}
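
The custom [[ ]] delimiters are presumably used because the rendered file is a Prometheus rules file, where {{ }} already belongs to Prometheus's own templating. A hypothetical fragment of such a source file, just to illustrate that assumption (the rule names and KV path here are made up):

groups:
  - name: canary
    rules:
      # [[ ... ]] is expanded by consul-template; {{ ... }} is left intact for Prometheus.
      - alert: NodeDown
        expr: up{instance="[[ key "prometheus/canary-instance" ]]"} == 0
        annotations:
          summary: "Instance {{ $labels.instance }} is down"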

Command

./consul-template -log-level info -kill-signal SIGTERM -consul-addr 127.0.0.1:8500 -consul-retry -consul-retry-attempts 0 -consul-retry-backoff 1s -consul-retry-max-backoff 10s -max-stale 2m -config /etc/consul_template.d -template last_updated.prom:/var/prometheus/metrics/consul_template_last_updated.prom
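
The -template flag takes the inline form source:destination[:command], so the canary template above is declared on the command line rather than in a config file. As a sketch, the same thing could also be expressed as a stanza alongside the prometheus-rules one (assuming the default {{ }} delimiters for this template, since none are overridden on the CLI):

template {
  source      = "last_updated.prom"
  destination = "/var/prometheus/metrics/consul_template_last_updated.prom"
}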

Debug output

We have not yet caught this happening with debug/trace logging enabled. The info logs below show that the canary template, which normally renders roughly every 5 minutes, stopped rendering, while the other template still rendered as expected after a key update:

2019/04/01 13:32:33.676276 [WARN] (runner) watching 370 dependencies - watching this many dependencies could DDoS your consul cluster
2019/04/01 13:37:37.669226 [INFO] (runner) initiating run
2019/04/01 13:37:37.690627 [INFO] (runner) rendered "last_updated.prom" => "/var/prometheus/metrics/consul_template_last_updated.prom"
2019/04/01 13:37:37.691205 [WARN] (runner) watching 370 dependencies - watching this many dependencies could DDoS your consul cluster
2019/04/01 13:42:46.963737 [INFO] (runner) initiating run
2019/04/01 13:42:46.998530 [INFO] (runner) rendered "last_updated.prom" => "/var/prometheus/metrics/consul_template_last_updated.prom"
2019/04/01 13:42:46.999119 [WARN] (runner) watching 370 dependencies - watching this many dependencies could DDoS your consul cluster
2019/04/01 16:27:58.180901 [INFO] (runner) initiating run
2019/04/01 16:27:58.191640 [WARN] (runner) watching 370 dependencies - watching this many dependencies could DDoS your consul cluster
2019/04/01 16:27:58.274394 [INFO] (runner) initiating run
2019/04/01 16:27:58.288077 [INFO] (runner) rendered "/opt/app/prometheus/var/prometheus-rules.yml" => "/opt/app/prometheus/var/prometheus-rules.yml.consul"
2019/04/01 16:27:58.289028 [INFO] (runner) executing command "svc -h /opt/service/prometheus" from "/opt/app/prometheus/var/prometheus-rules.yml" => "/opt/app/prometheus/var/prometheus-rules.yml.consul"
2019/04/01 16:27:58.289102 [INFO] (child) spawning: svc -h /opt/service/prometheus

Expected behavior

What should have happened? In lieu of a health-check endpoint on the server, we set up a canary template that pulls a timestamp from Consul KV and renders it to a file scraped by Prometheus, to verify that the consul-template service is still running as expected:

$ cat last_updated.prom
# HELP consul_template_last_timestamp_seconds The last timestamp pulled from consul kv by consul_template.
# TYPE consul_template_last_timestamp_seconds gauge
consul_template_last_timestamp_seconds {{ key "service-overwatch/timestamp" }}

We expected that if one renderer (i.e. the timestamp canary) faulted or stopped rendering, that would indicate the process as a whole had failed.
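
For completeness, the monitoring side of the canary works roughly like this: something periodically writes a fresh epoch timestamp to the watched KV path, consul-template re-renders the metric file, and an alert fires if the scraped value goes stale. A sketch under those assumptions (the cron entry and alert thresholds are illustrative, not taken from the original setup; only the KV path service-overwatch/timestamp is from the issue):

# Hypothetical writer: refresh the canary key every minute so the watcher has something to render.
* * * * * consul kv put service-overwatch/timestamp "$(date +%s)"

# Hypothetical Prometheus alert: fire if the rendered timestamp stops advancing.
groups:
  - name: consul-template-canary
    rules:
      - alert: ConsulTemplateStale
        expr: time() - consul_template_last_timestamp_seconds > 900
        for: 5m
        annotations:
          summary: "consul-template has not re-rendered the canary metric in over 15 minutes"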

Actual behavior

What actually happened?

This weekend, the consul-template process remained up and continued to render other templates (i.e. /opt/app/prometheus/var/prometheus-rules.yml) and produce log messages, but it no longer rendered the timestamp canary template.

Steps to reproduce

While we have not yet found a deterministic process to reproduce this scenario, our concern is that our approach to monitoring the consul-template service is not as effective as we thought.

References

sodabrew commented 5 years ago

Did this ever work on older versions of consul-template? Which ones? Try both versions 0.19.5 and 0.20.0 since they're the most recent releases.

eikenb commented 5 years ago

Hey @jinksunk, thanks for taking the time to report this.

Without a way to reproduce it I'm not sure I'll be able to do much about this other than keep an eye out for it. I think the eventual solution will be adding service monitoring for consul-template, as discussed in your referenced ticket (#570).

liebman commented 4 years ago

I have seen this too. Also with no deterministic way to reproduce. :-(

eikenb commented 4 years ago

Thanks for chiming in @liebman, if you ever get more information on this please let us know.

I'll continue to keep an eye out for what might be causing this and will update here if I find anything.