hashicorp / consul-template

Template rendering, notifier, and supervisor for @HashiCorp Consul and Vault data.
https://www.hashicorp.com/
Mozilla Public License 2.0
4.76k stars 781 forks source link

retry count not reset #924

Closed mash closed 7 years ago

mash commented 7 years ago

Consul Template version

consul-template v0.18.1 (9c62737)

Configuration

wait {
  min = "5s"
  max = "10s"
}
consul {
  retry {
    enabled = true
    # 11 retries makes total 511.75 seconds
    attempts = 11
    backoff = "250ms"
  }
}

template {
  source      = "/etc/nginx/nginx.upstream.conf.template"
  destination = "/etc/nginx/nginx.upstream.conf"
  command = "/etc/consul-template/nginx-delayed-kill.sh"
  command_timeout = "60s"
  backup = true
}

log_level = "info"
kill_signal = "SIGTERM"
exec {
  command = "nginx"
  splay = "10s"
  reload_signal = "SIGHUP"
  kill_signal = "SIGQUIT"
  kill_timeout = "30s"
}

Command

consul-template \
    -consul-addr $IP:8500 \
    -config /etc/consul-template/config.hcl

Debug output

# leader election

2017/04/12 07:33:47.655224 [WARN] (view) health.service(admin|passing): Get http://172.31.44.17:8500/v1/health/service/admin?index=70444&passing=1&stale=&wait=60000ms: EOF (retry attempt 1 after "250ms")
2017/04/12 07:33:47.905813 [WARN] (view) health.service(admin|passing): Get http://172.31.44.17:8500/v1/health/service/admin?index=70444&passing=1&stale=&wait=60000ms: dial tcp 172.31.44.17:8500: getsockopt: connection refused (retry attempt 2 after "500ms")
2017/04/12 07:33:48.406343 [WARN] (view) health.service(admin|passing): Get http://172.31.44.17:8500/v1/health/service/admin?index=70444&passing=1&stale=&wait=60000ms: dial tcp 172.31.44.17:8500: getsockopt: connection refused (retry attempt 3 after "1s")
2017/04/12 07:33:49.406945 [WARN] (view) health.service(admin|passing): Get http://172.31.44.17:8500/v1/health/service/admin?index=70444&passing=1&stale=&wait=60000ms: dial tcp 172.31.44.17:8500: getsockopt: connection refused (retry attempt 4 after "2s")
2017/04/12 07:33:51.407539 [WARN] (view) health.service(admin|passing): Get http://172.31.44.17:8500/v1/health/service/admin?index=70444&passing=1&stale=&wait=60000ms: dial tcp 172.31.44.17:8500: getsockopt: connection refused (retry attempt 5 after "4s")
2017/04/12 07:33:55.408960 [WARN] (view) health.service(admin|passing): Get http://172.31.44.17:8500/v1/health/service/admin?index=70444&passing=1&stale=&wait=60000ms: dial tcp 172.31.44.17:8500: getsockopt: connection refused (retry attempt 6 after "8s")
2017/04/12 07:34:03.410952 [WARN] (view) health.service(admin|passing): Get http://172.31.44.17:8500/v1/health/service/admin?index=70444&passing=1&stale=&wait=60000ms: dial tcp 172.31.44.17:8500: getsockopt: connection refused (retry attempt 7 after "16s")
2017/04/12 07:34:19.411614 [WARN] (view) health.service(admin|passing): Get http://172.31.44.17:8500/v1/health/service/admin?index=70444&passing=1&stale=&wait=60000ms: dial tcp 172.31.44.17:8500: getsockopt: connection refused (retry attempt 8 after "32s")
2017/04/12 07:34:51.412603 [WARN] (view) health.service(admin|passing): Get http://172.31.44.17:8500/v1/health/service/admin?index=70444&passing=1&stale=&wait=60000ms: dial tcp 172.31.44.17:8500: getsockopt: connection refused (retry attempt 9 after "1m4s")

# leader elected, goes stable
# ...
# leader election

2017/04/13 00:56:24.760368 [WARN] (view) health.service(admin|passing): Get http://172.31.44.17:8500/v1/health/service/admin?index=138168&passing=1&stale=&wait=60000ms: read tcp 172.17.0.4:37582->172.31.44.17:8500: read: connection reset by peer (retry attempt 10 after "2m8s")
2017/04/13 00:58:32.760877 [WARN] (view) health.service(admin|passing): Get http://172.31.44.17:8500/v1/health/service/admin?index=138168&passing=1&stale=&wait=60000ms: dial tcp 172.31.44.17:8500: getsockopt: connection refused (retry attempt 11 after "4m16s")

# leader elected, goes stable
# ...
# leader election

2017/04/15 07:34:48.205610 [ERR] (view) health.service(admin|passing): Unexpected response code: 500 (No cluster leader) (exceeded maximum retries)
2017/04/15 07:34:48.205658 [ERR] (runner) watcher reported error: health.service(admin|passing): Unexpected response code: 500 (No cluster leader)
health.service(admin|passing): Unexpected response code: 500 (No cluster leader)

I can collect a trace log but after looking into code I think it's obvious.

Expected behavior

I'm setting the attempts config large enough so that consul-template can withstand a consul server re-election.

The retries variable in poll func https://github.com/hashicorp/consul-template/blob/master/watch/view.go#L108 should reset to 0 when consul-template receives a successful response from consul.

Actual behavior

There are few code paths in fetch func https://github.com/hashicorp/consul-template/blob/master/watch/view.go#L186 that continues the for loop when the response (I guess) should be considered successful.

For example when nothing changed within the wait=60000ms period https://github.com/hashicorp/consul-template/blob/master/watch/view.go#L218

In those continue cases, the retries variable in poll func is not reset to 0, so fetch failure sums up and eventually reaches the retries count.

Steps to reproduce

  1. Use the configuration provided in the top
  2. Kill the consul leader and start election
sethvargo commented 7 years ago

Good find!

imriz commented 5 years ago

Apologies for reviving this, but it seems the fix is incomplete, and retries are not being reset for the consul-template vault token, but only for secrets fetched from Vault. That is, nothing reset the retries here: https://github.com/hashicorp/consul-template/blob/master/dependency/vault_token.go#L71