hashicorp / consul-template

Template rendering, notifier, and supervisor for @HashiCorp Consul and Vault data.
https://www.hashicorp.com/
Mozilla Public License 2.0
4.76k stars 782 forks source link

can't evaluate field Node in type *dependency.Node #1256

Open aclowkey opened 5 years ago

aclowkey commented 5 years ago

consul-template v0.20.0 (b709612c)

Configuration

{{ with node -}}
node:
  id: {{ .Node.Node }}
  {{- range .Services }}
  {{ .Tags }}
  {{- end -}}
{{ end }}
~                        

Command

consul-template -template 'test.template:test.txt' -once -log-level trace

Expected behavior

Some fault tolerance logic, restart, etc

Actual behavior

2019/08/13 06:16:17.371648 [WARN] (view) catalog.node: Unexpected response code: 500 (No known Consul servers) (retry attempt 1 after "250ms") 2019/08/13 06:16:17.629392 [WARN] (view) catalog.node: Unexpected response code: 500 (No known Consul servers) (retry attempt 2 after "500ms") 2019/08/13 06:16:18.132668 [WARN] (view) catalog.node: Unexpected response code: 500 (No known Consul servers) (retry attempt 3 after "1s") 2019/08/13 06:16:19.137675 [WARN] (view) catalog.node: Unexpected response code: 500 (No known Consul servers) (retry attempt 4 after "2s") 2019/08/13 06:16:21.144354 [WARN] catalog.node: no node exists with the name "" 2019/08/11 07:44:01.325185 [ERR] (cli) /opt/xds/services.csv.gotemplate: execute: template: :2:21: executing "" at <.Node.Node>: can't evaluate field Node in type *dependency.Node

Steps to reproduce

  1. Stop agent
  2. Immediately run consul-template command

I was only able to reproduce this by stopping the agent, but it also happens otherwise, such as before a quorum is ready etc.

eikenb commented 5 years ago

Hey @aclowkey, thanks for the bug report.

However, I am unable to reproduce this. Is there some timing involved or some other hint you can give to help me reproduce it? When I run that template with no consul running, it just retries indefinitely and then immediately renders when I start up consul. I never get an error.

Thanks.

aclowkey commented 5 years ago

Hi @eikenb

Yes, the scenario you are describing is working. However, try the one described in "steps to reproduce", that is: run consul-template quickly after stopping the agent.

My suspicion is the agent de-registers itself so the node data doesn't appear there.

Also note that it's just a way to reproduce a symptom, the real issue isn't when the agent is stopped but at some other state which I'm not entirely sure of. At this moment, we're trying work around this by waiting for consul to be ready before any consul-template is run.

eikenb commented 5 years ago

Thanks for the additional details.

I did try the steps to reproduce, but couldn't get the timing right. It was always either completely down and retrying or up and working. I wasn't able to hit that middle state you are. How did you stop your agent, maybe that matters?

Maybe I can just make it more resilient to unexpected responses in general.

If you figure out anything else about reproducing it please let me know.

Thanks.

eikenb commented 5 years ago

It might help to see the log output of it being run at the trace log level... maybe you could post that as well.

/opt/consul-template/bin/consul-template -once \
    -log-level=trace \
    -config=/opt/consul-template/conf/cfg.hcl

Thanks.

aclowkey commented 5 years ago

I found a different example, see in the edited topic.

When running in trace:

2019/08/21 11:00:58.773993 [INFO] consul-template v0.20.0 (b709612c) 2019/08/21 11:00:58.774020 [INFO] (runner) creating new runner (dry: false, once: true) 2019/08/21 11:00:58.774621 [DEBUG] (runner) final config: {"Consul":{"Address":"","Auth":{"Enabled":false,"Username":"","Password":""},"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":true},"Token":"","Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":3,"TLSHandshakeTimeout":10000000000}},"Dedup":{"Enabled":false,"MaxStale":2000000000,"Prefix":"consul-template/dedup/","TTL":15000000000},"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":0},"KillSignal":2,"LogLevel":"trace","MaxStale":2000000000,"PidFile":"","ReloadSignal":1,"Syslog":{"Enabled":false,"Facility":"LOCAL0"},"Templates":[{"Backup":false,"Command":"","CommandTimeout":30000000000,"Contents":"","CreateDestDirs":true,"Destination":"test.txt","ErrMissingKey":false,"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":30000000000},"Perms":0,"Source":"test.template","Wait":{"Enabled":false,"Min":0,"Max":0},"LeftDelim":"","RightDelim":""}],"Vault":{"Address":"","Enabled":false,"Grace":15000000000,"Namespace":"","RenewToken":true,"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":true,"Key":"","ServerName":"","Verify":true},"Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":3,"TLSHandshakeTimeout":10000000000},"UnwrapToken":false},"Wait":{"Enabled":false,"Min":0,"Max":0}} 2019/08/21 11:00:58.774703 [INFO] (runner) creating watcher 2019/08/21 11:00:58.774912 [INFO] (runner) starting 2019/08/21 11:00:58.774931 [DEBUG] (runner) running initial templates 2019/08/21 11:00:58.774937 [DEBUG] (runner) initiating run 2019/08/21 11:00:58.774994 [DEBUG] (runner) checking template 95e58447ff057cc5586ec79d8a1c1992 2019/08/21 11:00:58.775296 [DEBUG] (runner) was not watching 1 dependencies 2019/08/21 11:00:58.775348 [DEBUG] (watcher) adding catalog.node 2019/08/21 11:00:58.775365 [TRACE] (watcher) catalog.node starting 2019/08/21 11:00:58.775380 [DEBUG] (runner) diffing and updating dependencies 2019/08/21 11:00:58.775393 [DEBUG] (runner) watching 1 dependencies 2019/08/21 11:00:58.775424 [TRACE] (view) catalog.node starting fetch 2019/08/21 11:00:58.775440 [TRACE] catalog.node: getting local agent name 2019/08/21 11:00:58.779839 [TRACE] catalog.node: GET /v1/catalog/node/api-preemptible-xnrv?stale=true&wait=1m0s 2019/08/21 11:00:58.781155 [TRACE] catalog.node: returned response 2019/08/21 11:00:58.781176 [WARN] catalog.node: no node exists with the name "api-preemptible-xnrv" 2019/08/21 11:00:58.781192 [TRACE] (view) catalog.node marking successful data response 2019/08/21 11:00:58.781210 [TRACE] (view) catalog.node successful contact, resetting retries 2019/08/21 11:00:58.781227 [TRACE] (view) catalog.node received data 2019/08/21 11:00:58.781243 [DEBUG] (runner) receiving dependency catalog.node 2019/08/21 11:00:58.781298 [DEBUG] (runner) initiating run 2019/08/21 11:00:58.781314 [DEBUG] (runner) checking template 95e58447ff057cc5586ec79d8a1c1992 2019/08/21 11:00:58.781544 [ERR] (cli) test.template: execute: template: :3:14: executing "" at <.Node.Node>: can't evaluate field Node in type *dependency.Node

Nmishin commented 4 years ago

Hi! Can I add my 5 cents here? Seems we have the same strange issue here. We use consul-template v0.19.0 (33b34b3) consul agent 0.8.4 and Consul server 1.2.2 versions.

We have this issue in heavy loaded Consul server and consul agents in the production. I can reproduce it only if I run consul-template rendering in the loop and the same time restart the consul agent, but according to our SRE, this issue happens time-to-time when an instance goes up.

During this time there were below log messages in Consul: 2019/10/21 02:05:29 [ERR] memberlist: Conflicting address for test-instance. Mine: 10.208.122.115:8301 Theirs: 10.208.122.117:8301

But the Consul server version 1.2.2 doesn't prevent to add clients with the same name but different address (this issue was present since v1.2.3 to 1.5.1).

@aclowkey do you have some workaround for this?

We tried to add ExecStartPost to the consul agent systemd unit with checking the connection to the Consul cluster, but no luck.