Open danaps opened 5 years ago
It is kind of similar to this: https://github.com/hashicorp/nomad/issues/3498 but I also encountered this issue when there was only one template stanza. (Unfortunately I didn't keep the log; it was a Kibana Docker container rendering the Elasticsearch service, and only that.) @burdandrei :)
@danaps could you please post the job definition?
{
"Job": {
"AllAtOnce": false,
"Constraints": [
{
"LTarget": "${attr.kernel.name}",
"Operand": "=",
"RTarget": "linux"
},
{
"LTarget": "${node.class}",
"Operand": "=",
"RTarget": "093-asg-spot"
}
],
"CreateIndex": 263893356,
"Datacenters": [
"staging-us-east-1"
],
"Dispatched": false,
"ID": "danap-stack-ads",
"JobModifyIndex": 263893356,
"Meta": null,
"Migrate": null,
"ModifyIndex": 263895417,
"Name": "danap-stack-ads",
"Namespace": "default",
"ParameterizedJob": null,
"ParentID": "",
"Payload": null,
"Periodic": null,
"Priority": 50,
"Region": "staging-us-east-1",
"Reschedule": null,
"Stable": false,
"Status": "running",
"StatusDescription": "",
"Stop": false,
"SubmitTime": 1565692620717080477,
"TaskGroups": [
{
"Constraints": [
{
"LTarget": "",
"Operand": "distinct_hosts",
"RTarget": "true"
},
{
"LTarget": "${attr.vault.version}",
"Operand": "version",
"RTarget": ">= 0.6.1"
}
],
"Count": 1,
"EphemeralDisk": {
"Migrate": false,
"SizeMB": 300,
"Sticky": false
},
"Meta": null,
"Migrate": {
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000
},
"Name": "ads",
"ReschedulePolicy": {
"Attempts": 3,
"Delay": 30000000000,
"DelayFunction": "constant",
"Interval": 3600000000000,
"MaxDelay": 3600000000000,
"Unlimited": false
},
"RestartPolicy": {
"Attempts": 4,
"Delay": 25000000000,
"Interval": 300000000000,
"Mode": "delay"
},
"Tasks": [
{
"Artifacts": null,
"Config": {
"force_pull": true,
"image": "",
"logging": [
{
"config": [
{
"gelf-address": "udp://0.0.0.0:5000",
"tag": "{\"es_index\":\"danap\", \"nomad_job_name\":\"${NOMAD_JOB_NAME}\",\"nomad_group_name\":\"${NOMAD_GROUP_NAME}\",\"nomad_task_name\":\"${NOMAD_TASK_NAME}\",\"nomad_alloc_id\":\"${NOMAD_ALLOC_ID}\"}"
}
],
"type": "gelf"
}
],
"port_map": [
{
"unicorn": 3000.0
}
]
},
"Constraints": null,
"DispatchPayload": null,
"Driver": "docker",
"Env": {
"RESQUE_REDIS_DB": "7",
"_VERSION": "2019-08-13 10:37:00.195750017 +0000 UTC m=+0.019961268",
"CACHE_REDIS_DB": "6",
"LOGSTASH_SYNC": "true",
"MYSQL_DATABASE": "ads_danap",
"RAILS_ENV": "staging"
},
"KillSignal": "",
"KillTimeout": 5000000000,
"Leader": false,
"LogConfig": {
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "ads",
"Resources": {
"CPU": 1000,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 1000,
"Networks": [
{
"CIDR": "",
"Device": "",
"DynamicPorts": [
{
"Label": "unicorn",
"Value": 0
}
],
"IP": "",
"MBits": 10,
"ReservedPorts": null
}
]
},
"Services": [
{
"AddressMode": "auto",
"CanaryTags": null,
"CheckRestart": null,
"Checks": [
{
"AddressMode": "",
"Args": null,
"CheckRestart": null,
"Command": "",
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"Id": "",
"InitialStatus": "",
"Interval": 5000000000,
"Method": "",
"Name": "HTTP Health",
"Path": "/ping",
"PortLabel": "unicorn",
"Protocol": "",
"TLSSkipVerify": false,
"Timeout": 3000000000,
"Type": "http"
}
],
"Id": "",
"Name": "danap-stack-ads",
"PortLabel": "unicorn",
"Tags": [
]
}
],
"ShutdownDelay": 0,
"Templates": [
{
"ChangeMode": "restart",
"ChangeSignal": "",
"DestPath": "local/services.env",
"EmbeddedTmpl": "\n {{ range service \"danap-stack-redis\" }}\n CACHE_REDIS_HOST=\"{{ .Address }}\"\n CACHE_REDIS_PORT=\"{{ .Port }}\"\n RESQUE_REDIS_HOST=\"{{ .Address }}\"\n RESQUE_REDIS_PORT=\"{{ .Port }}\"\n {{ end }}\n\n {{ range service \"danap-stack-mysql\" }}\n MYSQL_HOST=\"{{ .Address }}\"\n MYSQL_PORT=\"{{ .Port }}\"\n {{ end }}\n ",
"Envvars": true,
"LeftDelim": "{{",
"Perms": "0644",
"RightDelim": "}}",
"SourcePath": "",
"Splay": 5000000000,
"VaultGrace": 15000000000
}
],
"User": "",
"Vault": {
"ChangeMode": "restart",
"ChangeSignal": "SIGHUP",
"Env": true,
"Policies": [
""
]
}
}
],
"Update": null
}
],
"Type": "service",
"Update": {
"AutoRevert": false,
"Canary": 0,
"HealthCheck": "",
"HealthyDeadline": 0,
"MaxParallel": 0,
"MinHealthyTime": 0,
"ProgressDeadline": 0,
"Stagger": 0
},
"VaultToken": "",
"Version": 0
}
}
@dadgar Can you assist please? :)
@elifish4, I'm looking at this right now.
@danaps, we've found a race condition around this use case. The issue is that when Nomad registers the upstream service (redis, in your example), it initially comes up as healthy, then the checks are registered and the service is critical, then the checks start passing and the service is healthy again. The result is that consul-template can render three times; each of these renders will try to restart the downstream container (ads, in your example). Depending on how these (restart requests, template renders, and restart attempts) all line up, the race condition can result in the running container having an incorrect environment.
This is a bug, which will be addressed. I appreciate your help in tracking this one down. In the short term, increasing the splay parameter in the template reduces the likelihood of this happening (albeit by giving the container multiple opportunities to restart). Similarly, increasing the check interval for the upstream services can have the same effect.
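For illustration only, a template stanza with a larger splay might look something like this (the splay value and service name are placeholders, not a recommendation):

template {
  destination = "local/services.env"
  env         = true
  change_mode = "restart"

  # Spread restarts over a larger random window so that back-to-back
  # renders are less likely to race with each other (the default is 5s).
  splay = "60s"

  data = <<EOF
{{ range service "danap-stack-redis" }}
CACHE_REDIS_HOST="{{ .Address }}"
CACHE_REDIS_PORT="{{ .Port }}"
{{ end }}
EOF
}

The same idea applies to the upstream job: a longer check interval in its service stanza means fewer health-state transitions per unit of time, and therefore fewer re-renders.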
I will leave this issue open and update it later before this fix is shipped.
thanks @cgbaker
@cgbaker, Thank you
The workaround we went for to solve this is to add a check in the entrypoint of the container. If the envvar is not there, the container will exit. This way the container will be restarted until the template is loaded correctly in the environment.
# In the container entrypoint: exit if the expected variable was not injected,
# so Nomad keeps restarting the task until the template environment is loaded.
if [[ -z "$CELERY_BROKER_URL" ]]; then
  exit 1
fi
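For this workaround to keep retrying until the template is loaded, the group's restart stanza needs enough attempts; a minimal sketch along those lines (the numbers are only an example, not from the job above):

restart {
  attempts = 10      # keep retrying while the expected env var is still missing
  delay    = "15s"
  interval = "5m"
  mode     = "delay" # once attempts are exhausted, wait out the interval and retry
}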
When are you going to merge the fix? @schmichael @cgbaker
@schmichael @cgbaker, any news about this issue?
The bug still occurs on Nomad version 0.11.4.
How is this not a critical bug? https://github.com/hashicorp/nomad/issues/3498 was addressed pretty fast.
Having a template re-rendered but the environment not updated is a serious problem for anyone running 12-factor-style apps.
@burdandrei we have investigated this one at length and it's decidedly non-trivial; we have to trade off resources between any given bug, support for enterprise customers, and other feature development. The workaround that @jtrivino95 provided above will handle this case, and in general it's good practice for applications to assert their dependencies. As you might imagine, bugs with a reasonable workaround tend to fall behind those without.
Just to dump the results of some of our investigations here... the bug is reproducible, albeit not 100% of the time, with the following set of jobs:

Watching the environment inside the nginx container: watch -n 0.1 'docker ps | grep nginx | awk "{print \$1}" | xargs -I_ docker exec _ env | grep UPSTREAM'

Watching the rendered template file inside the nginx container: watch -n 0.1 'docker ps | grep nginx | awk "{print \$1}" | xargs -I_ docker exec _ cat /local/services.env'

At that point (after killing the upstream task), the nginx allocation will restart without the killed upstream. After the killed task is restarted, it should show up in the rendered template.
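The exact jobs used in that reproduction aren't included here; purely as a hypothetical sketch (all job names, images, ports, and values below are made up, not the originals), the shape is an upstream service job plus a downstream job whose template renders the upstream address into its environment:

job "upstream" {
  datacenters = ["dc1"]

  group "redis" {
    task "redis" {
      driver = "docker"

      config {
        image = "redis:5"
        port_map {
          db = 6379
        }
      }

      resources {
        network {
          mbits = 10
          port "db" {}
        }
      }

      service {
        name = "upstream-redis"
        port = "db"

        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

job "downstream" {
  datacenters = ["dc1"]

  group "nginx" {
    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:1.17"
      }

      # Re-rendering this template restarts the task; the race is between
      # the render, the restart request, and the restart itself.
      template {
        destination = "local/services.env"
        env         = true
        change_mode = "restart"

        data = <<EOF
{{ range service "upstream-redis" }}
UPSTREAM="{{ .Address }}:{{ .Port }}"
{{ end }}
EOF
      }
    }
  }
}

Killing the upstream task (e.g. docker kill on its container) then triggers the re-render and restart race on the downstream task.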
The diagram below illustrates the window of the race:
As you can see, it's not just a matter of atomically registering the Consul service and the health check at the same time (which would be #3935). And it's not quite as easy as merging some data structures as we did with #3498. But if this is an area that interests you, we'd love to hear your ideas! Thanks!
Thanks for such a comprehensive answer, @tgross! Sorry if it looked like I was trying to manage the Nomad team's resources; it's clear the team has a lot on its plate. To be honest, I never dug into the code to follow the template flow; I just assumed that after every re-render a restart is queued, to ensure that all the environment variables are in place.
P.S. the timeline graph is awesome!
Any updates on this bug? Is it fixed in newer versions? We are facing this issue in Nomad version 1.8.0.
Hi @jpatidar30, there are no updates currently. When an engineer is assigned to this and working on it, updates will be provided in the issue.
Nomad version
Nomad v0.9.4 (a81aa846a45fb8248551b12616287cb57c418cd6)
(encountered the same issue on 0.9.3)
Operating system and Environment details
Ubuntu 16.04.6 LTS
4.4.0-1085-aws #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Issue
When using range service in a Nomad template stanza with env = true, the output is written to the destination file, but the environment variables are not always loaded as expected. The issue happens sporadically.
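For reference, a minimal template stanza of the kind described might look like this (illustrative only, not the reporter's exact configuration):

template {
  destination = "local/services.env"
  env         = true        # load the rendered KEY=value pairs into the task environment
  change_mode = "restart"   # restart the task whenever the template re-renders

  data = <<EOF
{{ range service "redis" }}
REDIS_HOST="{{ .Address }}"
REDIS_PORT="{{ .Port }}"
{{ end }}
EOF
}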
Reproduction steps
I have a task that watches a service (redis) in Consul. I killed the Docker container (docker kill) that runs the redis service. The redis task came back up, and the task watching redis restarted as well (using the default change_mode of restart). The file was re-rendered, but the env vars did not load. It happens sporadically; I should add that I had to kill redis 3 or 4 times until the issue reproduced.
Nomad Client logs (if appropriate)
The task that watches redis:
The task that runs the redis service:
Again, the task that watches redis; you can see that there are now missing env vars:
Nomad logs: