hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

template stanza doesn't load env as expected #6112

Open danaps opened 5 years ago

danaps commented 5 years ago

Nomad version

Nomad v0.9.4 (a81aa846a45fb8248551b12616287cb57c418cd6) (encountered the same issue on 0.9.3)

Operating system and Environment details

Ubuntu 16.04.6 LTS 4.4.0-1085-aws #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Issue

When using range service in a Nomad template stanza with env = true, the output is written to the destination file, but the environment variables are not always loaded as expected. The issue happens sporadically.
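The template stanza in question looks roughly like this (a sketch in jobspec HCL; the service and variable names match the full job definition posted later in this thread):

```hcl
template {
  destination = "local/services.env"
  env         = true        # load the rendered file into the task's environment
  change_mode = "restart"   # the default

  data = <<EOT
{{ range service "danap-stack-redis" }}
CACHE_REDIS_HOST="{{ .Address }}"
CACHE_REDIS_PORT="{{ .Port }}"
RESQUE_REDIS_HOST="{{ .Address }}"
RESQUE_REDIS_PORT="{{ .Port }}"
{{ end }}

{{ range service "danap-stack-mysql" }}
MYSQL_HOST="{{ .Address }}"
MYSQL_PORT="{{ .Port }}"
{{ end }}
EOT
}
```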

Reproduction steps

I have a task that watches a service (redis) in Consul. I killed the container (docker kill) running the redis service. The redis task came back up, and the task watching redis then restarted as well (it uses the default change_mode, restart). The file was re-rendered, but the env vars were not loaded. It happens sporadically; I should add that I had to kill redis 3 or 4 times before the issue reproduced.

Nomad Client logs (if appropriate)

The task that watches redis:

docker ps |grep f5f88d7f
88500180a7b3        62b95289a942                                                         "unicorn -c config/u…"   8 seconds ago       Up 5 seconds        10.101.8.65:23101->3000/tcp, 10.101.8.65:23101->3000/udp               ads-f5f88d7f-4187-c146-eee0-0a48e895f66e

docker exec -it 88500180a7b3 env |grep REDIS
CACHE_REDIS_DB=6
CACHE_REDIS_HOST=10.101.8.200
CACHE_REDIS_PORT=31940
RESQUE_REDIS_DB=7
RESQUE_REDIS_HOST=10.101.8.200
RESQUE_REDIS_PORT=31940
docker exec -it 88500180a7b3 cat /local/services.env

CACHE_REDIS_HOST="10.101.8.200"
CACHE_REDIS_PORT="31940"
RESQUE_REDIS_HOST="10.101.8.200"
RESQUE_REDIS_PORT="31940"

MYSQL_HOST="10.101.8.116"
MYSQL_PORT="31599"

The task that runs the redis service:

docker ps |grep 68d48fe2
c5868397b5b8        87856cc39862        "docker-entrypoint.s…"   14 minutes ago      Up 14 minutes       10.101.8.200:31940->6379/tcp, 10.101.8.200:31940->6379/udp                                                               redis-68d48fe2-0743-dd26-051a-e0102fab21bc
docker kill c5868397b5b8

docker ps |grep 68d48fe2
97109d3f2dd0        87856cc39862        "docker-entrypoint.s…"   8 seconds ago       Up 7 seconds        10.101.8.200:31940->6379/tcp, 10.101.8.200:31940->6379/udp                                                               redis-68d48fe2-0743-dd26-051a-e0102fab21bc

The task that watches redis again; you can see that some env vars are now missing:

docker ps |grep f5f88d7f
8d14dab4e756        402837048690.dkr.ecr.us-east-1.amazonaws.com/ads:master              "unicorn -c config/u…"   36 seconds ago      Up 34 seconds       10.101.8.65:23101->3000/tcp, 10.101.8.65:23101->3000/udp               ads-f5f88d7f-4187-c146-eee0-0a48e895f66e
docker exec -it 8d14dab4e756 cat /local/services.env

CACHE_REDIS_HOST="10.101.8.200"
CACHE_REDIS_PORT="31940"
RESQUE_REDIS_HOST="10.101.8.200"
RESQUE_REDIS_PORT="31940"

MYSQL_HOST="10.101.8.116"
MYSQL_PORT="31599"

docker exec -it 8d14dab4e756 env |grep REDIS
CACHE_REDIS_DB=6
RESQUE_REDIS_DB=7

Nomad logs:

```
Aug 12 10:59:04 ip-10-101-8-65 nomad[4493]:     2019/08/12 10:59:04.746979 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 10:59:04 ip-10-101-8-65 nomad[4493]:     2019/08/12 10:59:04.754962 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.044Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads error="container stopped"
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.065Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads reason= delay=0s
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.072Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads @module=logmon path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stdout.fifo timestamp=2019-08-12T10:59:08.070Z
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.076Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stderr.fifo @module=logmon timestamp=2019-08-12T10:59:08.072Z
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.194Z [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=ads binds="[]string{"/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc:/alloc", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local:/local", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/secrets:/secrets"}"
Aug 12 10:59:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T10:59:08.194Z [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=ads container_name=ads-f5f88d7f-4187-c146-eee0-0a48e895f66e
Aug 12 11:00:02 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:02.604545 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:02 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:02.631092 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:00:08.419Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads error="container stopped"
Aug 12 11:00:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:00:08.712Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads reason= delay=0s
Aug 12 11:00:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:00:08.721Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stdout.fifo @module=logmon timestamp=2019-08-12T11:00:08.718Z
Aug 12 11:00:08 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:00:08.724Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads @module=logmon path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stderr.fifo timestamp=2019-08-12T11:00:08.718Z
Aug 12 11:00:28 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:28.861960 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:29 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:29.521514 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:29 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:29.523164 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:30 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:30.835442 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:37 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:37.843632 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:00:37 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:00:37.847395 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:02:09 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:02:09.075Z [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=ads binds="[]string{"/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc:/alloc", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local:/local", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/secrets:/secrets"}"
Aug 12 11:02:09 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:02:09.075Z [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=ads container_name=ads-f5f88d7f-4187-c146-eee0-0a48e895f66e
Aug 12 11:14:44 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:14:44.390644 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:14:44 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:14:44.397878 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:14:49 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:14:49.171Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads error="container stopped"
Aug 12 11:14:52 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:14:52.922Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads reason= delay=0s
Aug 12 11:14:52 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:14:52.933Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads @module=logmon path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stdout.fifo timestamp=2019-08-12T11:14:52.929Z
Aug 12 11:14:52 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:14:52.933Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f5f88d7f-4187-c146-eee0-0a48e895f66e task=ads @module=logmon path=/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc/logs/.ads.stderr.fifo timestamp=2019-08-12T11:14:52.929Z
Aug 12 11:15:11 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:11.163577 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:13 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:13.155859 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:13 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:13.156466 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:13 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:13.738310 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:13 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:13.739211 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:15:19 ip-10-101-8-65 nomad[4493]:     2019/08/12 11:15:19.124726 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local/services.env"
Aug 12 11:17:26 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:17:26.334Z [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=ads binds="[]string{"/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/alloc:/alloc", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/local:/local", "/var/lib/nomad/alloc/f5f88d7f-4187-c146-eee0-0a48e895f66e/ads/secrets:/secrets"}"
Aug 12 11:17:26 ip-10-101-8-65 nomad[4493]:     2019-08-12T11:17:26.335Z [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=ads container_name=ads-f5f88d7f-4187-c146-eee0-0a48e895f66e
```
danaps commented 5 years ago

It is somewhat similar to https://github.com/hashicorp/nomad/issues/3498, but I also encountered this issue when there was only one template stanza (unfortunately I didn't keep the logs; it was a Kibana container whose template rendered only the elasticsearch service). @burdandrei :)

burdandrei commented 5 years ago

@danaps could you please post the job definition?

danaps commented 5 years ago
```json
{
    "Job": {
        "AllAtOnce": false,
        "Constraints": [
            {
                "LTarget": "${attr.kernel.name}",
                "Operand": "=",
                "RTarget": "linux"
            },
            {
                "LTarget": "${node.class}",
                "Operand": "=",
                "RTarget": "093-asg-spot"
            }
        ],
        "CreateIndex": 263893356,
        "Datacenters": [
            "staging-us-east-1"
        ],
        "Dispatched": false,
        "ID": "danap-stack-ads",
        "JobModifyIndex": 263893356,
        "Meta": null,
        "Migrate": null,
        "ModifyIndex": 263895417,
        "Name": "danap-stack-ads",
        "Namespace": "default",
        "ParameterizedJob": null,
        "ParentID": "",
        "Payload": null,
        "Periodic": null,
        "Priority": 50,
        "Region": "staging-us-east-1",
        "Reschedule": null,
        "Stable": false,
        "Status": "running",
        "StatusDescription": "",
        "Stop": false,
        "SubmitTime": 1565692620717080477,
        "TaskGroups": [
            {
                "Constraints": [
                    {
                        "LTarget": "",
                        "Operand": "distinct_hosts",
                        "RTarget": "true"
                    },
                    {
                        "LTarget": "${attr.vault.version}",
                        "Operand": "version",
                        "RTarget": ">= 0.6.1"
                    }
                ],
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "ads",
                "ReschedulePolicy": {
                    "Attempts": 3,
                    "Delay": 30000000000,
                    "DelayFunction": "constant",
                    "Interval": 3600000000000,
                    "MaxDelay": 3600000000000,
                    "Unlimited": false
                },
                "RestartPolicy": {
                    "Attempts": 4,
                    "Delay": 25000000000,
                    "Interval": 300000000000,
                    "Mode": "delay"
                },
                "Tasks": [
                    {
                        "Artifacts": null,
                        "Config": {
                            "force_pull": true,
                            "image": "",
                            "logging": [
                                {
                                    "config": [
                                        {
                                            "gelf-address": "udp://0.0.0.0:5000",
                                            "tag": "{\"es_index\":\"danap\", \"nomad_job_name\":\"${NOMAD_JOB_NAME}\",\"nomad_group_name\":\"${NOMAD_GROUP_NAME}\",\"nomad_task_name\":\"${NOMAD_TASK_NAME}\",\"nomad_alloc_id\":\"${NOMAD_ALLOC_ID}\"}"
                                        }
                                    ],
                                    "type": "gelf"
                                }
                            ],
                            "port_map": [
                                {
                                    "unicorn": 3000.0
                                }
                            ]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": {
                            "RESQUE_REDIS_DB": "7",
                            "_VERSION": "2019-08-13 10:37:00.195750017 +0000 UTC m=+0.019961268",
                            "CACHE_REDIS_DB": "6",
                            "LOGSTASH_SYNC": "true",
                            "MYSQL_DATABASE": "ads_danap",
                            "RAILS_ENV": "staging"
                        },
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Leader": false,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "ads",
                        "Resources": {
                            "CPU": 1000,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 1000,
                            "Networks": [
                                {
                                    "CIDR": "",
                                    "Device": "",
                                    "DynamicPorts": [
                                        {
                                            "Label": "unicorn",
                                            "Value": 0
                                        }
                                    ],
                                    "IP": "",
                                    "MBits": 10,
                                    "ReservedPorts": null
                                }
                            ]
                        },
                        "Services": [
                            {
                                "AddressMode": "auto",
                                "CanaryTags": null,
                                "CheckRestart": null,
                                "Checks": [
                                    {
                                        "AddressMode": "",
                                        "Args": null,
                                        "CheckRestart": null,
                                        "Command": "",
                                        "GRPCService": "",
                                        "GRPCUseTLS": false,
                                        "Header": null,
                                        "Id": "",
                                        "InitialStatus": "",
                                        "Interval": 5000000000,
                                        "Method": "",
                                        "Name": "HTTP Health",
                                        "Path": "/ping",
                                        "PortLabel": "unicorn",
                                        "Protocol": "",
                                        "TLSSkipVerify": false,
                                        "Timeout": 3000000000,
                                        "Type": "http"
                                    }
                                ],
                                "Id": "",
                                "Name": "danap-stack-ads",
                                "PortLabel": "unicorn",
                                "Tags": [

                                ]
                            }
                        ],
                        "ShutdownDelay": 0,
                        "Templates": [
                            {
                                "ChangeMode": "restart",
                                "ChangeSignal": "",
                                "DestPath": "local/services.env",
                                "EmbeddedTmpl": "\n  {{ range service \"danap-stack-redis\" }}\n    CACHE_REDIS_HOST=\"{{ .Address }}\"\n    CACHE_REDIS_PORT=\"{{ .Port }}\"\n    RESQUE_REDIS_HOST=\"{{ .Address }}\"\n    RESQUE_REDIS_PORT=\"{{ .Port }}\"\n  {{ end }}\n\n  {{ range service \"danap-stack-mysql\" }}\n    MYSQL_HOST=\"{{ .Address }}\"\n    MYSQL_PORT=\"{{ .Port }}\"\n  {{ end }}\n  ",
                                "Envvars": true,
                                "LeftDelim": "{{",
                                "Perms": "0644",
                                "RightDelim": "}}",
                                "SourcePath": "",
                                "Splay": 5000000000,
                                "VaultGrace": 15000000000
                            }
                        ],
                        "User": "",
                        "Vault": {
                            "ChangeMode": "restart",
                            "ChangeSignal": "SIGHUP",
                            "Env": true,
                            "Policies": [
                                ""
                            ]
                        }
                    }
                ],
                "Update": null
            }
        ],
        "Type": "service",
        "Update": {
            "AutoRevert": false,
            "Canary": 0,
            "HealthCheck": "",
            "HealthyDeadline": 0,
            "MaxParallel": 0,
            "MinHealthyTime": 0,
            "ProgressDeadline": 0,
            "Stagger": 0
        },
        "VaultToken": "",
        "Version": 0
    }
}
```
elifish4 commented 5 years ago

@dadgar Can you assist please? :)

cgbaker commented 5 years ago

@elifish4 , I'm looking at this right now.

cgbaker commented 5 years ago

@danaps , we've found a race condition around this use case. The issue is that when Nomad registers the upstream service (redis, in your example), it initially comes up as healthy, then the checks are registered and the service is critical, then the checks start passing and the service is healthy again. The result is that the consul-template can render three times; each one of these will try to restart the downstream container (ads, in your example). Depending on how these (restart requests, template renders, and restart attempts) all line up, the race condition can result in the running container having an environment that is incorrect.

This is a bug, which will be addressed. I appreciate your help in tracking this one down. In the short term, increasing the splay parameter in the template reduces the likelihood of this happening (albeit by giving the container multiple opportunities to restart). Increasing the check interval for the upstream services can have a similar effect.
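As an illustration only (the values are examples, not recommendations, and the upstream service block is sketched rather than taken from the reporter's job), the mitigation would look roughly like this:

```hcl
# Downstream task: widen the splay window (the job above uses the 5s default).
template {
  destination = "local/services.env"
  env         = true
  change_mode = "restart"
  splay       = "30s"   # example value; spreads out the competing restarts
  data        = "..."   # the same range-service template as before
}

# Upstream job (e.g. redis): a longer check interval slows down the
# healthy -> critical -> healthy transitions that trigger extra renders.
service {
  name = "danap-stack-redis"
  port = "db"           # hypothetical port label

  check {
    type     = "tcp"    # hypothetical check; the reporter's redis check is not shown
    interval = "30s"    # example value
    timeout  = "2s"
  }
}
```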

I will leave this issue open and will post an update here before the fix ships.

elifish4 commented 5 years ago

thanks @cgbaker

danaps commented 5 years ago

@cgbaker, Thank you

jtrivino95 commented 4 years ago

The workaround we went with is to add a check in the container's entrypoint: if the env var is not there, the container exits. That way the container keeps being restarted until the template has been loaded correctly into the environment.

    if [[ -z $CELERY_BROKER_URL ]]; then
      exit 1;
    fi
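For completeness, here is a sketch of how such a check might be wired into the Nomad docker task itself; the image name, variable, and wrapped start command are hypothetical:

```hcl
task "worker" {
  driver = "docker"

  config {
    image = "example/worker:latest"   # hypothetical image

    # Override the entrypoint: exit immediately when the templated variable is
    # missing, so Nomad's restart policy keeps retrying until the environment
    # has been rendered correctly.
    entrypoint = [
      "/bin/sh", "-c",
      "[ -n \"$CELERY_BROKER_URL\" ] || exit 1; exec /app/start.sh"
    ]
  }
}
```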
lbachar commented 4 years ago

When are you going to merge the fix? @schmichael @cgbaker

NetanelMoshkovitz commented 4 years ago

@schmichael @cgbaker, any news about this issue?

lbachar commented 4 years ago

The bug still occurs on Nomad v0.11.4.

burdandrei commented 4 years ago

How is this not a critical bug? https://github.com/hashicorp/nomad/issues/3498 was addressed pretty fast.

Having a template re-rendered but the environment not updated is critical for anyone running 12-factor-style apps.

tgross commented 4 years ago

@burdandrei we have investigated this one at length and it's decidedly non-trivial, and we have to trade off resources between any given bug, support for enterprise customers, and other feature development. The workaround that @jtrivino95 provided above will handle this case, and in general it's a good practice for applications to assert their dependencies. As you might imagine, bugs with a reasonable workaround tend to fall behind those without.

Just to dump the results of some of our investigations here... the bug is reproducible, albeit not 100% of the time, with the following set of jobs:

postgres.nomad

```hcl
job "database" {
  datacenters = ["dc1"]

  group "database" {
    restart {
      attempts = 100
      delay    = "3s"
    }

    task "postgres" {
      driver = "docker"

      config {
        image = "postgres"

        port_map {
          db = 5432
        }
      }

      resources {
        cpu    = 500
        memory = 256

        network {
          mbits = 10
          port "db" {}
        }
      }

      service {
        name = "postgres"
        tags = ["global", "postgres"]
        port = "db"

        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```
nginx.nomad

```hcl
job "web" {
  datacenters = ["dc1"]

  group "web" {
    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:alpine"

        port_map {
          http = 80
        }
      }

      resources {
        cpu    = 500
        memory = 256

        network {
          mbits = 10
          port "http" {}
        }
      }

      template {
        change_mode = "restart"
        destination = "local/services.env"
        splay       = "0s"
        env         = true
        data        = <
# (the template body and the remainder of this job were truncated in the original comment)
```
redis.nomad

```hcl
job "cache" {
  datacenters = ["dc1"]

  group "cache" {
    restart {
      attempts = 100
      delay    = "3s"
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      resources {
        cpu    = 500
        memory = 256

        network {
          mbits = 10
          port "db" {}
        }
      }

      service {
        name = "redis"
        tags = ["global", "cache"]
        port = "db"

        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```
  1. Run Consul and Nomad
  2. Run postgres, redis and nginx jobs
  3. Monitor the environment in the nginx container: watch -n 0.1 'docker ps | grep nginx | awk "{print \$1}" | xargs -I_ docker exec _ env | grep UPSTREAM'
  4. Monitor the services in the nginx container: watch -n 0.1 'docker ps | grep nginx | awk "{print \$1}" | xargs -I_ docker exec _ cat /local/services.env'
  5. Kill either the postgres or redis containers.

At that point, the nginx allocation will restart without the killed upstream. After the killed task is restarted, it should show up in the rendered template.

  • Expected behavior: the associated environment variable shows up in the container process env.
  • Reproduced bug: the associated environment variable does not show up in the container process env.

The diagram below illustrates the window of the race:

[screenshot: timeline diagram showing the window of the race between Consul check state changes, template renders, and task restarts]

As you can see, it's not just a matter of atomically registering the Consul service and the health check at the same time (which would be #3935). And it's not quite as easy as merging some data structures as we did with #3498. But if this is an area that interests you, we'd love to hear your ideas! Thanks!

burdandrei commented 4 years ago

Thanks for such a comprehensive answer, @tgross! Sorry if it looked like I was trying to manage the Nomad team's resources; it's clear the team has a lot on its plate. To be honest, I never dug into the code to follow the template flow; I just assumed that after every re-render a restart is queued to ensure that all the environment variables are in place.

P.S. timeline graph is awesome!

jpatidar30 commented 3 months ago

Any updates on this bug? Is it fixed in newer versions? We are facing this issue on Nomad version 1.8.0.

jrasell commented 3 months ago

Hi @jpatidar30, there are no updates currently. When an engineer is assigned to this and working on it, updates will be provided in this issue.