hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

docker: Client.Timeout exceeded while awaiting headers -> alloc failure #11178

Closed: apark-ocient closed this issue 8 months ago

apark-ocient commented 3 years ago

Versions

Nomad v1.1.4 (acd3d7889328ad1df2895eb714e2cbe3dd9c6d82). Observed separately on multiple client nodes.

Issue

When starting multiple Docker jobs in rapid succession (less than 2 seconds apart), the docker driver sometimes registers a timeout error for both jobs:

2021-09-13T20:07:40.207Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=REDACTED image_name=REDACTED image_id=REDACTED attempt=1 error="Post "http://unix.sock/containers/create?name=REDACTED": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2021-09-13T20:07:40.207Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=REDACTED image_name=REDACTED image_id=REDACTED attempt=1 error="Post "http://unix.sock/containers/create?name=REDACTED": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

Apparently, the containers were actually created:

    2021-09-13T20:07:40.441Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=REDACTED image_name=REDACTED image_id=REDACTED attempt=2 error="container already exists"
    2021-09-13T20:07:40.441Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=REDACTED image_name=REDACTED image_id=REDACTED attempt=2 error="container already exists"

The existing retry logic does not seem to handle this case: the same error repeats on every retry, eventually registers as an unrecoverable driver failure, and the allocation fails. docker ps -a shows that both containers were created but never started.

The expected behavior is that the existing container would be acknowledged and reused, rather than the allocation failing.
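
To illustrate, here is a minimal sketch of that expected handling, written against docker-py rather than Nomad's actual Go driver; the function, names, and retry count are illustrative only:

import docker
import docker.errors
import requests.exceptions

client = docker.from_env()

def create_or_reuse(image, name, command=None, attempts=5):
    """Create a container, or reuse one left behind by an earlier timed-out attempt."""
    for _ in range(attempts):
        try:
            return client.containers.create(image, command=command, name=name)
        except docker.errors.APIError as err:
            # 409 Conflict: a container with this name already exists, most likely
            # created by a previous attempt whose response timed out on the client side.
            if err.status_code != 409:
                raise
            # Acknowledge the existing container instead of failing the allocation.
            return client.containers.get(name)
        except requests.exceptions.Timeout:
            # The create request timed out client-side, but the daemon may still have
            # created the container; retry, and the 409 branch above will pick it up.
            continue
    raise RuntimeError(f'could not create or reuse container {name!r}')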

Job file (if appropriate)

Some internal details have been removed. This is from a Python script. test is an alphanumeric/underscore identifier.

{
    'Name': nomad_job_id,
    'ID': nomad_job_id,
    'Type': 'batch',
    'Priority': 20,
    'Datacenters': [NOMAD_DC],
    'TaskGroups': [{
        'Name': f'{str(test)}-taskgroup',
        'Tasks': [{
            'Name': f'{str(test)}-task',
            'Driver': 'docker',
            'Config': {
                'image': DOCKER_IMAGE,
                'command': 'REDACTED',
                'args': args,
                'mounts': [
                    {
                        'type': 'bind',
                        'target': REMOTE_REPO_PATH,
                        'source': 'local/xgsrc',
                    },
                ],
            },
            'Artifacts': [{
                'GetterSource': f'{REDACTED}',
                'RelativeDest': 'REDACTED',
            }],
            'Resources': {
                'Cores': cores,  # usually 1
                'MemoryMB': 10240,
            },
            # TODO: this doesn't seem to do anything today, but it also doesn't seem to break anything
            'EphemeralDisk': {
                'SizeMB': 30000,
            },
            'KillTimeout': 10 * 60 * 1_000_000_000,  # 10m in nanoseconds
            # Don't bother restarting tasks locally; it's better to reschedule the entire allocation elsewhere.
            'RestartPolicy': {
                'Attempts': 0,
                'Mode': 'fail',
            },
        }],
        'ReschedulePolicy': {
            'Attempts': 2,
            'Unlimited': False,
        },
    }],
}
lgfa29 commented 3 years ago

Hi @apark-ocient 👋

Thanks for the report. Would it be possible to share more of the log output? Do all further retries fail with container already exists?

apark-ocient commented 3 years ago

The container already exists message persists through attempt 6, at which point the attempt count reaches its cap and that alloc fails. I haven't tried allowing restarts, but I'll do so and send logs of both scenarios to the email in the issue template.
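
For reference, allowing restarts would mean replacing the task's 'RestartPolicy' above with something like the following (a sketch only; the attempt count and durations here are illustrative, with durations in nanoseconds as elsewhere in the payload):

restart_policy = {
    'Attempts': 2,                        # allow a couple of local restarts before failing
    'Interval': 10 * 60 * 1_000_000_000,  # 10m window in which attempts are counted
    'Delay': 15 * 1_000_000_000,          # 15s between restart attempts
    'Mode': 'fail',                       # give up (and let rescheduling take over) once attempts are exhausted
}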

When the allocation is rescheduled, it has a chance to succeed then (though the same issue can happen again).

lgfa29 commented 3 years ago

docker ps -a shows that both containers were created but not started.

Sorry, I missed this in your original message.

The creation loop only exits if the existing container is in the "Running" state, so maybe there was an error starting the container?
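
Very roughly, and as a simplified Python model rather than the actual Go driver source, the behavior I'm describing is:

from dataclasses import dataclass

@dataclass
class FakeContainer:
    name: str
    state: str  # "created", "running", ...

class AlreadyExists(Exception):
    pass

class FakeDocker:
    """Stand-in for a daemon where the first create succeeds server-side even though
    the client only sees a timeout (the situation in the logs above)."""
    def __init__(self):
        self.containers = {}

    def create(self, name):
        if name in self.containers:
            raise AlreadyExists(name)
        # The container is created server-side even though the client saw a timeout.
        self.containers[name] = FakeContainer(name, 'created')
        raise TimeoutError('Client.Timeout exceeded while awaiting headers')

    def inspect(self, name):
        return self.containers[name]

def create_container(d, name, max_attempts=6):
    for attempt in range(1, max_attempts + 1):
        try:
            d.create(name)
            return d.inspect(name)
        except TimeoutError:
            continue  # retry; the container may or may not exist now
        except AlreadyExists:
            existing = d.inspect(name)
            if existing.state == 'running':
                return existing  # the only exit path for a pre-existing container
            # Still "created": fall through and try to create again, which conflicts again.
    raise RuntimeError('failed to create container: container already exists')

# create_container(FakeDocker(), 'example') raises RuntimeError after six attempts,
# matching the attempt=1..6 sequence in the report.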

Did you happen to check the container logs to see if there was anything there?

lgfa29 commented 8 months ago

Hello from the future 👋

I'm cleaning up some stale issues and we haven't had any updates on this one for a while, so I'm going to close it for now. But let me know if this is still a problem and we can reopen it.

Thank you!

johnnyplaydrums commented 5 months ago

Hi @lgfa29! We've been experiencing this issue for a while and finally dug into the debug logs to see what's going on. The symptom we see is: we kick off a job (usually a batch job, though I'm not sure that's relevant), Nomad tries to create the container but times out waiting for Docker, then Nomad retries the container creation a few times but hits the container already exists error each time, and the job is marked as failed. We see this happen about once a day in our environment (thousands of batch jobs per day per Nomad client). As described in the original post, we see context deadline exceeded (Client.Timeout exceeded while awaiting headers) during container creation, but the container does in fact get created, so when Nomad tries again it hits error="container already exists" and the job fails. Here's the debug output showing this for a recent job:

2024-05-22T16:54:36.903Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference type=Received msg="Task received by client" failed=false
2024-05-22T16:54:36.903Z [DEBUG] client.alloc_runner.widmgr: retrieving and renewing workload identities: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 num_identities=0
2024-05-22T16:54:36.905Z [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference
2024-05-22T16:54:36.907Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference type="Task Setup" msg="Building Task Directory" failed=false
2024-05-22T16:54:36.913Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference path=/usr/local/bin/nomad args=["/usr/local/bin/nomad", "logmon"]
2024-05-22T16:54:36.913Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin started: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference path=/usr/local/bin/nomad pid=1837990
2024-05-22T16:54:36.913Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: waiting for RPC address: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference plugin=/usr/local/bin/nomad
2024-05-22T16:54:36.934Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: plugin address: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference @module=logmon address=/tmp/plugin4179122346 network=unix timestamp=2024-05-22T16:54:36.934Z
2024-05-22T16:54:36.934Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: using plugin: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference version=2
2024-05-22T16:54:36.935Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference @module=logmon path=/opt/nomad/data/alloc/21aa43f7-31cd-db42-8708-ff467a9f8c96/alloc/logs/.inference.stdout.fifo timestamp=2024-05-22T16:54:36.934Z
2024-05-22T16:54:36.935Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference @module=logmon path=/opt/nomad/data/alloc/21aa43f7-31cd-db42-8708-ff467a9f8c96/alloc/logs/.inference.stderr.fifo timestamp=2024-05-22T16:54:36.935Z
2024-05-22T16:54:36.960Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference type=Driver msg="Downloading image" failed=false
2024-05-22T16:54:59.213Z [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=inference binds="[]string{\"/opt/nomad/data/alloc/21aa43f7-31cd-db42-8708-ff467a9f8c96/alloc:/alloc\", \"/opt/nomad/data/alloc/21aa43f7-31cd-db42-8708-ff467a9f8c96/inference/local:/local\", \"/opt/nomad/data/alloc/21aa43f7-31cd-db42-8708-ff467a9f8c96/inference/secrets:/secrets\"}"
2024-05-22T16:54:59.213Z [DEBUG] client.driver_mgr.docker: applied labels on the container: driver=docker task_name=inference labels="map[com.hashicorp.nomad.alloc_id:21aa43f7-31cd-db42-8708-ff467a9f8c96 com.hashicorp.nomad.job_id:ultrasonic_flow/dispatch-1716396876-8795bc46 com.hashicorp.nomad.job_name:ultrasonic_flow/dispatch-1716396876-8795bc46 com.hashicorp.nomad.namespace:default com.hashicorp.nomad.node_id:df293b02-aeeb-b916-326a-d7d569a505a4 com.hashicorp.nomad.node_name:nomad-client-production-172-31-146-162 com.hashicorp.nomad.parent_job_id:ultrasonic_flow com.hashicorp.nomad.task_group_name:ultrasonic_flow com.hashicorp.nomad.task_name:inference]"
2024-05-22T16:54:59.213Z [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=inference container_name=inference-21aa43f7-31cd-db42-8708-ff467a9f8c96
2024-05-22T16:59:59.214Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=inference-21aa43f7-31cd-db42-8708-ff467a9f8c96 image_name=108202764860.dkr.ecr.us-east-1.amazonaws.com/ultrasonic_flow:f7cc1eeddb58e0d3e770cbbffc92f9ab743aac3f image_id=108202764860.dkr.ecr.us-east-1.amazonaws.com/ultrasonic_flow:f7cc1eeddb58e0d3e770cbbffc92f9ab743aac3f attempt=1 error="Post \"http://unix.sock/containers/create?name=inference-21aa43f7-31cd-db42-8708-ff467a9f8c96\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2024-05-22T16:59:59.316Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=inference-21aa43f7-31cd-db42-8708-ff467a9f8c96 image_name=108202764860.dkr.ecr.us-east-1.amazonaws.com/ultrasonic_flow:f7cc1eeddb58e0d3e770cbbffc92f9ab743aac3f image_id=108202764860.dkr.ecr.us-east-1.amazonaws.com/ultrasonic_flow:f7cc1eeddb58e0d3e770cbbffc92f9ab743aac3f attempt=2 error="container already exists"
2024-05-22T16:59:59.525Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=inference-21aa43f7-31cd-db42-8708-ff467a9f8c96 image_name=108202764860.dkr.ecr.us-east-1.amazonaws.com/ultrasonic_flow:f7cc1eeddb58e0d3e770cbbffc92f9ab743aac3f image_id=108202764860.dkr.ecr.us-east-1.amazonaws.com/ultrasonic_flow:f7cc1eeddb58e0d3e770cbbffc92f9ab743aac3f attempt=3 error="container already exists"
2024-05-22T16:59:59.933Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=inference-21aa43f7-31cd-db42-8708-ff467a9f8c96 image_name=108202764860.dkr.ecr.us-east-1.amazonaws.com/ultrasonic_flow:f7cc1eeddb58e0d3e770cbbffc92f9ab743aac3f image_id=108202764860.dkr.ecr.us-east-1.amazonaws.com/ultrasonic_flow:f7cc1eeddb58e0d3e770cbbffc92f9ab743aac3f attempt=4 error="container already exists"
2024-05-22T17:00:00.742Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=inference-21aa43f7-31cd-db42-8708-ff467a9f8c96 image_name=108202764860.dkr.ecr.us-east-1.amazonaws.com/ultrasonic_flow:f7cc1eeddb58e0d3e770cbbffc92f9ab743aac3f image_id=108202764860.dkr.ecr.us-east-1.amazonaws.com/ultrasonic_flow:f7cc1eeddb58e0d3e770cbbffc92f9ab743aac3f attempt=5 error="container already exists"
2024-05-22T17:00:02.350Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=inference-21aa43f7-31cd-db42-8708-ff467a9f8c96 image_name=108202764860.dkr.ecr.us-east-1.amazonaws.com/ultrasonic_flow:f7cc1eeddb58e0d3e770cbbffc92f9ab743aac3f image_id=108202764860.dkr.ecr.us-east-1.amazonaws.com/ultrasonic_flow:f7cc1eeddb58e0d3e770cbbffc92f9ab743aac3f attempt=6 error="container already exists"
2024-05-22T17:00:02.357Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference type="Driver Failure" msg="failed to create container: container already exists" failed=false
2024-05-22T17:00:02.414Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference error="failed to create container: container already exists"
2024-05-22T17:00:02.414Z [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference reason="Policy allows no restarts"
2024-05-22T17:00:02.414Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference type="Not Restarting" msg="Policy allows no restarts" failed=true
2024-05-22T17:00:06.624Z [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference @module=logmon timestamp=2024-05-22T17:00:06.624Z
2024-05-22T17:00:06.624Z [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference @module=logmon timestamp=2024-05-22T17:00:06.624Z
2024-05-22T17:00:06.624Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.stdio: received EOF, stopping recv loop: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference err="rpc error: code = Unavailable desc = error reading from server: EOF"
2024-05-22T17:00:06.626Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference plugin=/usr/local/bin/nomad id=1837990
2024-05-22T17:00:06.626Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference
2024-05-22T17:00:06.626Z [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference
2024-05-22T17:00:06.626Z [INFO]  client.gc: marking allocation for GC: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96
2024-05-22T17:10:40.840Z [INFO]  client.gc: garbage collecting allocation: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 reason="new allocations and over max (50)"
2024-05-22T17:10:40.842Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
2024-05-22T17:10:40.843Z [INFO]  client.gc: marking allocation for GC: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96
2024-05-22T17:10:40.846Z [DEBUG] client.gc: alloc garbage collected: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96

With so many allocations churning in our environment, Docker can sometimes get busy, which is why I suspect we hit the timeout. Is there anything we can do on our end to work around this (adjust the timeout, etc.)?

And to answer your question above:

The creation loop only exits if the existing container is in the "Running" state, so maybe there was an error starting the container?

The container never enters a Running state, because once Nomad hits the container already exists error the job fails. Rescheduling these jobs on another node could be a temporary fix, but it isn't an option for us in this case due to service-level requirements. If we've found a bug, I suspect it is in the code you linked.

Let me know if you'd like me to open a new issue for this instead. Thank you!

lgfa29 commented 5 months ago

Hi @johnnyplaydrums 👋

Thank you for the extra information. From a quick read, it sounds like Docker is able to create the container but fails to start it because of the overload on the daemon. Since the container never starts, Nomad doesn't consider the alloc to be running.

I would suggest you open a new issue; since I no longer work at HashiCorp, I'm not able to reopen this one.

johnnyplaydrums commented 5 months ago

@lgfa29 thanks for responding, and for all your help maintaining these incredible products <3 Good luck in your future endeavors!