I'm seeing this happen on Linux as well. When a job with a Docker container is running on a host and Nomad is restarted, the Docker driver responds with something like:
driver start failed: Failed to start container 1acb6f0fff52a975b35b1c17a7318accf8a0d49d5eecedfe15e6028ec0492182: API error (500): driver failed programming external connectivity on endpoint e9fa01ca-2ae0-4fa9-9b6b-3406be8f1636_movienight_7f3f9520 (1c62edb47de592450de37d30b6ee288822f72c464fe123f0627a87d301750889): Bind for 10.178.85.207:31629 failed: port is already allocated
Once this occurs, the container must be manually stopped.
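Until there's a fix, a quick way to find the container that still holds the host port is to walk the Docker port bindings. This is only a sketch using the Python Docker SDK; the port number comes from the error above and would need to be adjusted for your case.
```python
# Sketch: locate (and optionally stop) the container still holding a host
# port after a Nomad restart. Assumes the Python Docker SDK ("pip install
# docker") and a known conflicting host port (31629 is taken from the
# error message above -- adjust as needed).
import docker

CONFLICTING_HOST_PORT = "31629"

client = docker.from_env()
for container in client.containers.list():
    bindings = container.attrs["HostConfig"]["PortBindings"] or {}
    for container_port, host_bindings in bindings.items():
        for binding in host_bindings or []:
            if binding.get("HostPort") == CONFLICTING_HOST_PORT:
                print(f"{container.name} ({container.short_id}) still binds "
                      f"{binding.get('HostIp') or '0.0.0.0'}:{CONFLICTING_HOST_PORT}")
                # container.stop()  # uncomment to stop the orphaned container
```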
Thanks for reporting, that does seem like a bug. Would you mind submitting a reproduction job spec and steps? Was it using Windows containers or Linux ones? Also, was this a clean stop of the Nomad agent followed by a restart, or an in-place upgrade of Nomad?
We are investigating the Windows repro, but have failed to reproduce it on Linux, @jsipprell.
Mahmood,
Thanks for reaching out. I’ll get a job spec out to you and document the reproduction steps.
-Matt Shooshtari.
Mahmood,
The reproduction steps are as follows (a rough script for these steps is sketched below):
1. Start the job using the job definition below.
2. Stop Nomad. I run it as a Windows service under NSSM, but you can also kill the Nomad agent process directly.
3. Start Nomad. Nomad tries to recover, but instead of re-attaching to the running container, it downloads the image again and starts a new instance.
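Here is a rough sketch of how these steps could be scripted. It assumes the service is registered with NSSM under the name "Nomad" and that the Docker CLI is on the PATH; adjust for your environment.
```python
# Rough sketch of the repro above: count running containers, restart the
# Nomad agent via NSSM, then count again. The service name "Nomad" and the
# presence of nssm.exe and docker.exe on the PATH are assumptions, not part
# of Nomad itself.
import subprocess
import time

def running_container_count():
    out = subprocess.run(["docker", "ps", "-q"],
                         capture_output=True, text=True, check=True)
    return len(out.stdout.split())

before = running_container_count()
print(f"containers running before restart: {before}")

subprocess.run(["nssm", "stop", "Nomad"], check=True)   # or kill nomad.exe directly
subprocess.run(["nssm", "start", "Nomad"], check=True)
time.sleep(30)  # give the client time to restore (or, in this bug, re-create) the task

after = running_container_count()
print(f"containers running after restart: {after}")
if after > before:
    print("a new container was started instead of re-attaching")
```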
-Matt
Here are the log entries.
| Time | Type | Description |
| --- | --- | --- |
| 02/18/19 10:34:53 | Not Restarting | Error was unrecoverable |
| 02/18/19 10:34:53 | Driver Failure | driver start failed: Failed to start container fd25bc67a8052fb483992e328f08db538805b337db7ddd91b659fd2f979b619b: API error (500): failed to create endpoint 9a9d4a94-845b-1bf1-402e-d5d3a2571b22_registry-task_fb9a1522 on network nat: hnsCall failed in Win32: The specified port already exists. (0x803b0013) |
| 02/18/19 10:34:47 | Driver | Downloading image |
| 02/18/19 10:28:57 | Started | Task started by client |
| 02/18/19 10:28:52 | Driver | Downloading image |
| 02/18/19 10:28:52 | Task Setup | Building Task Directory |
| 02/18/19 10:28:52 | Received | Task received by client |
Here is my system configuration.
"data_dir": "$installPath/agent/data",
"datacenter": "dc1",
"leave_on_interrupt": true,
"leave_on_terminate": true,
"server": [
{
"bootstrap_expect": 1,
"enabled": true
}
],
"client": [
{
"enabled": true
}
],
"acl": [
{
"enabled": "$aclEnabled"
}
],
"vault": [
{
"enabled": true,
"address": "http://$($serverIP):8200",
"token": "$vaultToken"
}
],
"consul": [
{
"server_auto_join": false,
"address": "$($serverIP):8500"
}
]
}```
Here is the JSON job definition.
```json
{
"Affinities": null,
"AllAtOnce": false,
"Constraints": null,
"CreateIndex": 8,
"Datacenters": [
"dc1"
],
"Dispatched": false,
"ID": "docker-registry-job",
"JobModifyIndex": 8,
"Meta": null,
"ModifyIndex": 65,
"Name": "docker-registry-job",
"Namespace": "default",
"ParameterizedJob": null,
"ParentID": "",
"Payload": null,
"Periodic": null,
"Priority": 50,
"Region": "global",
"Spreads": null,
"Stable": true,
"Status": "pending",
"StatusDescription": "",
"Stop": false,
"SubmitTime": 1550507328984446200,
"TaskGroups": [
{
"Affinities": null,
"Constraints": null,
"Count": 1,
"EphemeralDisk": {
"Migrate": false,
"SizeMB": 300,
"Sticky": false
},
"Meta": null,
"Migrate": {
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000
},
"Name": "registry-group",
"ReschedulePolicy": {
"Attempts": 0,
"Delay": 30000000000,
"DelayFunction": "exponential",
"Interval": 0,
"MaxDelay": 3600000000000,
"Unlimited": true
},
"RestartPolicy": {
"Attempts": 2,
"Delay": 15000000000,
"Interval": 1800000000000,
"Mode": "fail"
},
"Spreads": null,
"Tasks": [
{
"Affinities": null,
"Artifacts": null,
"Config": {
"port_map": [
{
"HTTP": 5000
}
],
"image": "stefanscherer/registry-windows:latest",
"mounts": [
{
"readonly": false,
"source": "c:/WatchGuardVideo/registry",
"target": "c:/registry",
"type": "bind"
}
]
},
"Constraints": null,
"DispatchPayload": null,
"Driver": "docker",
"Env": null,
"KillSignal": "",
"KillTimeout": 5000000000,
"Leader": false,
"LogConfig": {
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "registry-task",
"Resources": {
"CPU": 500,
"Devices": null,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 1000,
"Networks": [
{
"CIDR": "",
"Device": "",
"DynamicPorts": null,
"IP": "",
"MBits": 100,
"ReservedPorts": [
{
"Label": "HTTP",
"Value": 5000
}
]
}
]
},
"Services": [
{
"AddressMode": "auto",
"CanaryTags": null,
"Checks": [
{
"AddressMode": "",
"Args": null,
"CheckRestart": null,
"Command": "",
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 10000000000,
"Method": "",
"Name": "docker-registry-check",
"Path": "",
"PortLabel": "",
"Protocol": "",
"TLSSkipVerify": false,
"Timeout": 2000000000,
"Type": "tcp"
}
],
"Name": "docker-registry",
"PortLabel": "HTTP",
"Tags": null
}
],
"ShutdownDelay": 0,
"Templates": null,
"User": "",
"Vault": null
}
],
"Update": {
"AutoRevert": true,
"Canary": 0,
"HealthCheck": "checks",
"HealthyDeadline": 900000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000,
"ProgressDeadline": 1200000000000,
"Stagger": 30000000000
}
}
],
"Type": "service",
"Update": {
"AutoRevert": false,
"Canary": 0,
"HealthCheck": "",
"HealthyDeadline": 0,
"MaxParallel": 1,
"MinHealthyTime": 0,
"ProgressDeadline": 0,
"Stagger": 30000000000
},
"VaultToken": "",
"Version": 0
}
```
@cpumanaz Hey - I've been able to reproduce it fairly intermittently when our docker log collection process is being stopped.
I'd be curious to know how consistently it happens for you, and how you're killing the nomad process? (and if you have them, any logs)
I am killing the process with NSSM, though I'm not actually sure how NSSM stops it. I do have the flags set for a graceful stop, and the log indicates that is occurring. I believe you are correct about logging. I am able to reproduce this consistently. Here are my logs.
```
2019-02-19T08:59:05.731-0600 [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad.exe: opening fifo: alloc_id=38bea3f9-1da8-17f4-2af5-3e028721bdc2 task=registry-task @module=logmon path=//./pipe/registry-task-1b221f34.stderr timestamp=2019-02-19T08:59:05.731-0600
2019-02-19T08:59:06.593-0600 [INFO ] client.driver_mgr.docker: created container: driver=docker container_id=2d54be79db74c0e312af2e22a14393f78a774b2daf5751f8308f5dc64dccdb4a
2019-02-19T08:59:08.485-0600 [INFO ] client.driver_mgr.docker: started container: driver=docker container_id=2d54be79db74c0e312af2e22a14393f78a774b2daf5751f8308f5dc64dccdb4a
==> Caught signal: interrupt
==> Gracefully shutting down agent...
2019-02-19T08:59:53.051-0600 [INFO ] nomad: server starting leave
2019-02-19T08:59:53.051-0600 [INFO ] nomad: serf: EventMemberLeave: *************.global *************
2019-02-19T08:59:53.051-0600 [INFO ] nomad: removing server: server="*************.global (Addr: *************:4647) (DC: dc1)"
2019-02-19T08:59:53.051-0600 [ERROR] nomad: failed to reconcile member: member="{*************.global ************* 4648 map[build:0.9.0-beta2 id:d3caed5a-2a5a-e2f8-382e-7e6d4e4fb521 rpc_addr:************* bootstrap:1 region:global mvn:1 role:nomad dc:dc1 vsn:1 raft_vsn:2 port:4647] left 1 5 2 2 5 4}" error="No servers found"
2019-02-19T08:59:54.052-0600 [INFO ] agent: requesting shutdown
2019-02-19T08:59:54.052-0600 [INFO ] client: shutting down
2019-02-19T08:59:54.062-0600 [INFO ] client.plugin: shutting down plugin manager: plugin-type=device
2019-02-19T08:59:54.063-0600 [INFO ] client.plugin: plugin manager finished: plugin-type=device
2019-02-19T08:59:54.063-0600 [INFO ] client.plugin: shutting down plugin manager: plugin-type=driver
2019-02-19T08:59:54.135-0600 [INFO ] client.plugin: plugin manager finished: plugin-type=driver
2019-02-19T08:59:54.152-0600 [INFO ] nomad: shutting down server
2019-02-19T08:59:54.223-0600 [INFO ] agent: shutdown complete
==> Loaded configuration from C:\*************\Nomad\agent\agent.json
==> Starting Nomad agent...
==> Nomad agent configuration:
Advertise Addrs: HTTP: *************:4646; RPC: *************:4647; Serf: *************:4648
Bind Addrs: HTTP: 0.0.0.0:4646; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
Client: true
Log Level: INFO
Region: global (DC: dc1)
Server: true
Version: 0.9.0-beta2
==> Nomad agent started! Log data will stream in below:
2019-02-19T09:00:04.349-0600 [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=C:\*************\Nomad\agent\data\plugins
2019-02-19T09:00:04.350-0600 [INFO ] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2019-02-19T09:00:04.350-0600 [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2019-02-19T09:00:04.350-0600 [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2019-02-19T09:00:04.350-0600 [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2019-02-19T09:00:04.350-0600 [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0
2019-02-19T09:00:04.373-0600 [INFO ] nomad: raft: Initial configuration (index=1): [{Suffrage:Voter ID:*************:4647 Address:*************:4647}]
2019-02-19T09:00:04.373-0600 [INFO ] nomad: raft: Node at *************:4647 [Follower] entering Follower state (Leader: "")
2019-02-19T09:00:04.374-0600 [INFO ] nomad: serf: Ignoring previous leave in snapshot
2019-02-19T09:00:04.375-0600 [INFO ] nomad: serf: EventMemberJoin: *************.global *************
2019-02-19T09:00:04.375-0600 [INFO ] nomad: starting scheduling worker(s): num_workers=12 schedulers="[system service batch _core]"
2019-02-19T09:00:04.375-0600 [WARN ] nomad: serf: Failed to re-join any previously known node
2019-02-19T09:00:04.375-0600 [INFO ] nomad: adding server: server="*************.global (Addr: *************:4647) (DC: dc1)"
2019-02-19T09:00:04.375-0600 [INFO ] client: using state directory: state_dir=C:\*************\Nomad\agent\data\client
2019-02-19T09:00:04.376-0600 [INFO ] client: using alloc directory: alloc_dir=C:\*************\Nomad\agent\data\alloc
2019-02-19T09:00:04.413-0600 [INFO ] client.fingerprint_mgr.consul: consul agent is available
2019-02-19T09:00:05.980-0600 [WARN ] nomad: raft: Heartbeat timeout from "" reached, starting election
2019-02-19T09:00:05.980-0600 [INFO ] nomad: raft: Node at *************:4647 [Candidate] entering Candidate state in term 3
2019-02-19T09:00:05.997-0600 [INFO ] nomad: raft: Election won. Tally: 1
2019-02-19T09:00:05.997-0600 [INFO ] nomad: raft: Node at *************:4647 [Leader] entering Leader state
2019-02-19T09:00:05.997-0600 [INFO ] nomad: cluster leadership acquired
2019-02-19T09:00:06.957-0600 [INFO ] client.fingerprint_mgr.vault: Vault is available
2019-02-19T09:00:06.957-0600 [INFO ] client.plugin: starting plugin manager: plugin-type=driver
2019-02-19T09:00:06.957-0600 [INFO ] client.plugin: starting plugin manager: plugin-type=device
2019-02-19T09:00:06.960-0600 [ERROR] client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=38bea3f9-1da8-17f4-2af5-3e028721bdc2 task=registry-task error="failed to reattach to docker logger process" task_id=38bea3f9-1da8-17f4-2af5-3e028721bdc2/registry-task/fca8a0c7
2019-02-19T09:00:06.960-0600 [INFO ] client: started client: node_id=f97822b9-ae6a-5d07-957b-0f2fdcf94510
2019-02-19T09:00:06.970-0600 [INFO ] client: node registration complete
2019-02-19T09:00:07.749-0600 [INFO ] client.driver_mgr.docker: created container: driver=docker container_id=172e7207c0081fa9ea1784755c7e6ec5b75c1d802f2513b5d0dc17567cbdafbc
```
@cpumanaz Thanks for replying - It's good to know you're killing it with NSSM (and not the process dying unexpectedly). #5335 should cause it to recover correctly, and #5332 should handle the case where logging is entirely unrecoverable and stop us from leaking containers.
Just as a general design note: it would be preferable to avoid killing child processes of the Nomad client, as any logs emitted between killing the `nomad docker_logger` process and restarting the client will not make it into Nomad's logs.
@cpumanaz @jsipprell Current master should now be able to recover from missing `nomad docker_logger` processes correctly - I'll leave this issue open till we ship b3 so we can validate in your environments. Thanks for reporting this!
@dantoml I've updated to b3 in my environment and will re-run my test case. Thanks!
@dantoml Looks good in the 0.9.0 release.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
If you have a question, prepend your issue with [question] or preferably use the nomad mailing list. If filing a bug please include the following:
Nomad version
v0.9.0-beta2
Operating system and Environment details
Windows 10, Docker driver
Issue
Nomad does not re-attach to containers in the 0.9 beta as it does in 0.8.7.
Reproduction steps
Run a container image, then restart Nomad. Nomad will attempt to start another container, and port conflicts result.
Expected behavior: Nomad re-attaches to the running containers and shows their status as running.
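A quick way to check for the expected behavior after an agent restart (a sketch only, assuming the Nomad HTTP API is reachable on its default port and using the job ID from the spec above):
```python
# Sketch: after restarting the agent, the same allocation should still be
# reported as running (rather than a new one being created). Assumes the
# Nomad HTTP API on localhost:4646 and the "docker-registry-job" job ID
# from the job definition above.
import requests

NOMAD_ADDR = "http://localhost:4646"
JOB_ID = "docker-registry-job"

allocs = requests.get(f"{NOMAD_ADDR}/v1/job/{JOB_ID}/allocations").json()
running = [a for a in allocs if a["ClientStatus"] == "running"]
for alloc in running:
    print(alloc["ID"], alloc["TaskStates"]["registry-task"]["State"])
# Expected: exactly one running allocation, with the same ID before and
# after the restart, and no duplicate container in `docker ps`.
```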