I'm seeing this happen on Linux as well. When a job with a Docker container is running on a host and Nomad is restarted, the Docker driver responds with something like:
driver start failed: Failed to start container 1acb6f0fff52a975b35b1c17a7318accf8a0d49d5eecedfe15e6028ec0492182: API error (500): driver failed programming external connectivity on endpoint e9fa01ca-2ae0-4fa9-9b6b-3406be8f1636_movienight_7f3f9520 (1c62edb47de592450de37d30b6ee288822f72c464fe123f0627a87d301750889): Bind for 10.178.85.207:31629 failed: port is already allocated
Once this occurs, the container must be manually stopped.
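Until there's a fix, a quick way to find the container that still holds the host port is to walk the Docker port bindings. This is only a sketch using the Python Docker SDK; the port number comes from the error above and would need to be adjusted for your case.
```python
# Sketch: locate (and optionally stop) the container still holding a host
# port after a Nomad restart. Assumes the Python Docker SDK ("pip install
# docker") and a known conflicting host port (31629 is taken from the
# error message above -- adjust as needed).
import docker

CONFLICTING_HOST_PORT = "31629"

client = docker.from_env()
for container in client.containers.list():
    bindings = container.attrs["HostConfig"]["PortBindings"] or {}
    for container_port, host_bindings in bindings.items():
        for binding in host_bindings or []:
            if binding.get("HostPort") == CONFLICTING_HOST_PORT:
                print(f"{container.name} ({container.short_id}) still binds "
                      f"{binding.get('HostIp') or '0.0.0.0'}:{CONFLICTING_HOST_PORT}")
                # container.stop()  # uncomment to stop the orphaned container
```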
Thanks for reporting, that does seem like a bug. Would you mind submitting a reproduction job spec and steps? Was it using Windows containers or Linux ones? Also, was this a clean stop of the Nomad agent followed by a restart, or an in-place upgrade of Nomad?
We are investigating the Windows repro, but have failed to reproduce it on Linux, @jsipprell.
Mahmood,
Thanks for reaching out. I’ll get a job spec out to you and document the reproduction steps.
-Matt Shooshtari.
Mahmood,
The reproduction steps are as follows (a rough script for these steps is sketched below):
1. Start the job using the job definition below.
2. Stop Nomad. I run it as a Windows service under NSSM, but you can also kill the Nomad agent process directly.
3. Start Nomad. Nomad tries to recover, but instead of re-attaching to the running container, it downloads the image again and starts a new instance.
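Here is a rough sketch of how these steps could be scripted. It assumes the service is registered with NSSM under the name "Nomad" and that the Docker CLI is on the PATH; adjust for your environment.
```python
# Rough sketch of the repro above: count running containers, restart the
# Nomad agent via NSSM, then count again. The service name "Nomad" and the
# presence of nssm.exe and docker.exe on the PATH are assumptions, not part
# of Nomad itself.
import subprocess
import time

def running_container_count():
    out = subprocess.run(["docker", "ps", "-q"],
                         capture_output=True, text=True, check=True)
    return len(out.stdout.split())

before = running_container_count()
print(f"containers running before restart: {before}")

subprocess.run(["nssm", "stop", "Nomad"], check=True)   # or kill nomad.exe directly
subprocess.run(["nssm", "start", "Nomad"], check=True)
time.sleep(30)  # give the client time to restore (or, in this bug, re-create) the task

after = running_container_count()
print(f"containers running after restart: {after}")
if after > before:
    print("a new container was started instead of re-attaching")
```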
-Matt
Here are the log entries.
| Time | Type | Description |
| --- | --- | --- |
| 02/18/19 10:34:53 | Not Restarting | Error was unrecoverable |
| 02/18/19 10:34:53 | Driver Failure | driver start failed: Failed to start container fd25bc67a8052fb483992e328f08db538805b337db7ddd91b659fd2f979b619b: API error (500): failed to create endpoint 9a9d4a94-845b-1bf1-402e-d5d3a2571b22_registry-task_fb9a1522 on network nat: hnsCall failed in Win32: The specified port already exists. (0x803b0013) |
| 02/18/19 10:34:47 | Driver | Downloading image |
| 02/18/19 10:28:57 | Started | Task started by client |
| 02/18/19 10:28:52 | Driver | Downloading image |
| 02/18/19 10:28:52 | Task Setup | Building Task Directory |
| 02/18/19 10:28:52 | Received | Task received by client |
Here is my system configuration.
"data_dir": "$installPath/agent/data",
"datacenter": "dc1",
"leave_on_interrupt": true,
"leave_on_terminate": true,
"server": [
{
"bootstrap_expect": 1,
"enabled": true
}
],
"client": [
{
"enabled": true
}
],
"acl": [
{
"enabled": "$aclEnabled"
}
],
"vault": [
{
"enabled": true,
"address": "http://$($serverIP):8200",
"token": "$vaultToken"
}
],
"consul": [
{
"server_auto_join": false,
"address": "$($serverIP):8500"
}
]
}```
Here is the JSON job definition.
```json
{
"Affinities": null,
"AllAtOnce": false,
"Constraints": null,
"CreateIndex": 8,
"Datacenters": [
"dc1"
],
"Dispatched": false,
"ID": "docker-registry-job",
"JobModifyIndex": 8,
"Meta": null,
"ModifyIndex": 65,
"Name": "docker-registry-job",
"Namespace": "default",
"ParameterizedJob": null,
"ParentID": "",
"Payload": null,
"Periodic": null,
"Priority": 50,
"Region": "global",
"Spreads": null,
"Stable": true,
"Status": "pending",
"StatusDescription": "",
"Stop": false,
"SubmitTime": 1550507328984446200,
"TaskGroups": [
{
"Affinities": null,
"Constraints": null,
"Count": 1,
"EphemeralDisk": {
"Migrate": false,
"SizeMB": 300,
"Sticky": false
},
"Meta": null,
"Migrate": {
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000
},
"Name": "registry-group",
"ReschedulePolicy": {
"Attempts": 0,
"Delay": 30000000000,
"DelayFunction": "exponential",
"Interval": 0,
"MaxDelay": 3600000000000,
"Unlimited": true
},
"RestartPolicy": {
"Attempts": 2,
"Delay": 15000000000,
"Interval": 1800000000000,
"Mode": "fail"
},
"Spreads": null,
"Tasks": [
{
"Affinities": null,
"Artifacts": null,
"Config": {
"port_map": [
{
"HTTP": 5000
}
],
"image": "stefanscherer/registry-windows:latest",
"mounts": [
{
"readonly": false,
"source": "c:/WatchGuardVideo/registry",
"target": "c:/registry",
"type": "bind"
}
]
},
"Constraints": null,
"DispatchPayload": null,
"Driver": "docker",
"Env": null,
"KillSignal": "",
"KillTimeout": 5000000000,
"Leader": false,
"LogConfig": {
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "registry-task",
"Resources": {
"CPU": 500,
"Devices": null,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 1000,
"Networks": [
{
"CIDR": "",
"Device": "",
"DynamicPorts": null,
"IP": "",
"MBits": 100,
"ReservedPorts": [
{
"Label": "HTTP",
"Value": 5000
}
]
}
]
},
"Services": [
{
"AddressMode": "auto",
"CanaryTags": null,
"Checks": [
{
"AddressMode": "",
"Args": null,
"CheckRestart": null,
"Command": "",
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 10000000000,
"Method": "",
"Name": "docker-registry-check",
"Path": "",
"PortLabel": "",
"Protocol": "",
"TLSSkipVerify": false,
"Timeout": 2000000000,
"Type": "tcp"
}
],
"Name": "docker-registry",
"PortLabel": "HTTP",
"Tags": null
}
],
"ShutdownDelay": 0,
"Templates": null,
"User": "",
"Vault": null
}
],
"Update": {
"AutoRevert": true,
"Canary": 0,
"HealthCheck": "checks",
"HealthyDeadline": 900000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000,
"ProgressDeadline": 1200000000000,
"Stagger": 30000000000
}
}
],
"Type": "service",
"Update": {
"AutoRevert": false,
"Canary": 0,
"HealthCheck": "",
"HealthyDeadline": 0,
"MaxParallel": 1,
"MinHealthyTime": 0,
"ProgressDeadline": 0,
"Stagger": 30000000000
},
"VaultToken": "",
"Version": 0
}
```
@cpumanaz Hey - I've been able to reproduce it fairly intermittently when our docker log collection process is being stopped.
I'd be curious to know how consistently it happens for you, and how you're killing the nomad process? (and if you have them, any logs)
I am killing the process with NSSM, though I'm not actually sure how NSSM stops it. I do have the flags set for a graceful stop, and the log indicates that is occurring. I believe you are correct about logging. I am able to reproduce this consistently. Here are my logs.
```
2019-02-19T08:59:05.731-0600 [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad.exe: opening fifo: alloc_id=38bea3f9-1da8-17f4-2af5-3e028721bdc2 task=registry-task @module=logmon path=//./pipe/registry-task-1b221f34.stderr timestamp=2019-02-19T08:59:05.731-0600
2019-02-19T08:59:06.593-0600 [INFO ] client.driver_mgr.docker: created container: driver=docker container_id=2d54be79db74c0e312af2e22a14393f78a774b2daf5751f8308f5dc64dccdb4a
2019-02-19T08:59:08.485-0600 [INFO ] client.driver_mgr.docker: started container: driver=docker container_id=2d54be79db74c0e312af2e22a14393f78a774b2daf5751f8308f5dc64dccdb4a
==> Caught signal: interrupt
==> Gracefully shutting down agent...
2019-02-19T08:59:53.051-0600 [INFO ] nomad: server starting leave
2019-02-19T08:59:53.051-0600 [INFO ] nomad: serf: EventMemberLeave: *************.global *************
2019-02-19T08:59:53.051-0600 [INFO ] nomad: removing server: server="*************.global (Addr: *************:4647) (DC: dc1)"
2019-02-19T08:59:53.051-0600 [ERROR] nomad: failed to reconcile member: member="{*************.global ************* 4648 map[build:0.9.0-beta2 id:d3caed5a-2a5a-e2f8-382e-7e6d4e4fb521 rpc_addr:************* bootstrap:1 region:global mvn:1 role:nomad dc:dc1 vsn:1 raft_vsn:2 port:4647] left 1 5 2 2 5 4}" error="No servers found"
2019-02-19T08:59:54.052-0600 [INFO ] agent: requesting shutdown
2019-02-19T08:59:54.052-0600 [INFO ] client: shutting down
2019-02-19T08:59:54.062-0600 [INFO ] client.plugin: shutting down plugin manager: plugin-type=device
2019-02-19T08:59:54.063-0600 [INFO ] client.plugin: plugin manager finished: plugin-type=device
2019-02-19T08:59:54.063-0600 [INFO ] client.plugin: shutting down plugin manager: plugin-type=driver
2019-02-19T08:59:54.135-0600 [INFO ] client.plugin: plugin manager finished: plugin-type=driver
2019-02-19T08:59:54.152-0600 [INFO ] nomad: shutting down server
2019-02-19T08:59:54.223-0600 [INFO ] agent: shutdown complete
==> Loaded configuration from C:\*************\Nomad\agent\agent.json
==> Starting Nomad agent...
==> Nomad agent configuration:
Advertise Addrs: HTTP: *************:4646; RPC: *************:4647; Serf: *************:4648
Bind Addrs: HTTP: 0.0.0.0:4646; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
Client: true
Log Level: INFO
Region: global (DC: dc1)
Server: true
Version: 0.9.0-beta2
==> Nomad agent started! Log data will stream in below:
2019-02-19T09:00:04.349-0600 [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=C:\*************\Nomad\agent\data\plugins
2019-02-19T09:00:04.350-0600 [INFO ] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2019-02-19T09:00:04.350-0600 [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2019-02-19T09:00:04.350-0600 [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2019-02-19T09:00:04.350-0600 [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2019-02-19T09:00:04.350-0600 [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0
2019-02-19T09:00:04.373-0600 [INFO ] nomad: raft: Initial configuration (index=1): [{Suffrage:Voter ID:*************:4647 Address:*************:4647}]
2019-02-19T09:00:04.373-0600 [INFO ] nomad: raft: Node at *************:4647 [Follower] entering Follower state (Leader: "")
2019-02-19T09:00:04.374-0600 [INFO ] nomad: serf: Ignoring previous leave in snapshot
2019-02-19T09:00:04.375-0600 [INFO ] nomad: serf: EventMemberJoin: *************.global *************
2019-02-19T09:00:04.375-0600 [INFO ] nomad: starting scheduling worker(s): num_workers=12 schedulers="[system service batch _core]"
2019-02-19T09:00:04.375-0600 [WARN ] nomad: serf: Failed to re-join any previously known node
2019-02-19T09:00:04.375-0600 [INFO ] nomad: adding server: server="*************.global (Addr: *************:4647) (DC: dc1)"
2019-02-19T09:00:04.375-0600 [INFO ] client: using state directory: state_dir=C:\*************\Nomad\agent\data\client
2019-02-19T09:00:04.376-0600 [INFO ] client: using alloc directory: alloc_dir=C:\*************\Nomad\agent\data\alloc
2019-02-19T09:00:04.413-0600 [INFO ] client.fingerprint_mgr.consul: consul agent is available
2019-02-19T09:00:05.980-0600 [WARN ] nomad: raft: Heartbeat timeout from "" reached, starting election
2019-02-19T09:00:05.980-0600 [INFO ] nomad: raft: Node at *************:4647 [Candidate] entering Candidate state in term 3
2019-02-19T09:00:05.997-0600 [INFO ] nomad: raft: Election won. Tally: 1
2019-02-19T09:00:05.997-0600 [INFO ] nomad: raft: Node at *************:4647 [Leader] entering Leader state
2019-02-19T09:00:05.997-0600 [INFO ] nomad: cluster leadership acquired
2019-02-19T09:00:06.957-0600 [INFO ] client.fingerprint_mgr.vault: Vault is available
2019-02-19T09:00:06.957-0600 [INFO ] client.plugin: starting plugin manager: plugin-type=driver
2019-02-19T09:00:06.957-0600 [INFO ] client.plugin: starting plugin manager: plugin-type=device
2019-02-19T09:00:06.960-0600 [ERROR] client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=38bea3f9-1da8-17f4-2af5-3e028721bdc2 task=registry-task error="failed to reattach to docker logger process" task_id=38bea3f9-1da8-17f4-2af5-3e028721bdc2/registry-task/fca8a0c7
2019-02-19T09:00:06.960-0600 [INFO ] client: started client: node_id=f97822b9-ae6a-5d07-957b-0f2fdcf94510
2019-02-19T09:00:06.970-0600 [INFO ] client: node registration complete
2019-02-19T09:00:07.749-0600 [INFO ] client.driver_mgr.docker: created container: driver=docker container_id=172e7207c0081fa9ea1784755c7e6ec5b75c1d802f2513b5d0dc17567cbdafbc
```
@cpumanaz Thanks for replying - It's good to know you're killing it with NSSM (and not the process dying unexpectedly). #5335 should cause it to recover correctly, and #5332 should handle the case where logging is entirely unrecoverable and stop us from leaking containers.
Just as a general design note: it would be preferable to avoid killing child processes of the Nomad client, as any logs emitted between killing the `nomad docker_logger` process and restarting the client will not make it into Nomad's logs.
@cpumanaz @jsipprell Current master should now be able to recover from missing `nomad docker_logger` processes correctly - I'll leave this issue open till we ship b3 so we can validate in your environments. Thanks for reporting this!
@dantoml I've updated to b3 in my environment and will re-run my test case. Thanks!
@dantoml Looks good in the 0.9.0 release.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
If you have a question, prepend your issue with [question] or preferably use the nomad mailing list. If filing a bug please include the following:
Nomad version
v0.9.0-beta2
Operating system and Environment details
Windows 10, Docker driver
Issue
Nomad does not re-attach to containers in the 0.9 beta as it does in 0.8.7.
Reproduction steps
Run a container image, then restart Nomad. Nomad will attempt to start another container, and port conflicts result.
Expected behavior: Nomad re-attaches to the running containers and shows their status as running.
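A quick way to check for the expected behavior after an agent restart (a sketch only, assuming the Nomad HTTP API is reachable on its default port and using the job ID from the spec above):
```python
# Sketch: after restarting the agent, the same allocation should still be
# reported as running (rather than a new one being created). Assumes the
# Nomad HTTP API on localhost:4646 and the "docker-registry-job" job ID
# from the job definition above.
import requests

NOMAD_ADDR = "http://localhost:4646"
JOB_ID = "docker-registry-job"

allocs = requests.get(f"{NOMAD_ADDR}/v1/job/{JOB_ID}/allocations").json()
running = [a for a in allocs if a["ClientStatus"] == "running"]
for alloc in running:
    print(alloc["ID"], alloc["TaskStates"]["registry-task"]["State"])
# Expected: exactly one running allocation, with the same ID before and
# after the restart, and no duplicate container in `docker ps`.
```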