hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad is unable to start task driver "Reattachment process not found" #5891

Closed jeromegn closed 4 years ago

jeromegn commented 5 years ago

Nomad version

Nomad v0.9.3 (c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce)

Operating system and Environment details

Ubuntu 18.04.2 LTS

Issue

We've created a custom task driver, and under certain conditions (unclear, but likely after an unclean shutdown) Nomad is unable to start it.

I confirmed the task driver process is indeed not running.

I browsed the state.db and found:

{"plugin_state":"ReattachConfigsfirecrackerAddr/tmp/plugin505002494NetworkunixPid8EProtocolgrpc"}

Is there any way to just restart the task driver when this happens? When I cleared the state.db, nomad started our task driver just fine.
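
The fallback asked about here can be sketched roughly as follows. This is a minimal illustration in Go using only the standard library, not Nomad's or go-plugin's actual code: `ReattachInfo`, `pidAlive`, `connectOrRelaunch`, and the `launch` callback are hypothetical names introduced for this example, and the liveness check assumes a Unix-like system. The idea is to check whether the PID stored in the reattach config still exists, and to discard the stale config and start a fresh plugin process if it does not.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// ReattachInfo is a hypothetical, simplified stand-in for the reattach data
// kept in the client state (address, PID, protocol).
type ReattachInfo struct {
	Addr     string
	Pid      int
	Protocol string
}

// pidAlive reports whether a process with the given PID exists.
// Assumption: Unix-like OS, where signal 0 probes for existence only.
func pidAlive(pid int) bool {
	if pid <= 0 {
		return false // never probe process groups or "all processes"
	}
	proc, err := os.FindProcess(pid)
	if err != nil {
		return false
	}
	err = proc.Signal(syscall.Signal(0))
	// EPERM means the process exists but belongs to another user.
	return err == nil || errors.Is(err, syscall.EPERM)
}

// connectOrRelaunch reuses the stored reattach info when its process is still
// running. Otherwise it ignores the stale info and calls launch to start a
// fresh plugin. The boolean result is true when a relaunch happened.
func connectOrRelaunch(stored *ReattachInfo, launch func() (ReattachInfo, error)) (ReattachInfo, bool, error) {
	if stored != nil && pidAlive(stored.Pid) {
		return *stored, false, nil
	}
	fresh, err := launch()
	if err != nil {
		return ReattachInfo{}, false, fmt.Errorf("launching fresh plugin: %w", err)
	}
	return fresh, true, nil
}
```

A caller would pass the info read from its state store plus a function that starts the plugin binary, and persist the returned info whenever the boolean is true. Note that a live PID does not prove the process is still the plugin (PIDs can be reused), so a real implementation would likely also verify the socket address before trusting the stored config.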

Reproduction steps

Not entirely sure.

Nomad Client logs (if appropriate)

<startup messages>
2019-06-26T18:04:42.616Z [ERROR] client.driver_mgr: dispensing initial plugin failed: driver=firecracker error="failed to start plugin: failed to get plugin info for plugin: plugin is shut down"
2019-06-26T18:05:31.612Z [WARN ] client.plugin: timeout waiting for plugin manager to be ready: plugin-type=driver
2019-06-26T18:05:31.614Z [ERROR] client.alloc_runner.task_runner: failed to create driver: alloc_id=03708d7d-df1a-8447-f430-ab76d1185f6e task=app error="failed to start plugin: Reattachment process not found"
2019-06-26T18:05:31.614Z [ERROR] client: error running alloc: error="failed creating runner for task "app": failed to start plugin: Reattachment process not found" alloc_id=03708d7d-df1a-8447-f430-ab76d1185f6e
2019-06-26T18:05:31.614Z [ERROR] client.alloc_runner.task_runner: failed to create driver: alloc_id=1a37257d-3e8b-1457-1300-3c66e3753110 task=app error="failed to start plugin: Reattachment process not found"
2019-06-26T18:05:31.614Z [ERROR] client: error running alloc: error="failed creating runner for task "app": failed to start plugin: Reattachment process not found" alloc_id=1a37257d-3e8b-1457-1300-3c66e3753110

^ this is nomad trying to reconcile the state for all allocations it knows about.

endocrimes commented 5 years ago

@jeromegn Hmm, thanks for the report. In most cases we should be able to restart plugins automatically, and at first glance I'm not sure why we couldn't here.

Could you possibly send a full log for the client startup to nomad-oss-debug@hashicorp.com, preferably at debug level or lower? It'll make it much easier for us to be able to track this down.

Thanks!

jeromegn commented 5 years ago

Sounds good. Next time this happens, I'll set the log level to debug and send you the logs.

stale[bot] commented 5 years ago

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem :+1:

nickethier commented 4 years ago

@jeromebaude I reopened this issue per your message to the mailing list. Are you still running version 0.9.3 or have you since upgraded?

jeromegn commented 4 years ago

No, we’ve been upgrading. We’re on 0.10.2 right now, but it’s been happening with every version we've tried.



nickethier commented 4 years ago

Thanks, I'm going to try to replicate this. Have you learned of any other conditions that seem to trigger it, or anything else that would help narrow down the reproduction case? I'm assuming this driver is not publicly available.

jeromegn commented 4 years ago

It appears pretty random. It doesn't seem to happen on clients with particularly many or few allocs.

The driver is private right now and I'm reluctant to open it in case I mis-committed something at some point! I'll send an email to that oss address with a gist of the source of the most important files.

jeromegn commented 4 years ago

I've found a broken nomad instance and sent the logs to the same address. I also sent good logs for comparison.

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.