hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

nomad loses processes created by 0.8.x when upgraded to 0.9.3 with the exec driver #5848

Closed tantra35 closed 5 years ago

tantra35 commented 5 years ago

Nomad version

Nomad v0.9.3 (e0fce4603bdc0e42578e5ede3b1fbafe6499d9bb+CHANGES)

Issue

When upgrading from version 0.8.x to 0.9.3, the Nomad agent loses processes created by the exec driver, with the following logs on the agent side:

2019-06-18T15:31:31.430+0300 [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:/bin/fluent-bit Args:[-c /local/td-agent-bit.conf]}"
2019-06-18T15:31:31.517+0300 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=bb185597-a22e-51bb-3d01-2f4822ec254e task=fluend error="failed to launch command with executor: rpc error: code = Unknown desc = container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/var/lib/nomad/alloc/bb185597-a22e-51bb-3d01-2f4822ec254e/fluend\\\" at \\\"/proc\\\" caused \\\"device or resource busy\\\"\"""
2019-06-18T15:31:31.517+0300 [INFO ] client.alloc_runner.task_runner: not restarting task: alloc_id=bb185597-a22e-51bb-3d01-2f4822ec254e task=fluend reason="Error was unrecoverable"
2019/06/18 15:31:31.550724 [INFO] (runner) stopping
2019/06/18 15:31:31.550746 [INFO] (runner) received finish
2019-06-18T15:31:31.631+0300 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=bb185597-a22e-51bb-3d01-2f4822ec254e task=diamondbcapacitycollector error="failed to launch command with executor: rpc error: code = Unknown desc = container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/var/lib/nomad/alloc/bb185597-a22e-51bb-3d01-2f4822ec254e/diamondbcapacitycollector\\\" at \\\"/proc\\\" caused \\\"device or resource busy\\\"\"""
2019-06-18T15:31:31.632+0300 [INFO ] client.alloc_runner.task_runner: not restarting task: alloc_id=bb185597-a22e-51bb-3d01-2f4822ec254e task=diamondbcapacitycollector reason="Error was unrecoverable"
2019-06-18T15:31:31.635+0300 [ERROR] client.alloc_runner.task_runner.task_hook.logmon.nomad: reading plugin stderr: alloc_id=bb185597-a22e-51bb-3d01-2f4822ec254e task=diamondbcapacitycollector error="read |0: file already closed"
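
The "device or resource busy" error seems to indicate that something is already mounted under the alloc directory; a quick way to check (the alloc path is taken from the log above):

# list mounts still present under the failing alloc directory
grep /var/lib/nomad/alloc/bb185597-a22e-51bb-3d01-2f4822ec254e /proc/mounts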

The orphaned processes also stay running, even when we stop the job.

Here is the output of ps auxf, showing the orphaned processes from the previous version (0.8.6 in our case):

root      3040  0.1  2.8 423988 28476 ?        Ssl  15:32   0:01 /opt/nomad/nomad_0.8.6-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/1f722dcd-fa5b-8e05-0aa8-38211da59897/diamondbcapacitycollector/executor.out","LogLevel":"INFO"}
root      3056  0.1  2.6 489524 26860 ?        Ssl  15:32   0:01 /opt/nomad/nomad_0.8.6-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/1f722dcd-fa5b-8e05-0aa8-38211da59897/fluend/executor.out","LogLevel":"INFO"}
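
A rough way to list the executors still running from the old binary (the 0.8.6 path below comes from the ps output above; adjust for your install):

# show only executor processes launched by the pre-upgrade binary
ps -eo pid,args | grep '[n]omad executor' | grep nomad_0.8.6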
preetapan commented 5 years ago

@tantra35 can you share more details on your job specs? We tested the upgrade/restore path with some manual and integration tests, but it's possible we missed some edge cases. Sharing more information will help us debug.

tantra35 commented 5 years ago

It seems that this happens because of our own solution to fix https://github.com/hashicorp/nomad/issues/2504 in 0.8.6 (in that discussion I mention what we did to work around remounting the root dev as read-only). So now I think this is not actually a bug in the stock version of Nomad.
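
For anyone who ends up in the same state, one rough way to clear the leftover mounts under the old alloc dirs before restarting the client is something like the following (a sketch only; verify each path before unmounting anything):

# unmount anything still mounted under the nomad alloc dirs, deepest paths first
awk '$2 ~ "^/var/lib/nomad/alloc/" {print $2}' /proc/mounts | sort -r | xargs -r -n 1 umount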

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.