Open crystalin opened 6 months ago
Hi @crystalin 👋
Thanks for the report!
Was the client draining when the shutdown happened? The status description "Desired Description = alloc is being migrated" may indicate so, but I wanted to double-check that with you.
This happened multiple times. Sometimes I had done a drain beforehand (and waited for it to finish before powering down), but most of the time there was no drain, simply a power cut.
Ah ok, thanks for the extra info.
And if you run docker ps, do you see the containers listed?
No, the process was not running. The way I see it is:
Thank you for the confirmation.
Under normal circumstances a Nomad client will attempt to reattach to any running process it spawned earlier and resume managing its lifecycle.
But I noticed that your Nomad agent is running as both client and server, which may have caused a situation where the recovery process did not complete successfully.
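To illustrate the reattach idea described above, here is a minimal sketch of the kind of liveness check a supervisor could run on restart. It is not Nomad's actual recovery code; `process_is_alive`, `recover`, and the `saved_tasks` mapping are hypothetical names used only for illustration, and the check is POSIX-only.

```python
import os


def process_is_alive(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        # PIDs <= 0 address process groups in os.kill, so reject them here.
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without delivering a signal
    except ProcessLookupError:
        return False  # no such process: nothing to reattach to
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def recover(saved_tasks):
    """Decide per task whether to reattach or mark it lost.

    saved_tasks is a hypothetical {task_name: pid} mapping persisted before
    the restart. A real supervisor must also verify process identity, because
    PIDs are reused and are meaningless after a reboot.
    """
    return {
        name: "reattach" if process_is_alive(pid) else "lost"
        for name, pid in saved_tasks.items()
    }
```

After a power cut every saved PID is stale, so a check like this should classify all previously running tasks as lost rather than running.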
The errors that read "API error (500): error while creating mount source path '/opt/nomad/data/alloc/ba67fcb1-8da8-073c-ccbb-16570a00e17b/alloc': mkdir /opt/nomad: read-only file system" are also a little weird.
Is /opt/nomad mounting an external volume, or misconfigured somehow?
This error happened because I had two Docker versions installed, and at reboot both were started. This might be what triggered the issue this time, by preventing reattachment to the allocation.
Oh that's interesting. I can see how connecting to a different Docker daemon could cause problems on reboot.
Would you be able to uninstall one of the versions and check if another reboot causes the problem again?
I'll try that. Do you know a good way to clean up the orphaned allocations? Right now I have to manually open the DB and delete the allocation bucket.
The issue is still happening. This time it wasn't without draining, just restarting the server with a shutdown -P.
Additionally, it also happens on the machine that is not running the Nomad server. (I restarted both, but I see a duplicate allocation for a service on the second client.)
What is the best way to power the machine down and back up without ending up with those?
Nomad version
Operating system and Environment details
Ubuntu 22
Issue
On a machine running both a server and a client: after a power reboot, some allocations still appear as running even though the process isn't running. The service doesn't start because it uses a static port (which is already reserved according to the Nomad allocation).
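One way to confirm that the port conflict exists only in Nomad's bookkeeping and not at the OS level is to try binding the port directly. The following is a diagnostic sketch, not part of Nomad; `port_is_free` is a hypothetical helper, and the port you pass in would be whatever static port the job declares.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE (something is bound there) or EACCES
            # (privileged port without sufficient rights).
            return False
        return True
```

If this returns True for the job's static port while the stale allocation still shows as running, no process is actually holding the port and the reservation is only recorded in Nomad's state.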
Trying to kill the allocation doesn't work:
nomad alloc stop --namespace default -no-shutdown-delay -verbose 22434531-6a4e-1103-37f4-0f302b2b2549
nomad alloc status --namespace default 22434531-6a4e-1103-37f4-0f302b2b2549
Services
nomad service list
Reproduction steps
Run many services on a combined server/client machine and power it off and on without a proper shutdown. Then run:
nomad alloc status --namespace default <alloc_id>
Expected Result
The allocation should get removed
Actual Result
The allocation stays forever, preventing the service from actually launching.
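Allocations in this state can be spotted by cross-checking what Nomad reports against what Docker is actually running. The sketch below assumes the allocation list is JSON with "ID" and "ClientStatus" fields (as returned by Nomad's /v1/allocations HTTP endpoint) and that container names come from docker ps --format '{{.Names}}'. It also assumes that the Docker driver's container names end with the allocation ID; verify that on your own cluster before relying on it. `find_stale_allocs` is a hypothetical helper.

```python
def find_stale_allocs(allocs, container_names):
    """Return IDs of allocations reported as running with no matching container.

    allocs: list of dicts with "ID" and "ClientStatus" keys.
    container_names: iterable of running container names.
    """
    names = list(container_names)
    stale = []
    for alloc in allocs:
        if alloc.get("ClientStatus") != "running":
            continue  # only allocations that claim to be running are of interest
        # ASSUMPTION: a container belonging to an allocation has a name that
        # ends with the allocation ID.
        if not any(name.endswith(alloc["ID"]) for name in names):
            stale.append(alloc["ID"])
    return stale


# Illustrative data only: the first ID is the one from this report, the
# second ID and the container name are placeholders.
sample_allocs = [
    {"ID": "22434531-6a4e-1103-37f4-0f302b2b2549", "ClientStatus": "running"},
    {"ID": "00000000-0000-0000-0000-000000000001", "ClientStatus": "running"},
]
sample_containers = ["web-00000000-0000-0000-0000-000000000001"]
stale_ids = find_stale_allocs(sample_allocs, sample_containers)
```

This only identifies candidates; it does not remove anything from the client state.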
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)