Hi @matthiasschoger! The work done in #12324 should be agnostic to the method used to set the drain mode; as far as I know, it doesn't matter whether the API request comes from the CLI or the web UI, so long as the same options are used.
Can you provide more information about what's happening with the service jobs that aren't being drained?
Hi @tgross, thanks for the prompt reply. Actually, maybe I misused the web UI and this is more a documentation topic.
As you can see from my post, I checked the "Force Drain" toggle in both cases. Could that be the reason why my jobs get stuck during the drain (they use a CSI driver)? If so, it would be nice if the tooltip warned that checking "Force Drain" can result in stuck jobs when CSI drivers are in use.
Otherwise, I'd be happy to provide logs for the issue; it's quite easy to reproduce. The jobs (docker plugin) get stuck, and in my experience a reboot of the machine is the only way to get rid of them.
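For reference, a rough sketch of how the stuck allocations can be inspected before resorting to a reboot (the node and allocation IDs are placeholders):

```
# List the allocations on the affected node and look for ones stuck in a
# non-terminal state (node ID is a placeholder).
nomad node status <node-id>

# Inspect a stuck allocation for task events and driver errors
# (allocation ID is a placeholder).
nomad alloc status <alloc-id>

# On the client machine, confirm which containers Docker failed to stop.
docker ps --no-trunc
```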
Hi @matthiasschoger, unfortunately I'm having a little trouble following what the issue is here. Is the problem that you're seeing a difference between the UI and CLI (as initially reported), or is the problem that -force is forcing all your allocations to immediately stop regardless of type? Because if it's the latter, that's working as intended. I can certainly make that a little more clear in the drain docs though (done in https://github.com/hashicorp/nomad/pull/17703).
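For illustration, a minimal sketch of the two CLI invocations being contrasted here (the node ID is a placeholder):

```
# With -force, the migrate block is ignored and every allocation on the node
# is stopped immediately.
nomad node drain -enable -force -yes <node-id>

# Without -force, allocations are migrated according to their migrate block,
# with the default one-hour deadline as a hard cutoff; the deadline can also
# be set explicitly.
nomad node drain -enable -yes <node-id>
nomad node drain -enable -deadline 2h -yes <node-id>
```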
Meanwhile, I checked the behavior of the UI and it does look like there's a subtle difference between the API request bodies sent by the UI and the CLI, specifically in the Deadline field of the Node Drain API:

- From the CLI with -force, the deadline is -1ns (which means we ignore the migrate block and immediately stop all the containers).
- From the UI with Force Drain, the deadline is -1ns.
- From the CLI without -force, the deadline is 1h.
- From the UI without Force Drain and without a Deadline set, the deadline is 0s (no deadline!). I think that's intentional given that there's a deadline toggle, though.
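As a rough sketch of what those request bodies look like against the Node Drain endpoint (Deadline is in nanoseconds; the address and node ID are placeholders, any ACL token header is omitted, and the exact payloads the CLI and UI send may carry additional fields):

```
# -force / "Force Drain": Deadline of -1ns, so the migrate block is ignored
# and allocations stop immediately.
curl -X PUT "$NOMAD_ADDR/v1/node/<node-id>/drain" \
  -d '{"DrainSpec": {"Deadline": -1, "IgnoreSystemJobs": false}}'

# CLI without -force: Deadline of 1h (3600000000000 ns).
curl -X PUT "$NOMAD_ADDR/v1/node/<node-id>/drain" \
  -d '{"DrainSpec": {"Deadline": 3600000000000, "IgnoreSystemJobs": false}}'

# UI without Force Drain and with no deadline toggled on: Deadline of 0,
# i.e. no deadline at all.
curl -X PUT "$NOMAD_ADDR/v1/node/<node-id>/drain" \
  -d '{"DrainSpec": {"Deadline": 0, "IgnoreSystemJobs": false}}'
```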
Hi @tgross, it seems this is a documentation issue around the interaction of -force and CSI plugins shutting down before the jobs that use them.
Thank you for looking into it and resolving it quickly.
Nomad version
Output from nomad version: Nomad v1.5.6

Operating system and Environment details
Ubuntu Server 22.10 Linux compute2 5.19.0-45-generic #46-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 7 09:08:58 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
All the latest patches are installed.
Issue
This is a follow-up to the node drain changes made in #12324.
I'm currently running a 3-node cluster, with the tasks using the NFS CSI plugin to mount storage from my local NAS.
When draining a node from the UI with both "Force Drain" and "Drain System Jobs" enabled, some jobs consistently get stuck.
Issue #12324 seems to address this for the command line, but the problem still exists for a drain started from the UI.
One more observation: job migration works flawlessly when I migrate only the service jobs, not the system jobs.
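A rough CLI sketch of that working case, assuming -ignore-system is the equivalent of leaving "Drain System Jobs" unchecked (the node ID is a placeholder):

```
# Drain only service (and batch) allocations; system jobs such as the CSI
# node plugin keep running on the node while the other allocations migrate.
nomad node drain -enable -ignore-system -deadline 1h -yes <node-id>
```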
Reproduction steps
1. Run service jobs that mount volumes through the NFS CSI plugin.
2. Drain a node from the web UI with "Force Drain" and "Drain System Jobs" enabled.
Expected Result
The service jobs migrate away from the drained node.
Actual Result
Service jobs get stuck on the node. Docker fails to stop the containers.
Job file (if appropriate)
Happy to provide logs and job files, but I'm sure you have the relevant info from issue #12324.