johnnyplaydrums closed this issue 1 year ago
Hi @johnnyplaydrums! Thanks for opening this issue. I'm glad you got things sorted out by running `docker stop`. I suspect that `nomad alloc stop` might also have solved that for you.
I don't see anything in the CHANGELOG that looks like it might have fixed what you saw, but this is also the first time we've seen a report like this. Most of the interesting logs for this would have been at the debug level, except for any errors.
I looked through the `alloc status` output you provided, and I see the `Desired Status` and `Desired Description` both show that the client got a signal to shut down the allocation from the server, as we'd expect. But the `Task Events` show that the client thinks it sent the kill signal to the task. If it had actually sent it and the interrupt was unsuccessful, we'd have seen the "Terminated" event, like the following:
```
Recent Events:
Time                       Type        Description
2022-01-10T14:59:36-05:00  Killed      Task successfully killed
2022-01-10T14:59:36-05:00  Terminated  Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2022-01-10T14:59:31-05:00  Killing     Sent interrupt. Waiting 5s before force killing
```
The "Sent interrupt" event was emitted from lifecycle.go#L79
which suggests the task runner got hung up on the "wait channel" and that the driver may not have ever received the signal. If this was still running, we could take a goroutine dump to figure out the exact spot it was hung up. But it's more important that you got your cluster into correct shape so that's ok.
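(For anyone debugging something similar: with `enable_debug` set on the agent, Nomad exposes the standard Go pprof endpoints, so fetching `/debug/pprof/goroutine?debug=2` gives every goroutine's stack. Here's a minimal, self-contained sketch of what such a dump shows — a goroutine parked on a wait channel appears as a blocked `chan receive`. This is illustrative plain Go, not Nomad's code.)

```go
// Illustrative plain Go, not Nomad code: take a goroutine dump and
// observe a goroutine parked on a wait channel.
package main

import (
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	waitCh := make(chan struct{})

	// Simulate the hang described above: a goroutine blocked on a
	// "wait channel" that is never closed.
	go func() {
		<-waitCh
	}()

	time.Sleep(100 * time.Millisecond) // give the goroutine time to park

	// debug=2 prints a full stack trace for every goroutine; the one
	// above shows up with the state "chan receive".
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)
}
```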
It looks like we don't set a timeout context at the allocrunner level when we call for the task runner to `Kill` the task, but that's because we're expecting a different part of the code to handle the `kill_timeout`.
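To make the expected flow concrete, the kill path described by those task events looks roughly like the following. This is a hedged sketch of the general interrupt-then-force-kill pattern, not Nomad's actual task runner code; the `killWithTimeout` helper and its use of SIGINT are illustrative.

```go
// Sketch of the pattern behind "Sent interrupt. Waiting 5s before force
// killing" — not Nomad's implementation; names here are illustrative.
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func killWithTimeout(cmd *exec.Cmd, killTimeout time.Duration) error {
	waitCh := make(chan error, 1)
	go func() { waitCh <- cmd.Wait() }()

	// Polite signal first ("Sent interrupt").
	if err := cmd.Process.Signal(syscall.SIGINT); err != nil {
		return err
	}

	select {
	case err := <-waitCh:
		// Task exited within the timeout.
		return err
	case <-time.After(killTimeout):
		// Escalate ("force killing"); a force-killed container exits
		// with 137 (128 + SIGKILL), as in the events above.
		if err := cmd.Process.Kill(); err != nil {
			return err
		}
		return <-waitCh
	}
}

func main() {
	// `sleep` exits on SIGINT, so this demo takes the graceful path; a
	// process that traps SIGINT would exercise the force-kill branch.
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	fmt.Println(killWithTimeout(cmd, 5*time.Second))
}
```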
Also, https://github.com/hashicorp/nomad/commit/61a3b73d44a6e6c3b0c43af28c9745ed764fe940 landed in more recent versions of Nomad; it fixed some race conditions around that same code, though it wasn't explicitly designed to fix this kind of thing. I took a pass through the code and I don't see any way this can currently happen, but I'd like to keep this issue open until I'm more confident about that. I'll mark this for investigation and see what we can come up with.
Thanks for the quick response @tgross! I actually did run `alloc stop` via the CLI, and tried the same in the UI via the `Stop` button, and both reported successfully stopping the allocation, but the allocation was still running.
For example, in the UI, after clicking `Stop`, the UI transitioned to the `Allocation isn't running` message under `Resource Utilization` as expected. Interestingly, when I refreshed the page, I got a 404. But when I navigated back to that page via the client page, the page loaded successfully and the allocation was shown as still running. I noticed this throughout my debugging: if I refreshed the allocation page, or tried to load it in a new browser tab by pasting the URL, I would get a 404, but if I navigated to it from the client page, it would load and show the running allocation.
Similarly with the CLI: I don't have the logs handy, but it was the normal output from the `alloc stop` command, ending with something similar to `Evaluation status changed: "pending" -> "complete"`. But checking the alloc status showed it was still running. Maybe there is some hidden error when you try to stop an allocation for a job that doesn't exist? No errors were visible from the UI or CLI. The only way I was able to get resolution was via `docker stop` directly.
Let me know if I can be of assistance in any other way! Apologies I wasn't able to grab the debug-level logs that would have been useful here, but I'm happy to answer any questions. Cheers.
> I actually did run `alloc stop` via the CLI, and tried the same in the UI via the `Stop` button, and both reported successfully stopping the allocation, but the allocation was still running.
Oh, right, that actually makes sense given where the code seems to be hung up. The server has already told the client to stop, so telling it again won't do us any good.
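In other words, the desired state was already recorded as "stop"; repeating the request just rewrites the same value while the worker goroutine stays parked. Here's a contrived sketch of that failure mode (again illustrative Go, assuming nothing about Nomad's actual data structures):

```go
// Contrived sketch of why a repeated stop is a no-op: the desired state
// is already "stop", but the goroutine that would act on it is parked
// on a wait channel that never fires.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var desired atomic.Value
	desired.Store("run")

	waitCh := make(chan struct{}) // never closed: the task never reports exit

	go func() {
		<-waitCh // parked here forever, like the hang described above
		fmt.Println("task exited; acting on desired state:", desired.Load())
	}()

	desired.Store("stop") // first stop request
	desired.Store("stop") // second stop request: same value, no effect

	time.Sleep(500 * time.Millisecond)
	fmt.Println("desired =", desired.Load(), "but the task is still running")
}
```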
Hey folks, I just noticed this GitHub issue is still open. I vote we close it, given how old it is and that we haven't seen this type of issue recur.
Works for me
Nomad version
1.0.4
Operating system and Environment details
Ubuntu 18.04
Issue
There is a non-existent job that is running 2 allocations. When I try to stop the allocation, the UI and CLI confirm it is stopped, but when I check the allocation status or navigate back to the allocation page, it's still running with the same alloc ID. As you can see in this screenshot, there is no job name: [screenshot omitted]
Here is the `alloc status` output: [output omitted]

When I ask for the status on that job, it only shows me the status for another, similarly named job, because there is no job named just `pangea`. Similarly, when I try to stop this non-existent job, it matches on the similarly named job, so I have no way to stop and purge it.
Reproduction steps
Hard to say. We did have a job named `pangea` deployed many months ago, but I removed it a while ago via `nomad job stop -purge pangea`.

Expected Result
A non-existent job shouldn't be able to run allocations.
Actual Result
A non-existent job is running allocations. There are two allocations running that are associated with this job, `pangea`. When I try to stop the job so that all allocations are removed, it doesn't recognize the job name and only asks to stop a similarly named job. As I was filing this issue, I thought to try just running `docker stop` on the allocations directly. This indeed stopped the allocations, and the alloc status now shows the correct job status of `Task "pangea" is "dead"`. So that solved my immediate problem, and I am totally fine if you'd like to just close out this issue, but I thought it'd be valuable to file. I don't see any related logs from the client or server, except the client acknowledging the allocation is now marked for GC (below). Any other related client/server logs would likely have been from many months ago, and I don't have access to them anymore.