hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

task runner blocked on task kill #11811

Closed · johnnyplaydrums closed this issue 1 year ago

johnnyplaydrums commented 2 years ago

Nomad version

1.0.4

Operating system and Environment details

Ubuntu 18.04

Issue

There is a non-existent job that is running 2 allocations. When I try to stop the allocation, the UI and CLI confirm it has been stopped, but when I check the allocation status or navigate back to the allocation page, it is still running with the same alloc ID. As you can see in the screenshot below, there is no job name:

[Screenshot from 2022-01-10: the Nomad UI showing the running allocation with no job name]

Here is the alloc status output:

$ nomad-sandbox alloc status 38e1a497-75ff-e4b1-1613-daed2b093535                                                                                                                    
ID                  = 38e1a497-75ff-e4b1-1613-daed2b093535
Eval ID             = 6eec45ce
Name                = pangea.pangea[2]
Node ID             = 28464462
Node Name           = nomad-client-sandbox-172-31-159-232
Job ID              = pangea
Job Version         = 15
Client Status       = running
Client Description  = Tasks are running
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created             = 5mo2d ago
Modified            = 5mo1d ago
Deployment ID       = 61a24f64
Deployment Health   = healthy

Allocation Addresses
Label     Dynamic  Address
~I removed the content in this section, lmk if it's needed for debugging~

Task "pangea" is "running"
Task Resources
CPU         Memory       Disk     Addresses
0/1000 MHz  0 B/5.0 GiB  300 MiB

Task Events:
Started At     = 2021-08-11T15:14:50Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-08-12T02:01:40-07:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-08-11T08:14:50-07:00  Started     Task started by client
2021-08-11T08:13:49-07:00  Driver      Downloading image
2021-08-11T08:13:49-07:00  Task Setup  Building Task Directory
2021-08-11T08:13:43-07:00  Received    Task received by client

When I ask for the status of that job, it only shows me the status of another, similarly named job, because there is no job named just pangea:

$ nomad-sandbox job status pangea
ID            = pangea-sandbox-basil
Name          = pangea-sandbox-basil
Submit Date   = 2022-01-06T13:29:41-08:00
Type          = service
Priority      = 50
Datacenters   = aws-sandbox
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group            Queued  Starting  Running  Failed  Complete  Lost
pangea-sandbox-basil  0       0         4        15      65        0

Latest Deployment
ID          = 68bb5219
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group            Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
pangea-sandbox-basil  true         true      4        1         4       4        0          2022-01-06T21:40:30Z

Allocations
ID        Node ID   Task Group            Version  Desired  Status   Created    Modified
3afeefa4  3cff9fbf  pangea-sandbox-basil  22       run      running  3d21h ago  3d21h ago
9e456655  28464462  pangea-sandbox-basil  22       run      running  3d21h ago  3d21h ago
b5f54ed8  28464462  pangea-sandbox-basil  22       run      running  3d21h ago  3d21h ago
77e7ef1f  014ceb36  pangea-sandbox-basil  22       run      running  3d21h ago  3d21h ago

Similarly, when I try to stop this non-existent job, it matches on a similarly named job, and so I have no way to stop and purge the job:

$ nomad-sandbox job stop -purge pangea
Are you sure you want to stop job "pangea-sandbox-basil"? [y/N] n
Cancelling job stop
$

Reproduction steps

Hard to say. We did have a job named pangea deployed many months ago, but I removed it a while back via nomad job stop -purge pangea.

Expected Result

A non-existent job shouldn't be able to run allocations.

Actual Result

A non-existent job is running allocations. There are two running allocations associated with this job pangea. When I try to stop the job so that all allocations are removed, the CLI doesn't recognize the job name and only asks to stop a similarly named job.

As I was filing this issue, I thought to try running docker stop on the allocations directly. That did stop the allocations, and alloc status now shows the correct status of Task "pangea" is "dead". So that solved my immediate problem, and I'm totally fine if you'd like to close out this issue, but I thought it would be valuable to file. I don't see any related logs from the client or server, except the client acknowledging that the allocations are now marked for GC (below). Any other related client/server logs would have been from many months ago, and I no longer have access to them.

2022/01/10 19:16:19.351331 [INFO] (runner) received finish
2022/01/10 19:16:19.351200 [INFO] (runner) stopping
2022-01-10T19:16:19.349Z [INFO] client.gc: marking allocation for GC: alloc_id=38e1a497-75ff-e4b1-1613-daed2b093535
2022/01/10 19:15:26.673305 [INFO] (runner) received finish
2022/01/10 19:15:26.673189 [INFO] (runner) stopping
2022-01-10T19:15:26.671Z [INFO] client.gc: marking allocation for GC: alloc_id=aa689e68-ed42-7499-d4a1-01a655598e6a
tgross commented 2 years ago

Hi @johnnyplaydrums! Thanks for opening this issue. I'm glad you got things sorted out by running docker stop. I suspect that nomad alloc stop might also have solved that for you.

I don't see anything in the CHANGELOG that looks like it might have fixed what you saw, but this is also the first time we've seen a report like this. Most of the interesting logs for this would have been at the debug level, except for any errors.

I looked through the alloc status output you provided, and the Desired Status and Desired Description both show that the client got a signal from the server to shut down the allocation, as we'd expect.

But the Task Events only show that the client thinks it sent the kill signal to the task. If the signal had actually reached the task, even unsuccessfully, we'd have seen a "Terminated" event like the following:

Recent Events:
Time                       Type        Description
2022-01-10T14:59:36-05:00  Killed      Task successfully killed
2022-01-10T14:59:36-05:00  Terminated  Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2022-01-10T14:59:31-05:00  Killing     Sent interrupt. Waiting 5s before force killing

The "Sent interrupt" event was emitted from lifecycle.go#L79 which suggests the task runner got hung up on the "wait channel" and that the driver may not have ever received the signal. If this was still running, we could take a goroutine dump to figure out the exact spot it was hung up. But it's more important that you got your cluster into correct shape so that's ok.

It looks like we don't set a timeout context at the allocrunner level when we call for the task runner to Kill the task, but that's because we're expecting a different part of the code to be handling the kill_timeout.
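
For comparison, a caller-side guard would look roughly like the sketch below. This is purely illustrative, not what the allocrunner does: killer is a made-up interface and the 30 seconds of slack is arbitrary. As noted above, Nomad instead relies on the task runner itself to enforce the kill_timeout.

package sketch

import (
	"context"
	"fmt"
	"time"
)

// killer is a hypothetical interface for whatever the allocrunner would call
// to kill a task; the real task runner API differs.
type killer interface {
	Kill(ctx context.Context) error
}

// killWithDeadline bounds the kill call with a context deadline so a hung
// lower layer can't block the caller forever: allow the configured
// kill_timeout plus some slack for the driver to report the exit, then give up.
func killWithDeadline(k killer, killTimeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), killTimeout+30*time.Second)
	defer cancel()
	if err := k.Kill(ctx); err != nil {
		return fmt.Errorf("killing task: %w", err)
	}
	return nil
}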

Also, https://github.com/hashicorp/nomad/commit/61a3b73d44a6e6c3b0c43af28c9745ed764fe940 landed in more recent versions of Nomad; it fixed some race conditions around that same code, but wasn't explicitly designed to fix this kind of thing. I took a pass through the code and I don't see any way this can currently happen, but I'd like to keep this issue open until I'm more confident about that. I'll mark this for investigation and see what we can come up with.

johnnyplaydrums commented 2 years ago

Thanks for the quick response @tgross! I actually did run alloc stop via the CLI, and tried the same in the UI via the Stop button, and both reported successfully stopping the allocation, but the allocation was still running.

For example in the UI, after clicking Stop, the UI transitioned to the "Allocation isn't running" message under Resource Utilization, as expected. Interestingly, when I refreshed the page, I got a 404. But when I navigated back to that page via the client page, the page loaded successfully and the allocation was shown as still running. I noticed throughout my debugging that if I refreshed the allocation page, or tried to load it in a new browser tab by pasting the URL, I would get a 404. But if I navigated to it from the client page, it would load and show the running allocation.

Similarly with the CLI, I don't have the logs handy but it was the normal output from the alloc stop command, ending with something similar to Evaluation status changed: "pending" -> "complete". But checking the alloc status showed it was still running. Maybe there is some hidden error when you try to stop an allocation for a job that doesn't exist? No errors were visible from the UI or CLI. The only way I was able to get resolution was via docker stop directly.

Let me know if I can be of assistance in any other way! Apologies I wasn't able to grab the DEBUG logs that would have been useful for your debugging, but I'm happy to answer any questions. Cheers.

tgross commented 2 years ago

I actually did run alloc stop via the CLI, and tried the same in the UI via the Stop button, and both reported successfully stopping the allocation, but the allocation was still running.

Oh, right, that actually makes sense given where the code seems to be hung up. The server has already told the client to stop, so telling it again won't do us any good.

johnnyplaydrums commented 1 year ago

Hey folks, I just noticed this GitHub issue is still open. I vote we close it, given how old it is and that we haven't seen this type of issue recur.

tgross commented 1 year ago

Works for me