process not managed by Nomad killed on restart

yigit-erkoca commented 2 months ago

Nomad version

Nomad v1.7.7 BuildDate 2024-04-16T19:26:43Z Revision 0f34c85ee63f6472bd2db1e2487611f4b176c70c

Operating system and Environment details

Windows 10 Enterprise 2016 LTSB

Issue

Nomad instance is killing another process on the system that is not deployed by nomad.

I see in the application event viewer log the following log from source "Process Exit Monitor" 2024-09-16 23:20:06 The process 'C:\simulator\BoardSimulator.exe' was terminated by the process 'C:\Program Files\Infra\Nomad\nomad.exe' with termination code 1. The creation time for the exiting process was 0x01db07ac4767577e.

Reproduction steps

On system reboot, some applications like simulator executables start by other windows services, as well as nomad instance. This is a sporadic issue, does not happen at every reboot, but maybe estimated to be once in every ~100.

Expected Result

Nomad not killing/interfering any other process on the system that is not deployed by using nomad jobs.

Actual Result

Nomad is terminating other processes that are running in the system.

Job file (if appropriate)

There is only one job using raw_exec. We are just setting it up. NomadJob.txt

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

nomad.log The timestamp of the process termination according to the eventlog: 2024-09-16 23:20:06

jrasell commented 2 months ago

Hi @yigit-erkoca and thanks for raising this issue. Nomad should certainly not be terminating processes it does not control, however, having looked at the client logs nothing stands out initially. I'll mark this for further investigation and add it to our backlog. If you do get any additional information that might be useful, please add it to this issue.

yigit-erkoca commented 2 months ago

Hi @yigit-erkoca and thanks for raising this issue. Nomad should certainly not be terminating processes it does not control, however, having looked at the client logs nothing stands out initially. I'll mark this for further investigation and add it to our backlog. If you do get any additional information that might be useful, please add it to this issue.

Hello @jrasell, thank you for your attention! Indeed I'm extending the logs to find out the root cause. I have a setup that I can reproduce this issue quite easily. I'll add more details to the issue and let you know. I appreciate the help.

tgross commented 1 month ago

It looks like the relevant portion of the logs is the following, which suggests that the spuriously killed process is getting killed while we're trying to restore the process for the previously-running task.

2024-09-16T23:20:06.267+0200 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=69909b8d-3d22-e823-e33b-ff669048d994 task=infra-cluster-package-server-task type=Received msg="Task received by client" failed=false 2024-09-16T23:20:06.268+0200 [ERROR] client.driver_mgr.raw_exec: failed to reattach to executor: driver=raw_exec error="error creating rpc client for executor plugin: Reattachment process not found" task_id=69909b8d-3d22-e823-e33b-ff669048d994/infra-cluster-package-server-task/15a41d28 2024-09-16T23:20:06.268+0200 [ERROR] client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=69909b8d-3d22-e823-e33b-ff669048d994 task=infra-cluster-package-server-task error="failed to reattach to executor: error creating rpc client for executor plugin: Reattachment process not found" task_id=69909b8d-3d22-e823-e33b-ff669048d994/infra-cluster-package-server-task/15a41d28 2024-09-16T23:20:06.268+0200 [TRACE] client.alloc_runner.task_runner: failed to reattach to task; will not run until server is contacted: alloc_id=69909b8d-3d22-e823-e33b-ff669048d994 task=infra-cluster-package-server-task

What we see here is that the client starts up, looks in its local state, and tries to restore handles to the tasks that were running before the host shutdown (ref task_runner.go#L1290-L1321). Of course, none of them are going to be running, so that recovery will fail as expected because we fail to reattach to the executor process (ref driver.go#L288-L297).

We then call DestroyTask which should be returning ErrTaskNotFound (ref driver.go#L472) because the task handle cannot exist in the driver's state at this point. We won't see that error in the logs because we expect it so we explicitly ignore it (ref task_runner.go#L1310-L1312).

While we have saved the PID of the previously-running task in the local state, I don't see any code path where this PID is actually being made available for the driver to kill.

@yigit-erkoca you say you have a reliable reproduction, right? Can you confirm that the PID of the process getting incorrectly killed the same as the PID of the previously-running task or its executor process?

tgross commented 1 month ago

I did a little more testing and what's particularly weird about all of this is that the executor process is the one that actually sends the kill signal to the child process (or the Windows equivalent which is TerminateProcess). So if the executor process is gone -- which it must be because we rebooted -- then there's nothing to send a kill signal to another child process. It would definitely be helpful to know whether the PID we're trying to kill is the executor or previously-running task.

yigit-erkoca commented 1 month ago

@tgross, thank you for looking into this!

We have indeed a quite reliable reproduction environment. We have reproduced it again today to provide you more information with an example.

Nomad is installed as windows service, and starts normally, starts all the executors, tasks, etc.

All the executors listed below; please note the highlighted instance (which will be the one we will be interested in)

After restarting/rebooting nomad will try to restore the tasks, and one of these occasions if there is any other process with the same PID above "5652" in this case notepad.exe gets the "5652".

Nomad during the restore action, it will kill the notepad.exe. Which can be seen in the windows event viewer below.

I hope this clarifies your question. From this, it looks like the PID is the executor's PID, not the task's PID.

If you need any further information, or things that we can try, please let us we can use our reproduction environment.

tgross commented 1 month ago

Thanks @yigit-erkoca, that's super helpful! That cuts the size of the problem in half and points me towards the go-plugin interface. I'll investigate further and report back.

tgross commented 1 month ago

I did a little bit of parallel investigation on Linux, and using killsnoop I verified that we never send a kill signal to the executor process PID after host restart (or any other PID, for that matter). So that further narrows this down to some quirk of Windows-specific behavior. That's a little weird, because this code should be mostly shared between them... the differences are in the executors themselves, which are not running at the time this all happens.

tgross commented 1 month ago

Looks like I found the culprit, and it totally can happen on other platforms. It's this line in go-plugin: cmd_reattach.go#L31.

We have a PID for the executor and try to reattach
os.FindProcess finds a process for that PID, just not the one we want
We can't connect to it because it's not an executor
go-plugin says, "oops, executor must be broken, let's kill it"

I'm having a chat with colleagues about the best approach to fix this. The obvious fix is to simply delete that line, but I assume there's a reason it exists. :grin: (Edit: this is an extremely old behavior https://github.com/hashicorp/go-plugin/pull/4 which makes it even trickier)

tgross commented 1 month ago

I've got the go-plugin bug fix in https://github.com/hashicorp/go-plugin/pull/320. It may take a little bit of discussion to get this landed, as go-plugin is widely used across our products. I suspect it won't get into 1.9.1 which we're going to ship as early as feasible next week, but it'll certainly be in the next release if not.

yigit-erkoca commented 1 month ago

@tgross thank you very much for the finding and the fix! As soon as we get it released we will also test it. We will build a nomad with this fix and let you know.

tgross commented 1 month ago

Dependency update is here: https://github.com/hashicorp/nomad/pull/24265

peter-harmann-tfs commented 2 weeks ago

@tgross Hello. We are running the fixed version, but it seems our processes are still being killed, this time likely from here.

We see "plugin failed to exit gracefully" in nomad logs just as our process ends unexpectedly. We are not able to reliably reproduce the issue at the moment (seems to occur under different circumstances than the original).

tgross commented 2 weeks ago

@peter-harmann-tfs just to confirm, you mean an unrelated process is getting killed? That is, one not managed by Nomad?

peter-harmann-tfs commented 2 weeks ago

@tgross Yes, it is an unrelated process.

Maybe to clarify, I am colleague of yigit-erkoca, so I am talking about the same deployment as the original issue.

tgross commented 2 weeks ago

Ok, thanks. I'll reopen and get this on the queue for another in-depth investigation.

cenk-saldirim commented 1 week ago

Hi, I am colleague of yigit-erkoca and peter-harmann-tfs.

We conducted a more detailed investigation regarding this issue, in order to provide steps to reproduce it. When we correlated the time we identified a non-nomad process being killed with the timestamp in the Nomad logs, we observed the following log messages in sequence: 'plugin failed to exit gracefully' and 'reattached plugin process exited'.

We extended the logs in the reattach() and Kill() functions in the client.go file. This allowed us to log the reattach PID and reattach address information for the process that is being terminated.

The issue is finally reproduced, and observed that it occurs when the following sequence of events takes place:

Nomad has reattach PID and reattach Address information for the executor and try to reattach
The executor process has exited and a different process has spawned with the same PID. Additionally the reattach address is occupied by a different process (It's a rare scenario, but it can happen)
The reattach address is reachable, but no communication can be established (because the process using that reattach address is not an executor)
go-plugin decides to kill the process(which is not managed by nomad) using reattach PID information

I hope the information above is sufficient to identify the root cause of the problem. If more details are needed, we would be happy to provide them.

tgross commented 1 week ago

Thanks @cenk-saldirim. That scenario is exactly what https://github.com/hashicorp/go-plugin/pull/320 was supposed to have fixed, which is particularly strange. Also, just a heads up that my colleague @mismithhisler is taking over the investigation from our end.

cenk-saldirim commented 1 week ago

Thanks for the information @tgross. I believe, the only difference between previous issue and the current one is the reattach address of the executor process which has exited is actively being used by another process. If that reattach address:port is not actively listening, go-plugin does not attempt to kill any process.

tgross commented 1 week ago

Ah! Sorry, I completely missed that... that complicates things quite a bit!

mismithhisler commented 1 week ago

Hi @cenk-saldirim, thanks for all the information. Do you have any more logs you could attach here showing this most recent issue? The "plugin failed to exit gracefully" is helpful, but we might be able to narrow this search with more.

Also, can you confirm the clients that are having this issue are only running raw_exec tasks? And not running any external task drivers (i.e. Podman etc)?

mismithhisler commented 1 week ago

@cenk-saldirim Just following up to my previous comment to let you know we can reproduce this and are working on a fix. Thanks again for the info you provided.

cenk-saldirim commented 1 week ago

Hi @mismithhisler, thank you very much for your finding and glad to hear that you can reproduce this. I can share the logs we collected while observing this behavior, if still needed.

mismithhisler commented 1 week ago

I think we should be good on logs for now, thank you though!

hashicorp / nomad