hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.78k stars 1.94k forks source link

Windows -> Raw exec = Broken with 1.8.2 #23668

Open guifran001 opened 1 month ago

guifran001 commented 1 month ago

Nomad version

Nomad v1.8.2 BuildDate 2024-07-16T08:50:09Z Revision 7f0822c1e4f25907d9f60e2d595411950dd1bd28

Operating system and Environment details

Windows driver: raw_exec

Issue

Jobs are stopping unexpectedly in nomad, but the processes are still alive in Windows, so when nomad tries to restart them, it fails because the port of application is already bound or a lock file is already lock.

It might not be 100% reproductible as we have an automated test that deploys a nomad instance and starts some jobs that never failed in 1.7.3, but now is failing since our update to 1.8.2. The test has been successful once with 1.8.2.

The automated test deploy consul 1.17.2 as both server and client, then nomad 1.8.2 as both server and client, then starts a postgres job and a keycloak job.

From the logs, we can see that postgres and keycloak start, then postgres stops and keycloak stops with Exit Code 2. Then they are stuck in a restart loop as the orignial postgres and keycloak processes are still executing.

From the log, we can see the Exit Code 2 that is usually reserved to file not found error. Could it be related to #23595 ?

Reproduction steps

80%. The test is using pulumi to run the jobs. Each job is run sequentially

Expected Result

postgres and keycloak run fine

Actual Result

postgres and keycloak are stuck in a restart loop while a detached process of postgres and one from keycloak started by nomad are still running while nomad try to start new ones.

Job file (if appropriate)

Will be provided if needed.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

POSTGRES Jobs is scheduled: 
2024-07-19T17:47:48.093Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 task=postgres type=Received msg="Task received by client" failed=false
2024-07-19T17:47:48.110Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 task=postgres type="Task Setup" msg="Building Task Directory" failed=false
2024-07-19T17:47:48.353Z [WARN]  client.alloc_runner.task_runner.task_hook.api: error creating task api socket: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 task=postgres path=c:\data\nomad-1.8.2\alloc\79de797f-3e61-54a6-ce48-2affc507d636\postgres\secrets\api.sock error="listen unix c:\data\nomad-1.8.2\alloc\79de797f-3e61-54a6-ce48-2affc507d636\postgres\secrets\api.sock: bind: invalid argument"
POSTGRES task starts
2024-07-19T17:47:48.385Z [INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:C:/install/00Postgres-v16.1.13-update-manifest-versions.1/pgsql/bin/postgres.exe Args:[-p 5432] OverrideCgroupV2: OverrideCgroupV1:map[] OOMScoreAdj:0}"
2024-07-19T17:47:48.597Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 task=postgres type=Started msg="Task started by client" failed=false

KEYCLOAK Jobs is scheduled
2024-07-19T17:49:04.930Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=compile-keycloak type=Received msg="Task received by client" failed=false
2024-07-19T17:49:04.931Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=keycloak type=Received msg="Task received by client" failed=false
2024-07-19T17:49:04.948Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=compile-keycloak type="Task Setup" msg="Building Task Directory" failed=false
2024-07-19T17:49:05.158Z [WARN]  client.alloc_runner.task_runner.task_hook.api: error creating task api socket: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=compile-keycloak path=c:\data\nomad-1.8.2\alloc\876ad764-d20b-600a-adcb-8f22186401f7\compile-keycloak\secrets\api.sock error="listen unix c:\data\nomad-1.8.2\alloc\876ad764-d20b-600a-adcb-8f22186401f7\compile-keycloak\secrets\api.sock: bind: invalid argument"
KEYCLOAK-COMPILE task starts (lifecycle {hook = "prestart"sidecar = false}
2024-07-19T17:49:05.177Z [INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:C:/install/openjdk-v21.0.2/bin/java.exe Args:[-Dkc.config.build-and-exit=true -Dprogram.name=keycloak -Xms64m -Xmx512m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.err.encoding=UTF-8 -Dstdout.encoding=UTF-8 -Dstderr.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:FlightRecorderOptions=stackdepth=512 --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.security=ALL-UNNAMED -Dkc.home.dir=C:/install/keycloak-v23.0.6 -Djboss.server.config.dir=c:\data\nomad-1.8.2\alloc\876ad764-d20b-600a-adcb-8f22186401f7\compile-keycloak\local/conf -Dkeycloak.theme.dir=C:/install/keycloak-v23.0.6/themes -Djava.util.logging.manager=org.jboss.logmanager.LogManager -Dquarkus-log-max-startup-records=10000 -Dquarkus.datasource.jdbc.acquisition-timeout=2000 -cp C:/install/keycloak-v23.0.6/lib/quarkus-run.jar;C:/install/keycloak-v23.0.6/lib/**/* io.quarkus.bootstrap.runner.QuarkusEntryPoint build --db=postgres] OverrideCgroupV2: OverrideCgroupV1:map[] OOMScoreAdj:0}"
2024-07-19T17:49:05.404Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=compile-keycloak type=Started msg="Task started by client" failed=false
KEYCLOAK-COMPILE task ends
2024-07-19T17:49:23.430Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=compile-keycloak type=Terminated msg="Exit Code: 0" failed=false
2024-07-19T17:49:23.438Z [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unknown desc = executor failed to find process: OpenProcess: The parameter is incorrect."
2024-07-19T17:49:23.464Z [INFO]  client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 driver=raw_exec task_name=compile-keycloak plugin=C:\install\nomad-1.8.2\bin\nomad.exe id=4688
2024-07-19T17:49:23.465Z [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=compile-keycloak reason="Restart unnecessary as task terminated successfully"
2024-07-19T17:49:23.479Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=keycloak type="Task Setup" msg="Building Task Directory" failed=false
2024-07-19T17:49:23.626Z [WARN]  client.alloc_runner.task_runner.task_hook.api: error creating task api socket: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=keycloak path=c:\data\nomad-1.8.2\alloc\876ad764-d20b-600a-adcb-8f22186401f7\keycloak\secrets\api.sock error="listen unix c:\data\nomad-1.8.2\alloc\876ad764-d20b-600a-adcb-8f22186401f7\keycloak\secrets\api.sock: bind: invalid argument"
KEYCLOAK task starts
2024-07-19T17:49:23.642Z [INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:C:/install/openjdk-v21.0.2/bin/java.exe Args:[-Dkc.config.built=true -Dprogram.name=keycloak -Xms64m -Xmx512m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.err.encoding=UTF-8 -Dstdout.encoding=UTF-8 -Dstderr.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:FlightRecorderOptions=stackdepth=512 --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.security=ALL-UNNAMED -Dkc.home.dir=C:/install/keycloak-v23.0.6 -Djboss.server.config.dir=c:\data\nomad-1.8.2\alloc\876ad764-d20b-600a-adcb-8f22186401f7\keycloak\local/conf -Dkeycloak.theme.dir=C:/install/keycloak-v23.0.6/themes -Djava.util.logging.manager=org.jboss.logmanager.LogManager -Dquarkus-log-max-startup-records=10000 -Dquarkus.datasource.jdbc.acquisition-timeout=2000 -cp C:/install/keycloak-v23.0.6/lib/quarkus-run.jar;C:/install/keycloak-v23.0.6/lib/**/* io.quarkus.bootstrap.runner.QuarkusEntryPoint start --optimized --db=postgres --db-url jdbc:postgresql://127.0.0.1:5432/keycloak?user=iac&password=xxx OverrideCgroupV2: OverrideCgroupV1:map[] OOMScoreAdj:0}"
2024-07-19T17:49:23.761Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=keycloak type=Started msg="Task started by client" failed=false

!!! POSTGRES stops with Exit code 0 ??? 
2024-07-19T17:50:51.551Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 task=postgres type=Terminated msg="Exit Code: 0, Exit Message: \"executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:50071->127.0.0.1:14000: wsarecv: An existing connection was forcibly closed by the remote host.\"" failed=false
2024-07-19T17:50:51.569Z [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial tcp 127.0.0.1:14000: connectex: No connection could be made because the target machine actively refused it.\""

!!! KEYCLOAK stops with Exit code 0 ???
2024-07-19T17:50:51.589Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=keycloak type=Terminated msg="Exit Code: 0, Exit Message: \"executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:50272->127.0.0.1:14001: wsarecv: An existing connection was forcibly closed by the remote host.\"" failed=false
2024-07-19T17:50:51.619Z [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial tcp 127.0.0.1:14001: connectex: No connection could be made because the target machine actively refused it.\""

!!! Executor de POSTGRES fails with Exit code 2 ??? (does 2 mean "The system cannot find the file specified"?)
2024-07-19T17:50:51.714Z [ERROR] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 driver=raw_exec task_name=postgres plugin=C:\install\nomad-1.8.2\bin\nomad.exe id=2184 error="exit status 2"
2024-07-19T17:50:51.714Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 task=postgres reason="Restart within policy" delay=18.647683748s
2024-07-19T17:50:51.715Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 task=postgres type=Restarting msg="Task restarting in 18.647683748s" failed=false

Executor de KEYCLOAK fails with Exit code 2 ??? (does 2 mean "The system cannot find the file specified"?)
2024-07-19T17:50:51.783Z [ERROR] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 driver=raw_exec task_name=keycloak plugin=C:\install\nomad-1.8.2\bin\nomad.exe id=1568 error="exit status 2"
2024-07-19T17:50:51.783Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=keycloak reason="Restart within policy" delay=34.485843068s
tgross commented 1 month ago

Hi @guifran001! I can't really figure out why your postgres process exited, but it looks like the shutdown of the executor is the issue here. When the compile-keycloak pre-start task finishes, we see the following logs:

2024-07-19T17:49:23.430Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=compile-keycloak type=Terminated msg="Exit Code: 0" failed=false 2024-07-19T17:49:23.438Z [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unknown desc = executor failed to find process: OpenProcess: The parameter is incorrect." 2024-07-19T17:49:23.464Z [INFO] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 driver=raw_exec task_name=compile-keycloak plugin=C:\install\nomad-1.8.2\bin\nomad.exe id=4688

This error is coming from executor.go#L612. To hit this code at all, the executor still has the process PID it's kept around in memory, but when we call os.FindProcess that hits this code in the Go stdlib, which is just a Windows syscall:

func findProcess(pid int) (p *Process, err error) {
    const da = syscall.STANDARD_RIGHTS_READ |
        syscall.PROCESS_QUERY_INFORMATION | syscall.SYNCHRONIZE
    h, e := syscall.OpenProcess(da, false, uint32(pid))
    if e != nil {
        return nil, NewSyscallError("OpenProcess", e)
    }
    return newProcess(pid, uintptr(h)), nil
}

Is there any possibility that your task daemonizes? That is, does it double fork and then closes the first forking process to divorce the child from the grandparent (the Nomad executor in this case)? If not, then I'm not sure what's up... you're running Nomad as a privileged user, right?

Could it be related to #23595 ?

That's an error trying to access Linux cgroups, so highly unlikely to be related.

guifran001 commented 1 month ago

@tgross : thank you for the reply.

Our test is not starting Nomad as a privileged user but only as the Network Service User.

We haven't taken the time to properly investigate yet. But we'll give it a try with a privileged user.

From OpenProcess function (processthreadsapi.h) doc:

If the specified process is the System Idle Process (0x00000000), the function fails and the last error code is ERROR_INVALID_PARAMETER. If the specified process is the System process or one of the Client Server Run-Time Subsystem (CSRSS) processes, this function fails and the last error code is ERROR_ACCESS_DENIED because their access restrictions prevent user-level code from opening them.

So it it more likely than the pid is 0. From the code, I cannot figure out how this would happen and what's different from version 1.7.x about that.

tgross commented 1 month ago

I've found that Windows returns ERROR_INVALID_PARAMETER at the drop of a hat, but that's a good catch. The added bit of context is the sometimes-unfortunate decision that Go made to have "zero values" for uninitialized fields, even if they're value-fields. And guess what the zero-value of an PID (integer) is? So it could be that we're somehow no longer capturing the PID from the child process and so when we try to shutdown the executor we're passing it that uninitialized field.

I don't have an answer yet, but that gives me a place to start looking at least.

PayStreamArmitsteadJ commented 1 month ago

@tgross

I have been experiencing a very similar issue to @guifran001 so I will see if my situation can help shed any light on this issue.

Nomad version Nomad v1.8.2 BuildDate 2024-07-16T08:50:09Z Revision https://github.com/hashicorp/nomad/commit/7f0822c1e4f25907d9f60e2d595411950dd1bd28

Operating system and Environment details

Windows driver: raw_exec

Issue

I run long lived Nomad jobs that can live for days at a time as they are service jobs. These jobs are designed to only have one instance running at once.

I updated my Nomad clients from v1.6.7 to v1.8.2. The upgrade involved draining the clients, upgrading nomad and then setting the client to eligible. This was done one client at a time. My Nomad servers are running v1.6.7 and haven't been upgraded for a while.

For the first 3 hours after the upgrade everything was running smoothly with no issues. Approximately 3 hours after upgrading my Nomad clients to v1.8.2 I noticed that all of our allocations on one of the Nomad clients had been restarted and this prompted me to investigate further. During my investigation I realised that the original allocations were still running and the restart had essentially duplicated all the allocations on that client. When viewing running processes in task manager it was obvious that each allocation now had two instances running. I checked the Nomad UI and only the newest instance of each allocation was being displayed. Here is what I think happened based on the logs:

  1. The executor failed due to not being able to make a HTTP call to a local HTTP server.
  2. The executor was then supposed to be destroyed but the destroy failed.
  3. This caused the allocations already running on the client to become "forgotten" and Nomad no longer knows about them.
  4. Since Nomad thinks no allocations are running it attempts to start (restart) all the allocations.
  5. The old allocations are still running as well as the new ones hence why there are duplicate allocations

Here is how I fixed it:

  1. I drained the problematic Nomad client via the Nomad UI to move all the new allocations to another Nomad client.
  2. Once the drain was complete we manually killed all the "forgotten" allocations directly on the Nomad client via Task Manager.
  3. We then set the client to be eligible again.

This fixed the problem for now but if the problem reoccurs it will cause me serious issues.

Nomad Client logs

Logs from the time of the upgrade:

2024-08-05T15:08:47.679+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=5391c4b9-73e7-6943-8459-620ce812c13f task=task type=Killed msg="Task successfully killed" failed=false
2024-08-05T15:08:47.692+0100 [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=5391c4b9-73e7-6943-8459-620ce812c13f task=task plugin=E:\nomad\nomad.exe id=17564
2024-08-05T15:08:47.695+0100 [INFO]  client.gc: marking allocation for GC: alloc_id=5391c4b9-73e7-6943-8459-620ce812c13f
2024-08-05T15:11:59.263+0100 [INFO]  agent: requesting shutdown
2024-08-05T15:11:59.263+0100 [INFO]  client: shutting down
2024-08-05T15:11:59.328+0100 [INFO]  client.plugin: shutting down plugin manager: plugin-type=device
2024-08-05T15:11:59.328+0100 [INFO]  client.plugin: plugin manager finished: plugin-type=device
2024-08-05T15:11:59.328+0100 [INFO]  client.plugin: shutting down plugin manager: plugin-type=driver
2024-08-05T15:11:59.348+0100 [INFO]  client.plugin: plugin manager finished: plugin-type=driver
2024-08-05T15:11:59.348+0100 [INFO]  client.plugin: shutting down plugin manager: plugin-type=csi
2024-08-05T15:11:59.348+0100 [INFO]  client.plugin: plugin manager finished: plugin-type=csi
2024-08-05T15:11:59.416+0100 [INFO]  agent: shutdown complete
2024-08-05T15:15:05.542+0100 [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=E:\nomad\data\plugins
2024-08-05T15:15:05.546+0100 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2024-08-05T15:15:05.546+0100 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
2024-08-05T15:15:05.546+0100 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2024-08-05T15:15:05.546+0100 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2024-08-05T15:15:05.546+0100 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2024-08-05T15:15:05.546+0100 [INFO]  client: using state directory: state_dir=E:\nomad\data\client
2024-08-05T15:15:05.552+0100 [INFO]  client: using alloc directory: alloc_dir=E:\nomad\data\alloc
2024-08-05T15:15:05.552+0100 [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=14000-14512
2024-08-05T15:15:05.649+0100 [INFO]  client.fingerprint_mgr.consul: consul agent is available: cluster=default
2024-08-05T15:15:05.714+0100 [WARN]  client.fingerprint_mgr.landlock: failed to fingerprint kernel landlock feature: error="landlock not supported on this platform"
2024-08-05T15:15:13.963+0100 [WARN]  client.fingerprint_mgr.network: failed to detect link speed: device="Loopback Pseudo-Interface 1" path=powershell.exe command="Get-NetAdapter -Name 'Loopback Pseudo-Interface 1' -ErrorAction Ignore | Select-Object -ExpandProperty 'Speed'" error="exit status 1"
2024-08-05T15:15:14.595+0100 [WARN]  client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: The system cannot find the path specified."
2024-08-05T15:15:14.626+0100 [INFO]  client.fingerprint_mgr.vault: Vault is available: cluster=default
2024-08-05T15:15:24.659+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=csi
2024-08-05T15:15:24.659+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=driver
2024-08-05T15:15:24.659+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=device
2024-08-05T15:15:24.660+0100 [ERROR] client.driver_mgr.docker: failed to list pause containers for recovery: driver=docker error="Get \"http://unix.sock/containers/json?filters=%7B%22label%22%3A%5B%22com.hashicorp.nomad.alloc_id%22%5D%7D\": open //./pipe/docker_engine: The system cannot find the file specified."
2024-08-05T15:15:24.673+0100 [WARN]  client.server_mgr: no servers available
2024-08-05T15:15:24.674+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=5391c4b9-73e7-6943-8459-620ce812c13f task=task type=Received msg="Task received by client" failed=false
2024-08-05T15:15:24.674+0100 [ERROR] client.driver_mgr.raw_exec: failed to reattach to executor: driver=raw_exec error="error creating rpc client for executor plugin: Reattachment process not found" task_id=5391c4b9-73e7-6943-8459-620ce812c13f/task/cae7b09d
2024-08-05T15:15:24.678+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a9d42f37-f5f8-590e-3df2-e6bb962b6b3b task=task type=Received msg="Task received by client" failed=false
2024-08-05T15:15:24.678+0100 [ERROR] client.driver_mgr.raw_exec: failed to reattach to executor: driver=raw_exec error="error creating rpc client for executor plugin: Reattachment process not found" task_id=a9d42f37-f5f8-590e-3df2-e6bb962b6b3b/task/68e876f1
2024-08-05T15:15:24.680+0100 [INFO]  client: started client: node_id=6bbe1a22-1440-c71b-7d86-0d445a719a7d
2024-08-05T15:15:24.680+0100 [WARN]  client.server_mgr: no servers available
2024-08-05T15:15:24.681+0100 [INFO]  client.gc: marking allocation for GC: alloc_id=a9d42f37-f5f8-590e-3df2-e6bb962b6b3b
2024-08-05T15:15:24.682+0100 [INFO]  client.gc: marking allocation for GC: alloc_id=5391c4b9-73e7-6943-8459-620ce812c13f
2024-08-05T15:15:24.688+0100 [INFO]  client.consul: discovered following servers: servers=[172.20.34.40:4647, 172.20.34.41:4647, 172.20.34.42:4647]
2024-08-05T15:15:24.827+0100 [INFO]  client: node registration complete
2024-08-05T15:15:31.083+0100 [INFO]  client: node registration complete
2024-08-05T15:18:09.714+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d29f972f-0875-3084-31e8-f86f1d8342dd task=task type=Received msg="Task received by client" failed=false
2024-08-05T15:18:09.740+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=8c94e150-617b-2203-f95f-16d3061993d6 task=task type=Received msg="Task received by client" failed=false
2024-08-05T15:18:09.788+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=6873d78b-2c91-e5b8-64e6-acfcc8e64dbc task=task type=Received msg="Task received by client" failed=false

Logs from the time that the problems started:

Normal operation
2024-08-05T16:45:12.892+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=325bb45a-2676-d696-0192-b415768a1a90 task=task type=Killing msg="Sent interrupt. Waiting 30s before force killing" failed=false
2024-08-05T16:50:12.973+0100 [INFO]  client.gc: garbage collecting allocation: alloc_id=58b1cf38-c1d0-2984-1fc6-5e40b90db20f reason="forced collection"
2024-08-05T16:50:12.975+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=58b1cf38-c1d0-2984-1fc6-5e40b90db20f task=task type=Killing msg="Sent interrupt. Waiting 30s before force killing" failed=false

The start of the problems
2024-08-05T18:24:05.038+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1065576c-4089-21b8-8214-05fe62f59e81 task=task type=Terminated msg="Exit Code: 0, Exit Message: \"executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:60955->127.0.0.1:14005: wsarecv: An existing connection was forcibly closed by the remote host.\"" failed=false
2024-08-05T18:24:05.041+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=7a3ba3b6-6513-184b-b8de-e6ae45d6def8 task=task type=Terminated msg="Exit Code: 0, Exit Message: \"executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:61608->127.0.0.1:14019: wsarecv: An existing connection was forcibly closed by the remote host.\"" failed=false
2024-08-05T18:24:05.593+0100 [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial tcp 127.0.0.1:14005: connectex: No connection could be made because the target machine actively refused it.\""
2024-08-05T18:24:05.593+0100 [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial tcp 127.0.0.1:14019: connectex: No connection could be made because the target machine actively refused it.\""
2024-08-05T18:24:05.736+0100 [ERROR] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=7a3ba3b6-6513-184b-b8de-e6ae45d6def8 driver=raw_exec task_name=task plugin=E:\nomad\nomad.exe id=34452 error="exit status 2"
2024-08-05T18:24:05.754+0100 [ERROR] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=1065576c-4089-21b8-8214-05fe62f59e81 driver=raw_exec task_name=task plugin=E:\nomad\nomad.exe id=16152 error="exit status 2"
2024-08-05T18:24:05.781+0100 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=1065576c-4089-21b8-8214-05fe62f59e81 task=task reason="Restart within policy" delay=1m4.76052843s
2024-08-05T18:24:05.781+0100 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=7a3ba3b6-6513-184b-b8de-e6ae45d6def8 task=task reason="Restart within policy" delay=1m0.536694183s
2024-08-05T18:24:05.801+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=7a3ba3b6-6513-184b-b8de-e6ae45d6def8 task=task type=Restarting msg="Task restarting in 1m0.536694183s" failed=false
2024-08-05T18:24:05.801+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1065576c-4089-21b8-8214-05fe62f59e81 task=task type=Restarting msg="Task restarting in 1m4.76052843s" failed=false
2024-08-05T18:24:09.828+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a0f5d27f-d704-978b-8897-457936a481a0 task=task type=Terminated msg="Exit Code: 0, Exit Message: \"executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:61741->127.0.0.1:14024: wsarecv: An existing connection was forcibly closed by the remote host.\"" failed=false
2024-08-05T18:24:09.903+0100 [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial tcp 127.0.0.1:14024: connectex: No connection could be made because the target machine actively refused it.\""
2024-08-05T18:24:10.607+0100 [ERROR] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=a0f5d27f-d704-978b-8897-457936a481a0 driver=raw_exec task_name=task plugin=E:\nomad\nomad.exe id=4772 error="exit status 2"
2024-08-05T18:24:10.613+0100 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=a0f5d27f-d704-978b-8897-457936a481a0 task=task reason="Restart within policy" delay=1m0.536694183s
tgross commented 1 month ago

Hi @PayStreamArmitsteadJ. What's you're seeing isn't identical, but it sure smells similar. I'm not sure where you're getting this idea from:

The executor failed due to not being able to make a HTTP call to a local HTTP server.

The executor isn't going to be making any HTTP requests or outbound connections of any kind. The executor is a go-plugin server. So the Nomad client agent spawns the executor process and then connects to that process via either a unix domain socket or a local TCP connection on Windows.

When the client restarts (ex. after an upgrade), it reads its on-disk state to find tasks that should still be running, and then re-connects to the TCP server it expects to find there. If it can't reconnect, you'll see an error like the one you saw here:

2024-08-05T15:15:24.674+0100 [ERROR] client.driver_mgr.raw_exec: failed to reattach to executor: driver=raw_exec error="error creating rpc client for executor plugin: Reattachment process not found" task_id=5391c4b9-73e7-6943-8459-620ce812c13f/task/cae7b09d

At that point, the client says "ok this task must've failed while I was offline". That's just something Nomad has to expect could happen, so it restarts the task (or not) according to the policy in the restart block.

But if the executor task is missing, there's no way for the client to have any idea that the task process is running or not. We're supposed to always kill the child processes when the executor dies, for any reason.

And then that's compounded by what you're seeing here:

2024-08-05T18:24:05.593+0100 [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial tcp 127.0.0.1:14005: connectex: No connection could be made because the target machine actively refused it.\""

In this case, it looks like the executor's TCP server is unavailable. I suspect in this case the executor process is gone already so we're just trying to send a message over a connection where nobody's listening anymore. Which is more evidence that the executor processes are getting killed unexpectedly.


A question for both @guifran001 and @PayStreamArmitsteadJ: when these task processes are left running, are their executor processes still running as well, or are the executor processes also gone?

PayStreamArmitsteadJ commented 1 month ago

@tgross

I have only experienced this issue once and it was affecting our production cluster, so I had to remedy the situation right away. Therefore, I do not know if the executor processes were still running. It seems like this was a one-off issue caused by the upgrade.

tgross commented 1 month ago

I have only experienced this issue once and it was affecting our production cluster, so I had to remedy the situation right away. Therefore, I do not know if the executor processes were still running. It seems like this was a one-off issue caused by the upgrade.

Ok, that's information in itself. Thanks.

ORuessel commented 3 weeks ago

Hello everybody, i have the same situation. We upgraded the nomad cluster in two steps from 1.5.2 to 1.7.3 to 1.8.2. In our situation we have "long live" running jobs over one day, At the end of the day, we shut down and suspend in AWS the client nodes. On the next morning sometimes we get the issue, that on one different node the process is still running and not fully shutdown by nomad.

Aug 21, '24 11:43:28 +0200  Killing Sent interrupt. Waiting 5s before force killing
Aug 21, '24 11:43:21 +0200  Restarting  Task restarting in 16s
Aug 21, '24 11:43:21 +0200  Terminated  Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:64431->127.0.0.1:14000: wsarecv: An existing connection was forcibly closed by the remote host."
Aug 21, '24 11:43:19 +0200  Started Task started by client
Aug 21, '24 11:43:01 +0200  Restarting  Task restarting in 18s
Aug 21, '24 11:43:01 +0200  Terminated  Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:64411->127.0.0.1:14000: wsarecv: An existing connection was forcibly closed by the remote host."
Aug 21, '24 11:43:00 +0200  Started Task started by client
Aug 21, '24 11:42:44 +0200  Restarting  Task restarting in 15s
Aug 21, '24 11:42:44 +0200  Terminated  Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:64400->127.0.0.1:14000: wsarecv: An existing connection was forcibly closed by the remote host."
Aug 21, '24 11:42:43 +0200  Started Task started by client

There was a old instance already running. After stop the process it was not possible to start the process in a normal way. only restart the hole machine was a temporary solution. The logfiles of nomad are crashed for today at this point i cannot provide at the moment and will monitor this situation and will deliver logs afterwards.

@tgross i have multiple times per week this issue, old instances on the servers are running with version 1.8.2. How can i check the executor process is still running ?