hashicorp / nomad

raw_exec plugin windows: "logmon: Unrecognized remote plugin message" #11939

Open cattuz opened 2 years ago

cattuz commented 2 years ago

Nomad version

Nomad v1.2.2 (78b8c171a211f967a8b297a88a7e844b3543f2b0)

Operating system and Environment details

Windows Server 2016 on an Azure VM

Issue

I get the following message intermittently when using the raw_exec plugin on Windows:

logmon: Unrecognized remote plugin message: This usually means that the plugin is either invalid or simply needs to be recompiled to support the latest protocol.

The error happens multiple times in a row and only appears to be resolved when the job is switched to another VM. Other identical allocations can still start after one fails like this, so it seems to be something in the plugin itself that is not working.


Reproduction steps

Unable to reproduce reliably.

Expected Result

No plugin error.

Actual Result

Plugin error.

Job file (if appropriate)

I've tried this:

task "example" {
  driver = "raw_exec"

  artifact {
    source      = "..."
    destination = "${NOMAD_ALLOC_DIR}\\services"
  }

  resources {
    cpu    = 100
    memory = 100
    memory_max = 200
  }

  config {
    command = "${NOMAD_ALLOC_DIR}\\services\\example.exe"
  }
}

and wrapped in a script:

task "example" {
  driver = "raw_exec"

  artifact {
    source      = "..."
    destination = "${NOMAD_ALLOC_DIR}\\services"
  }

  resources {
    cpu    = 100
    memory = 100
    memory_max = 200
  }

  template {
    destination     = "local/run.ps1"
    left_delimiter  = "%%"
    right_delimiter = "%%"
    data            = <<EOH
$exe = "%% env "NOMAD_ALLOC_DIR" %%\services\example.exe"
$p = Start-Process $exe -ArgumentList $args -Wait -NoNewWindow -PassThru
Exit $p.ExitCode
EOH
  }

  config {
    command = "C:/Windows/System32/WindowsPowerShell/v1.0/powershell.exe"
    args = ["-File", "local/run.ps1"]
  }
}

Both with the same result.

Nomad Server logs (if appropriate)

No errors or warnings in server logs.

Nomad Client logs (if appropriate)

No errors or warnings in client logs.

Amier3 commented 2 years ago

Hey @cattuz

This issue has come up a few times, but we haven't found a solid root cause yet. One of the recent users to encounter this with the same plugin noted:

After about 50-60 allocations, we receive this error message and no more allocations can be scheduled on that specific client. You mentioned here that this could be linked to the number of allocations running on the client, so I tried manually removing some allocations and the failed ones eventually managed to run. When I tried bringing back the allocations I had stopped, they failed to run with the same error message.

Based on your # of allocations, do you think this is the bug you're experiencing?
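
If it helps with checking that, here's a rough sketch of what I'd run to see the counts, assuming the CLI on that machine can reach your servers (the node ID below is just a placeholder, and I'm going from memory on the flags):

# List clients with a running-allocation count column.
nomad node status -allocs

# Show the allocations placed on one specific client (replace the placeholder ID).
$nodeId = "REPLACE_WITH_NODE_ID"
nomad node status -verbose $nodeId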

cattuz commented 2 years ago

The allocation counts per client are shown in the attached screenshot.

My experience is that it has to do with how long Nomad or the allocations have been running. There is no problem starting around 40 allocations on a fresh VM, but when an allocation gets scheduled on the same VM after a few hours, it fails with the logmon error.

I'm running Nomad as a Windows service on the VM. Would using a scheduled task or something else be better?

Edit: I will be trying out a scheduled task that restarts the Nomad service hourly to see if anything changes:

schtasks /create /tn "Restart_Nomad" /tr "cmd /C 'sc stop Nomad && sc start Nomad'" /sc hourly /ru system /rl highest

Edit 2: No luck with the periodic restart. So it seems it might be to do with the number of allocations rather than some memory/resource leak. Maybe it has something to do with Windows service subprocess limits? https://stackoverflow.com/questions/17472389/how-to-increase-the-maximum-number-of-child-processes-that-can-be-spawned-by-a-w

I will try increasing the subprocess limit and see if that changes anything.
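
For reference, here is roughly how I'm reading the current limit before touching anything (a sketch against the registry path from the Stack Overflow thread; the third number in SharedSection is the desktop heap, in KB, available to non-interactive services, currently 768 on this VM):

# Read the current SubSystems\Windows value to inspect the SharedSection sizes.
(Get-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems" -Name "Windows").Windows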

cattuz commented 2 years ago

Indeed, the Windows "desktop heap" seems to have been the problem. There is more discussion in the service limits Stack Overflow thread linked above.

After increasing the service desktop heap limit from 768 to 4096 by adding the following to my VM setup script...

# Increase the desktop heap available to non-interactive services, which effectively limits
# how many subprocesses a service can spawn. The third SharedSection value is the
# non-interactive desktop heap size in KB, raised here from the 768 default to 4096.
$heapLimits = "%SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,20480,4096 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=sxssrv,4 ProfileControl=Off MaxRequestThreads=16"
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems" -Name "Windows" -Value $heapLimits

...all the logmon issues have gone away. It seems quite logical in hindsight, although I had never run into those Windows service limits before.

There seem to be some caveats, like the change affecting all services running on the VM and potentially making them consume more resources, but so far I have noticed no adverse effects.
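
In case anyone else tries this: as far as I can tell the value is only read when csrss.exe starts, so the VM needs a reboot before the new limit actually applies. A quick way to double-check what ended up in the registry (same path as in the snippet above):

# Confirm the SharedSection setting currently in the registry (takes effect after a reboot).
$value = (Get-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems" -Name "Windows").Windows
($value -split ' ') | Where-Object { $_ -like 'SharedSection=*' }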