hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad client processes consume a lot of CPU on Windows Server #23323

Open Settler opened 3 months ago

Settler commented 3 months ago

Nomad version

Nomad v1.8.0 BuildDate 2024-05-28T17:38:17Z Revision 28b82e4b2259fae5a62e2ed47395334bea5a24c4

Operating system and Environment details

Windows Server 2019

Issue

Previously we reported issue #20042, which was fixed in 1.8.0. We ran more tests and found that the problem of abnormal CPU consumption on Windows still exists in 1.8.0. Here is another graph comparing the behaviour of Nomad 1.6.10 and 1.8.0: [image]

Upper graph – server with Nomad v1.6.10 (under a load of 81 running allocations)
Bottom graph – server with Nomad v1.8.0 (clean server with the new Nomad version and no running allocations)
Red horizontal lines with spikes – 95th percentile of overall CPU on the machine
Blue areas – sum of the CPU consumption of all Nomad processes

And again we've tested different workloads (on each server at the same time):

As you can see, Nomad 1.6.10 won again. Even though the server with Nomad 1.6.10 was already under a load of 81 allocations, when we added another 200 allocations it consumed less CPU than the server with Nomad 1.8.0. Both servers have the same specs, and we can clearly see that the main CPU consumers on the 1.8.0 server were the Nomad processes themselves.

In a Linux environment everything is okay.

Reproduction steps

Run the example job file on Windows. Tweak the alloc count variable to reach ~75% overall CPU consumption.

Expected Result

CPU consumption of the Nomad processes is the same as in the 1.6.x versions.

Actual Result

Abnormally high CPU consumption by the Nomad processes.

Job file (if appropriate)

locals {
  allocs = 100
}

job "PSTest" {
  type = "service"

  group "PSTest-Group" {
    count = local.allocs

    task "PSTest" {
      driver = "raw_exec"
      kill_timeout = "10s"

      config {
        command = "powershell"
        args = ["./test.ps1"]
      }

      resources {
        cpu = 50
        memory = 100
      }        

      template {
        data = <<EOH

for (($i = 0); $i -lt 3600; $i++)
{
    Write-Host $i
    Start-Sleep -Seconds 180
}
                EOH
        destination = "./test.ps1"
        change_mode = "noop"
      }
    }
  }
}
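
Since the reproduction involves tweaking the allocation count, one option is to expose it as an HCL2 input variable instead of a local, so it can be changed per run without editing the file. The following is only a sketch of the same job under that assumption; the variable name `allocs` and the file name used in the note below are illustrative and not part of the original report.

```hcl
# Sketch: same job as above, with the allocation count as an input variable.
# The variable name "allocs" is an assumption made for illustration.
variable "allocs" {
  type        = number
  default     = 100
  description = "Number of PSTest allocations to place"
}

job "PSTest" {
  type = "service"

  group "PSTest-Group" {
    # Overridable at submit time via -var
    count = var.allocs

    task "PSTest" {
      driver       = "raw_exec"
      kill_timeout = "10s"

      config {
        command = "powershell"
        args    = ["./test.ps1"]
      }

      resources {
        cpu    = 50
        memory = 100
      }

      # Renders the PowerShell script into the task directory
      template {
        data = <<EOH
for (($i = 0); $i -lt 3600; $i++)
{
    Write-Host $i
    Start-Sleep -Seconds 180
}
        EOH
        destination = "./test.ps1"
        change_mode = "noop"
      }
    }
  }
}
```

With a variant like this, the count could be overridden when submitting the job, for example `nomad job run -var "allocs=200" pstest.nomad.hcl` (the file name is hypothetical).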

tgross commented 3 months ago

Hi @Settler! Just out of curiosity, do you see the same results in Nomad 1.8.x when the job doesn't have a template block? We made some fairly large changes to template rendering on Windows which we're likely to back out very soon. But if those are the source of heavy CPU utilization that would help me make the case to do the work even sooner.

Settler commented 3 months ago

Hi @tgross! We ran the sample job without the template block (we placed the PowerShell script file on the server instead) and didn't see any difference. [image] This graph shows 100 and 200 allocations on the server.
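
For reference, a template-free variant of the job might look like the sketch below. It assumes the script was copied by hand to a hypothetical path, `C:/nomad-test/test.ps1`, on the client; the actual location used in this test is not stated in the thread.

```hcl
# Sketch: the sample job with the template block removed.
# The script path is hypothetical; the script must already exist on the client.
locals {
  allocs = 100
}

job "PSTest" {
  type = "service"

  group "PSTest-Group" {
    count = local.allocs

    task "PSTest" {
      driver       = "raw_exec"
      kill_timeout = "10s"

      config {
        command = "powershell"
        # Forward slashes avoid backslash escaping in the HCL string
        args    = ["C:/nomad-test/test.ps1"]
      }

      resources {
        cpu    = 50
        memory = 100
      }
    }
  }
}
```

Because nothing is rendered at task start in this variant, any remaining CPU overhead should not be attributable to template rendering, which is the comparison asked for above.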

tgross commented 3 months ago

Ok, sorry to hear that wasn't the case. I'll mark this issue for further investigation.