hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

`nomad.client.allocs.memory.usage` reports 0 for jobs with raw_exec driver #9073

[Open] fredwangwang opened this issue 3 years ago

fredwangwang commented 3 years ago

Nomad version

Nomad v0.12.3 (2db8abd9620dd41cb7bfe399551ba0f7824b3f61)

Operating system and Environment details

This happens on both Windows and Linux nodes.

Issue

Nomad reports an incorrect value (0) for the `nomad.client.allocs.memory.usage` metric for jobs that use the raw_exec driver.

Reproduction steps

Deploy the following job (e.g. with `nomad job run`):

job "fake-joblinux" {
  region      = "global"
  datacenters = [ "some-dc" ]

  type        = "service"

  update {
    stagger      = "10s"
    max_parallel = 1
  }

  group "fake-service-api-grouplinux" {
    count = 1
    constraint {
      attribute = "${attr.kernel.name}"
      value     = "linux"
    }

    task "fake-api" {
      driver = "raw_exec"

      config {
        command = "ping"
        args = ["8.8.8.8"]
      }

      resources {
        cpu    = 200
        memory = 200
      }
    }
  }
}
```

then query `http://host-with-above-alloc/v1/metrics` on the client node running the allocation:

```json
{
  "Labels": {
    "alloc_id": "90025118-3928-68ad-d556-20633856ff52",
    "task": "fake-api",
    "namespace": "default",
    "job": "fake-joblinux",
    "task_group": "fake-service-api-grouplinux"
  },
  "Name": "nomad.client.allocs.memory.usage",
  "Value": 0
},
```
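For convenience, here is a minimal sketch (mine, not from the issue) of fetching the agent's metrics endpoint and printing just this gauge. It assumes the default agent port 4646 and the go-metrics JSON shape shown above; the host name is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// metricsSummary models only the part of the /v1/metrics payload we need.
type metricsSummary struct {
	Gauges []struct {
		Name   string            `json:"Name"`
		Value  float64           `json:"Value"`
		Labels map[string]string `json:"Labels"`
	} `json:"Gauges"`
}

func main() {
	// Placeholder host: replace with the client node running the alloc.
	resp, err := http.Get("http://host-with-above-alloc:4646/v1/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var summary metricsSummary
	if err := json.NewDecoder(resp.Body).Decode(&summary); err != nil {
		panic(err)
	}
	for _, g := range summary.Gauges {
		if g.Name == "nomad.client.allocs.memory.usage" {
			fmt.Printf("alloc=%s task=%s value=%v\n",
				g.Labels["alloc_id"], g.Labels["task"], g.Value)
		}
	}
}
```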

Job file (if appropriate)

see above

tgross commented 3 years ago

Hi @fredwangwang, I was able to verify this on current master on Linux as well. The following, slightly more minimal, job demonstrates it against a dev-mode Nomad agent:

job "example" {
  datacenters = [ "dc1" ]

  group "group" {

    task "task" {
      driver = "raw_exec"

      config {
        command = "ping"
        args = ["8.8.8.8"]
      }

      resources {
        cpu    = 200
        memory = 200
      }
    }
  }
}
```
Resulting metrics output (every entry carries the same Labels, shown once here):

```json
"Labels": {
  "alloc_id": "ccfbfe98-911c-0c10-2cdc-af8bac62820f",
  "host": "linux",
  "job": "example",
  "namespace": "default",
  "task": "task",
  "task_group": "group"
}
```

```json
{ "Name": "nomad.client.allocs.cpu.allocated", "Value": 200 }
{ "Name": "nomad.client.allocs.cpu.system", "Value": 0.016665318980813026 }
{ "Name": "nomad.client.allocs.cpu.throttled_periods", "Value": 0 }
{ "Name": "nomad.client.allocs.cpu.throttled_time", "Value": 0 }
{ "Name": "nomad.client.allocs.cpu.total_percent", "Value": 0.016665318980813026 }
{ "Name": "nomad.client.allocs.cpu.total_ticks", "Value": 0.38380229473114014 }
{ "Name": "nomad.client.allocs.cpu.user", "Value": 0 }
{ "Name": "nomad.client.allocs.memory.allocated", "Value": 209715200 }
{ "Name": "nomad.client.allocs.memory.cache", "Value": 0 }
{ "Name": "nomad.client.allocs.memory.kernel_max_usage", "Value": 0 }
{ "Name": "nomad.client.allocs.memory.kernel_usage", "Value": 0 }
{ "Name": "nomad.client.allocs.memory.max_usage", "Value": 0 }
{ "Name": "nomad.client.allocs.memory.rss", "Value": 33705984 }
{ "Name": "nomad.client.allocs.memory.swap", "Value": 0 }
{ "Name": "nomad.client.allocs.memory.usage", "Value": 0 }
```

I've also verified that this appears to be working as expected with the exec driver, and that the particular workload doesn't seem to matter (note that we do see metrics like rss reported). One interesting thing here is that we have a nightly end-to-end test covering exactly this metric on both Linux and Windows, and it passes. (ref https://github.com/hashicorp/nomad/tree/master/e2e/metrics)

tgross commented 3 years ago

> One interesting thing here is that we have a nightly end-to-end test covering exactly this metric on both Linux and Windows that passes.

I took a second look at that, and I realized we don't actually test that the results are non-zero. That might be where this is slipping through. I'll dig into it further; a hypothetical tightened check is sketched below.
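For what it's worth, a tightened assertion might look like this sketch; the helper name and the gauges map are illustrative, not the actual e2e harness.

```go
package metrics_test

import "testing"

// requireNonZeroGauge fails the test if the named gauge is missing or
// zero, rather than only checking that the metric exists at all.
func requireNonZeroGauge(t *testing.T, gauges map[string]float64, name string) {
	t.Helper()
	v, ok := gauges[name]
	if !ok {
		t.Fatalf("gauge %q not found in metrics output", name)
	}
	if v == 0 {
		t.Fatalf("gauge %q is zero; expected a non-zero value", name)
	}
}
```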

dbkukku commented 3 years ago

Hey, I'm also seeing discrepancies between the actual CPU and memory utilization of the executable on the node and the values shown in the Nomad UI.

tgross commented 3 years ago

@dbkukku can you open a new issue for that explaining what you're seeing in more detail? That seems like it's a different problem.

fredwangwang commented 3 years ago

I dug a little bit deeper and I believe I found the issue.

The flow is approximately (see the sketch after this list):

  1. the task runner asks the driver for resource usage
  2. the driver asks its underlying runtime (Docker, Java, etc.) for resource usage
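In rough Go terms, the two hops look like this; the interface and type names here are illustrative stand-ins, not Nomad's exact internals.

```go
package flowsketch

// ResourceUsage is a trimmed stand-in for the stats a driver returns.
type ResourceUsage struct {
	MemoryRSS   uint64
	MemoryUsage uint64
	Measured    []string // which fields the driver actually collected
}

// Driver is a stand-in for the driver plugin interface.
// Hop 1: the task runner calls this on the driver.
type Driver interface {
	TaskStats(taskID string) (*ResourceUsage, error)
}

// Hop 2: inside TaskStats, each driver asks its underlying runtime
// (the Docker daemon, the JVM, a raw OS process, ...) and translates
// that runtime's answer into a ResourceUsage.
```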

Since resource usage is reported correctly for docker and exec, for example, the issue has to be in the raw_exec driver. That leads me to these lines here and here:

```go
// The universal executor (used by raw_exec) measures only these two stats:
ExecutorBasicMeasuredMemStats = []string{"RSS", "Swap"}

ms.RSS = memInfo.RSS
ms.Swap = memInfo.Swap
ms.Measured = ExecutorBasicMeasuredMemStats // propagated back to the task runner
```

The raw_exec driver only exposes those two metrics. That makes sense, since raw_exec depends heavily on the underlying OS to provide the metrics, and what is available can differ hugely between platforms.

But since what has been measured is propagated back to the task runner by setting that Measured []string here, I think it would make more sense to only show the metrics that are actually collected, instead of showing 0.

@tgross could you provide some thoughts on this? Thanks!

tgross commented 3 years ago

Hi @fredwangwang looks like that's exactly it!

I compared the results and code path of the raw_exec driver to the exec driver running the exact same workload. When we collect the stats from exec, we're hitting the libcontainer path in executor_linux.go#L353-L363, whereas in raw_exec we only ever hit the pid collector in the "universal executor" in executor.go#L595-L599.

The pid_collector calls into the gopsutil library's MemoryInfo. I tried to see if we could get that additional data out of the pid collector or derive it somehow, but it looks like gopsutil throws away the platform-specific "extended" memory stats available from fillFromStatmWithContext, which is probably just as well because there's a bug there. šŸ˜ (https://github.com/shirou/gopsutil/issues/277)
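To make that concrete, here is a minimal standalone illustration (mine, not Nomad code) of what a pid collector can get from gopsutil: MemoryInfo exposes RSS-level numbers and swap, but nothing resembling the cgroup usage figure.

```go
package main

import (
	"fmt"
	"os"

	"github.com/shirou/gopsutil/process"
)

func main() {
	// Inspect our own process, the way a pid collector walks the
	// pids of a raw_exec task.
	p, err := process.NewProcess(int32(os.Getpid()))
	if err != nil {
		panic(err)
	}
	mem, err := p.MemoryInfo()
	if err != nil {
		panic(err)
	}
	// Only RSS/VMS/Swap are available here; there is no cgroup
	// "usage" counter for raw_exec to report.
	fmt.Printf("rss=%d vms=%d swap=%d\n", mem.RSS, mem.VMS, mem.Swap)
}
```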

I think you're right that the best approach would be to not report the metrics that we're not collecting. It looks like the gauges are being written in the client, in the task runner: task_runner.go#L1296-L1323. The next step is to figure out why that Measured field isn't being checked before writing those; a sketch of that direction follows.
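For illustration, the fix might look roughly like the sketch below (my guess at the shape, not the actual patch). It assumes a trimmed stand-in for Nomad's memory-stats struct and the SetGaugeWithLabels call from the armon/go-metrics library Nomad already uses.

```go
package sketch

import metrics "github.com/armon/go-metrics"

// MemoryStats is a trimmed stand-in for Nomad's internal stats struct.
type MemoryStats struct {
	RSS      uint64
	Usage    uint64
	Measured []string
}

// emitMeasuredMemoryGauges writes a gauge only for stats the driver
// reported in Measured, instead of emitting 0 for everything else.
func emitMeasuredMemoryGauges(ms *MemoryStats, labels []metrics.Label) {
	measured := make(map[string]bool, len(ms.Measured))
	for _, name := range ms.Measured {
		measured[name] = true
	}
	if measured["RSS"] {
		metrics.SetGaugeWithLabels([]string{"client", "allocs", "memory", "rss"},
			float32(ms.RSS), labels)
	}
	if measured["Usage"] {
		metrics.SetGaugeWithLabels([]string{"client", "allocs", "memory", "usage"},
			float32(ms.Usage), labels)
	}
	// ...and similarly for cache, swap, max_usage, kernel_usage, etc.
}
```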

fredwangwang commented 3 years ago

Thanks for following up @tgross!

I think we should be able to work with the metrics we are getting (RSS specifically), but I still need to understand that a bit more.

The confusion was mainly that certain metrics report 0 when, in my opinion, they should simply not show up at all. Looking forward to seeing this fixed, or to hearing more about why they are kept at 0! šŸ˜€

tgross commented 3 years ago

I had a conversation with folks here internally: that section of task_runner should be checking the Measured field, and it just isn't. I'm going to double-check that fixing it doesn't introduce unexpected behavior with the other task drivers, but otherwise it should be a small fix.

fredwangwang commented 3 years ago

Thanks @tgross!

A small note: the docs at https://www.nomadproject.io/docs/telemetry/metrics#allocation-metrics probably need to be updated as well, to call out that the memory metrics emitted depend on the driver type a task uses.

fredwangwang commented 3 years ago

Just for reference: memory.usage is reported by cgroups (the resource container), which is why it is not available for the raw_exec driver type. memory.usage == RSS (actual memory usage) + CACHE (page cache) + SWAP.

Use memory.rss instead to get the actual memory usage; that is also the number Nomad shows in its UI.
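As an illustrative calculation using the rss value from the metrics dump above: had cgroup accounting been available, usage would have come out to 33705984 (rss) + 0 (cache) + 0 (swap) = 33705984 bytes, roughly 32 MiB.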

it-ben commented 2 months ago

Any update on this? I would love to have that available.

jrasell commented 2 months ago

Hi @it-ben, there are no updates currently. When an engineer is assigned to this and working on it, updates will be provided in the issue.