hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Memory metrics for raw_exec tasks are missing #14490

Open mr-karan opened 2 years ago

mr-karan commented 2 years ago

Nomad version

Nomad v1.3.1 (2b054e38e91af964d1235faa98c286ca3f527e56)

Operating system and Environment details

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04 LTS
Release:    22.04
Codename:   jammy

Issue

For tasks running with the raw_exec task driver, memory-related metrics aren't being exported. Is this expected behaviour?

Reproduction steps

Run a task as raw_exec:

job "sleep" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    count = 1
    network {
      mode = "bridge"
      port "python-http" {
        to = "8888"
      }
    }

    task "app" {
      driver = "raw_exec"

      config {
        command = "bash"
        args    = ["-c", "sleep infinity"]
      }
    }
  }
}

When querying for metrics, notice that nomad_client_allocs_memory_usage is not exported for this task. If the task driver is changed to exec, the memory-related metrics start showing up.

However, this isn't the case for CPU-related metrics. For example, nomad_client_allocs_cpu_total_ticks works for raw_exec.
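One way to confirm which tasks lack a given metric is to diff the task sets per metric name in the Prometheus-format metrics output. The sketch below is only an illustration: it assumes the metrics text has already been fetched from the agent, and that each sample carries a `task` label. The helper name `tasks_with_metric` is made up for this example.

```python
import re

# Matches "metric_name{labels} value" or "metric_name value" sample lines.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+\S+')
# Matches individual label="value" pairs inside the braces.
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def tasks_with_metric(exposition_text, metric, label="task"):
    """Return the set of `label` values seen on samples of `metric`.

    `exposition_text` is Prometheus text-format output; the `task` label
    name is an assumption about how the samples are labelled.
    """
    tasks = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if not match or match.group(1) != metric:
            continue
        labels = dict(_LABEL.findall(match.group(2) or ""))
        if label in labels:
            tasks.add(labels[label])
    return tasks
```

Subtracting the set for nomad_client_allocs_memory_usage from the set for nomad_client_allocs_cpu_total_ticks then lists the tasks that report CPU but not memory usage.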

shoenig commented 2 years ago

Hi @mr-karan, thanks for the issue!

Yes, this is expected behavior. What's happening is raw_exec makes use of /proc/<pid>/statm to look up memory statistics of each individual PID associated with the process group of your task. The value we can get from here is the RSS, which we sum together and report as nomad_client_allocs_memory_rss. As you can imagine, there's room for error in this methodology: we have to build up the process tree manually, look up RSS values individually, etc., which leaves room for racy / inaccurate values.
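To make the per-PID approach concrete, here is a rough sketch of summing RSS from statm files. It is not Nomad's actual implementation. It assumes the PID list has already been collected, and the function names and the `proc_root` / `page_size` parameters are invented for illustration. The second field of statm is the resident set size in pages.

```python
import os


def rss_pages_from_statm(statm_text):
    """Return the resident set size, in pages, from statm file contents."""
    # statm fields: size resident shared text lib data dt (all in pages)
    fields = statm_text.split()
    return int(fields[1])


def sum_rss_bytes(pids, proc_root="/proc", page_size=None):
    """Sum RSS in bytes over `pids`, skipping processes that have exited."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    total = 0
    for pid in pids:
        path = os.path.join(proc_root, str(pid), "statm")
        try:
            with open(path) as f:
                total += rss_pages_from_statm(f.read()) * page_size
        except (FileNotFoundError, ProcessLookupError):
            # The process went away between discovery and read; this is
            # one source of the raciness described above.
            continue
    return total
```

Each PID is read at a slightly different moment, so the total is not a consistent snapshot of the process tree.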

In contrast, exec leverages cgroups to atomically ask the kernel for the memory usage of the whole process group by reading /sys/fs/cgroup/nomad.slice/<group>/memory.stat (using cgroups v2). IIRC the value reported here includes cached memory, and as such is labeled "usage" instead of strictly just RSS.
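For comparison, a cgroups v2 memory.stat file is a flat list of "key value" lines, so reading it takes a single file read for the whole group. The sketch below only parses that format. The function names are made up, and which keys Nomad combines into its "usage" figure is not stated in this thread, so none are picked here.

```python
def parse_memory_stat(text):
    """Parse cgroups v2 memory.stat contents into a dict of int counters."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or unexpected lines
        key, value = parts
        stats[key] = int(value)
    return stats


def read_memory_stat(path):
    """Read and parse a memory.stat file at `path` (supplied by the caller)."""
    with open(path) as f:
        return parse_memory_stat(f.read())
```

In cgroups v2 the `anon` key counts anonymous memory and the `file` key counts page cache, which is why a cgroup-based figure can be larger than a plain RSS sum.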

All that being said, we could enhance the raw_exec driver to also report the cgroups-based memory value, when running on a Linux machine. Is that something you feel would be helpful?

mr-karan commented 2 years ago

Is that something you feel would be helpful?

I think it's a helpful metric to export, yes. I have a namespace where I deploy jobs with both raw_exec and exec. I am plotting the values in Grafana, where I visualise memory usage for each namespace. So by default, I get an inaccurate graph because jobs with raw_exec are excluded.

I noticed that when we run nomad alloc status -stats <alloc-id>, the RSS value is shown:

Task "app" is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/100 MHz  45 MiB/300 MiB  300 MiB  

Memory Stats
RSS     Swap
45 MiB  0 B

Maybe for consistency's sake we could export that metric as well; the cgroups approach also sounds fine to me.