hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

nomad_client_allocs_memory_kernel_usage metric missing #19642

Closed gmichalec-pandora closed 6 days ago

gmichalec-pandora commented 8 months ago

Nomad version

1.6.2-ent

Operating system and Environment details

Debian Buster

Issue

It seems that at some point, Nomad stopped exporting the nomad_client_allocs_memory_kernel_usage metric. It is still documented in the metrics reference (https://developer.hashicorp.com/nomad/docs/operations/metrics-reference#allocation-metrics), but we no longer see it reported from any metrics endpoint. Granted, kernel memory usage is generally tiny, but we have many dashboards and alerts that include this metric in a total summary of the memory used by a task, like this:

max(
    nomad_client_allocs_memory_rss{namespace="pangea", exported_job="pangea"} 
    + nomad_client_allocs_memory_cache{namespace="pangea", exported_job="pangea"}
    + nomad_client_allocs_memory_kernel_usage{namespace="pangea", exported_job="pangea"}
) by (namespace, exported_job, task_group, task) / on (namespace, exported_job, task_group, task)

Currently, since the metric is missing, all of those dashboards/alerts report no data. Rewriting them to exclude the metric would be daunting, as they are managed by thousands of our users. We are considering exporting the metric ourselves with a constant value of 0 as a workaround, but that is obviously not ideal.
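For anyone trying to reproduce this, a quick spot check directly against a client agent can confirm whether the series is being exported at all. This is a sketch: the address and default port 4646 are assumptions, and Prometheus-format output must be enabled in the agent's telemetry block (prometheus_metrics = true).

```shell
# Count how many nomad_client_allocs_memory_kernel_usage series the local
# Nomad agent currently exposes (0 if absent or if no agent is reachable).
count=$(curl -s --max-time 2 \
  'http://127.0.0.1:4646/v1/metrics?format=prometheus' \
  | grep -c 'nomad_client_allocs_memory_kernel_usage')
if [ "${count:-0}" -gt 0 ]; then
  echo "kernel_usage metric present ($count series)"
else
  echo "kernel_usage metric absent or agent unreachable"
fi
```

Comparing this output between a Buster (cgroup v1) and a Bullseye (cgroup v2) client would show whether the gap tracks the cgroup version.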

jrasell commented 8 months ago

Hi @gmichalec-pandora and thanks for raising this issue.

The nomad.client.allocs.memory.kernel_usage metric, along with several others, is not exposed under cgroup v2. Are you able to confirm which cgroup version your clients are using? The output from my lab can be seen below:

$ mount | grep group
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot) 

While we do have a documentation note that client metrics depend on the task driver and the cgroup version, I wonder if we need to be more explicit in cases like this, where we know the differences between the metrics exposed under v1 and v2.
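As an alternative to parsing mount output, the filesystem type mounted at /sys/fs/cgroup distinguishes the two modes. A minimal sketch (assuming a systemd-managed Linux host):

```shell
# cgroup2fs at /sys/fs/cgroup means a pure (unified) cgroup v2 host;
# tmpfs there means a v1 or hybrid layout with per-controller mounts below it.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null)
case "$fstype" in
  cgroup2fs) echo "cgroup v2 (unified)" ;;
  tmpfs)     echo "cgroup v1 (legacy/hybrid)" ;;
  *)         echo "undetermined: ${fstype:-no /sys/fs/cgroup}" ;;
esac
```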

gmichalec-pandora commented 8 months ago

Hi - Thanks for looking into this! Our clients running Debian Bullseye are indeed on cgroup v2, as confirmed both by the mount command and by the client attribute unique.cgroup.version (with a mountpoint of /sys/fs/cgroup). However, the majority of our client nodes are still on Debian Buster, where the client attribute reports v1 with a mountpoint of /sys/fs/cgroup/systemd. The mount output on the Buster nodes is:

gmichalec@sv7-corp-docker10:~$  mount | grep group
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
jrasell commented 8 months ago

Hi @gmichalec-pandora and thanks for coming back with additional information. When exporting client allocation metrics we only emit non-zero gauges; this behavior originates from #10376. It is therefore possible that your tasks are not using any kernel memory, so the metric is filtered out by that check. When that happens, the metric is garbage collected by the reporting library and no longer shows up, because Nomad does not pre-define metrics.

Are you able to inspect a cgroup v1 memory stats file and share any kernel usage values?
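On a cgroup v1 host, per-cgroup kernel memory accounting is exposed in the memory controller's memory.kmem.usage_in_bytes file. A hedged sketch of such a check follows; the docker cgroup path is an assumption and should be adjusted to wherever the task's cgroup actually lives:

```shell
# Print kernel memory usage for each container cgroup under the (assumed)
# docker hierarchy of the cgroup v1 memory controller. Prints nothing if
# the path does not exist on this host.
cg=/sys/fs/cgroup/memory/docker   # adjust to the task's actual cgroup parent
for f in "$cg"/*/memory.kmem.usage_in_bytes; do
  [ -e "$f" ] || continue
  printf '%s %s\n' "$f" "$(cat "$f")"
done
```

Values of 0 here would be consistent with the non-zero-gauge filtering described above.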

jrasell commented 6 days ago

Closing as there has been no response to my previous comment. If you continue to see this issue, please feel free to re-open it or open a new one.