Nomad metric nomad.client.allocs.memory.rss in docker with cgroup v2 - unavailable #19185

Open Alexsandr-Random opened 7 months ago

Alexsandr-Random commented 7 months ago

Nomad version

Any

Operating system and Environment details

NAME="Ubuntu" VERSION="22.04 LTS"

Issue

When we use cgroups v2, the Nomad agent stops sending some metrics to Prometheus. The most important one for us is nomad.client.allocs.memory.rss. When cgroups is set to v1, everything is scraped correctly. I have tested this on 3 different independent Nomad clusters, so I can say this looks like a bug.
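For reference, a quick way to check which cgroup version a host is running is to look at the filesystem type mounted at /sys/fs/cgroup (this prints cgroup2fs on a v2 host and tmpfs on a v1 host):

```
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
```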

Reproduction steps

Start a Docker job on a recent Ubuntu release where cgroups v2 is enabled by default, enable the telemetry stanza on the Nomad client to send all metrics to Prometheus (a sample stanza is sketched below), and check whether nomad.client.allocs.memory.rss is scraped correctly.
Then start the same job with cgroups v1 and nomad.client.allocs.memory.rss will be reported again.
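For completeness, a minimal client telemetry stanza for this setup might look like the sketch below (the collection interval is an assumption; the other parameters are the standard Nomad telemetry options for publishing allocation metrics to Prometheus):

```hcl
telemetry {
  collection_interval        = "10s" # assumed; tune to your Prometheus scrape interval
  prometheus_metrics         = true  # expose /v1/metrics?format=prometheus
  publish_allocation_metrics = true  # per-allocation stats such as memory RSS
  publish_node_metrics       = true  # node-level resource stats
}
```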

Expected Result

nomad.client.allocs.memory.rss is scraped on any Ubuntu version and with either cgroups version.

Actual Result

nomad.client.allocs.memory.rss is scraped only with cgroups v1.

jrasell commented 7 months ago

Hi @Alexsandr-Random and thanks for raising this issue.

The memory implementation of cgroups v2 does not expose RSS, so the Docker driver does not support it there. The difference between the memory statistics the Docker driver makes available under each cgroup version can be seen here.

It seems there might be an equivalent when interrogating the cgroup v2 memory files; however, I am unsure whether it would be possible to plumb this through to the Docker driver, because the driver has no direct understanding of the isolation, which is left to Docker itself. I'll keep this issue open for future readers and mark it as requiring further investigation.

Alexsandr-Random commented 7 months ago

@jrasell Thanks for the quick response! So for now, the only solution to get metrics like memory RSS on the latest distros is to manually downgrade cgroups v2 --> v1?

tgross commented 7 months ago

> So for now, the only solution to get metrics like memory RSS on the latest distros is to manually downgrade cgroups v2 --> v1?

I looked into what the Docker API is exposing to us. Unfortunately, the API docs don't exactly match what their CLI does (ref https://github.com/moby/moby/issues/45727 and https://github.com/moby/moby/issues/45739).

stats API example:

```
$ curl -s --unix-socket /run/docker.sock "http://localhost/containers/14b9ea15.../stats?stream=false&oneshot=true" | jq .memory_stats
{
  "usage": 1441792,
  "stats": {
    "active_anon": 4096,
    "active_file": 8192,
    "anon": 151552,
    "anon_thp": 0,
    "file": 970752,
    "file_dirty": 0,
    "file_mapped": 884736,
    "file_writeback": 0,
    "inactive_anon": 147456,
    "inactive_file": 962560,
    "kernel_stack": 16384,
    "pgactivate": 2,
    "pgdeactivate": 0,
    "pgfault": 1504,
    "pglazyfree": 0,
    "pglazyfreed": 0,
    "pgmajfault": 237,
    "pgrefill": 0,
    "pgscan": 0,
    "pgsteal": 0,
    "shmem": 0,
    "slab": 232632,
    "slab_reclaimable": 114728,
    "slab_unreclaimable": 117904,
    "sock": 0,
    "thp_collapse_alloc": 0,
    "thp_fault_alloc": 0,
    "unevictable": 0,
    "workingset_activate": 0,
    "workingset_nodereclaim": 0,
    "workingset_refault": 0
  },
  "limit": 67078340608
}
```

But if I take a look at the cgroup for the container I just queried, I can see that memory.current maps directly to what Docker calls "usage":

```
$ curl -s --unix-socket /run/docker.sock "http://localhost/containers/14b9ea15.../stats?stream=false&oneshot=true" | jq .memory_stats.usage
1441792

$ sudo cat /sys/fs/cgroup/system.slice/docker-14b9ea15....scope/memory.current
1441792
```

My understanding from the kernel docs is that memory.current covers everything, and there is simply a much more fine-grained set of stats available. We could probably expose some of those stats, and maybe look into whether we can get the exact combination of items that adds up to the coarse "RSS" stat folks are used to.
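For anyone who wants to inspect those fine-grained stats directly, the per-container cgroup exposes them in memory.stat (abridged output below, mirroring the Docker API response above; treating anon as the closest analogue of the v1 RSS figure is my reading of the kernel docs, not something Nomad does today):

```
$ sudo cat /sys/fs/cgroup/system.slice/docker-14b9ea15....scope/memory.stat
anon 151552
file 970752
kernel_stack 16384
shmem 0
...
```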

In the meantime, you can derive the rough equivalent of RSS by subtracting Cache and Swap from Usage, as sketched below.
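In Prometheus terms, that derivation might look like the expression below (a sketch; it assumes the usage, cache, and swap series are all being published by the client):

```
# Rough cgroup v2 substitute for the missing RSS metric:
# usage minus page cache minus swap, per allocation
nomad_client_allocs_memory_usage
  - nomad_client_allocs_memory_cache
  - nomad_client_allocs_memory_swap
```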

tgross commented 6 months ago

See also https://github.com/hashicorp/nomad/issues/19604

Alexsandr-Random commented 5 months ago

Yes, I can confirm what is written in https://github.com/hashicorp/nomad/issues/19604.

Nomad indeed displays accurate metrics (nomad_client_allocs_memory_usage) only when using cgroups v2.

If cgroups = v1, the correct memory-percentage metric is the ratio

nomad_client_allocs_memory_rss / nomad_client_allocs_memory_allocated

Otherwise, you may observe behavior where memory is either immediately ~= 100% or appears to be a memory leak on graphs, although that is not the case.

That is what you might encounter when using the expression below on a cgroups v1 host:

nomad_client_allocs_memory_usage / nomad_client_allocs_memory_allocated
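Putting the two comments together, a cgroup-version-aware pair of dashboard expressions might look like this sketch (which one applies depends on the client host's cgroup version; the v2 variant reuses the usage - cache - swap derivation suggested above):

```
# cgroups v1 hosts: RSS is exported directly
nomad_client_allocs_memory_rss
  / nomad_client_allocs_memory_allocated

# cgroups v2 hosts: derive an RSS-like value instead
(nomad_client_allocs_memory_usage
  - nomad_client_allocs_memory_cache
  - nomad_client_allocs_memory_swap)
  / nomad_client_allocs_memory_allocated
```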