hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

docker tasks with cgroups v2 report combined RSS + cache for memory usage #16230

Open cberescu opened 1 year ago

cberescu commented 1 year ago

Nomad version

Nomad v1.4.3 (f464aca721d222ae9c1f3df643b3c3aaa20e2da7)

Operating system and Environment details

NAME="Ubuntu" VERSION="20.04.5 LTS (Focal Fossa)"

Issue

The memory usage reported for Docker containers is not shown correctly. I think the issue comes from: https://github.com/moby/moby/issues/10824#issuecomment-292778896

Reproduction steps

Create a Docker job with an app that creates/reads a lot of files, like an image cache app.

Expected Result

Show correct memory utilization

Actual Result

It shows memory including the cache.

tgross commented 1 year ago

Hi @cberescu! Let me set a bit of context here...

Nomad's reporting of memory usage comes from the task driver, which maps the data we get back from Docker itself. You can see how that's done in stats_posix.go#L24-L38:

ms := &cstructs.MemoryStats{
    RSS:        s.MemoryStats.Stats.Rss,
    Cache:      s.MemoryStats.Stats.Cache,
    Swap:       s.MemoryStats.Stats.Swap,
    MappedFile: s.MemoryStats.Stats.MappedFile,
    Usage:      s.MemoryStats.Usage,
    MaxUsage:   s.MemoryStats.MaxUsage,
    Measured:   measuredMems,
}

The docker stats command docs say this:

On Linux, the Docker CLI reports memory usage by subtracting cache usage from the total memory usage. The API does not perform such a calculation but rather provides the total memory usage and the amount from the cache so that clients can use the data as needed. The cache usage is defined as the value of total_inactive_file field in the memory.stat file on cgroup v1 hosts.

On Docker 19.03 and older, the cache usage was defined as the value of cache field. On cgroup v2 hosts, the cache usage is defined as the value of inactive_file field.
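
As a rough illustration of the arithmetic those docs describe (a sketch, not Docker's or Nomad's actual code), the "usage minus cache" calculation looks something like this in Go, assuming the stats have already been pulled into a map shaped like the one the Docker API returns:

package main

import "fmt"

// usedMemory mirrors the calculation the Docker CLI is documented to do:
// subtract the "cache" figure from total usage. Which key counts as cache
// depends on the cgroup version, as quoted above.
func usedMemory(usage uint64, stats map[string]uint64, cgroupV2 bool) uint64 {
    cache := stats["total_inactive_file"] // cgroup v1
    if cgroupV2 {
        cache = stats["inactive_file"]
    }
    if cache > usage { // bounds check so we never underflow
        return 0
    }
    return usage - cache
}

func main() {
    // Hypothetical numbers: ~1 GiB of total usage, most of it page cache.
    stats := map[string]uint64{"inactive_file": 900 * 1024 * 1024}
    fmt.Println(usedMemory(1<<30, stats, true)) // prints 130023424
}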

We then map these to metrics in the client's setGaugeForMemory method. The values that get reported in places like the UI charts come from the Read Allocation Statistics API, and we use the RSS to display memory usage because cache memory isn't exclusive to the allocation and can be reclaimed by the OS at any time.
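
As a loose sketch of that mapping step (illustrative only, not Nomad's actual setGaugeForMemory code), one gauge is emitted per field the driver actually measured, which is why an rss gauge simply never shows up when RSS couldn't be measured:

package main

import (
    "time"

    "github.com/armon/go-metrics"
)

// setMemoryGauges emits one gauge per field that the task driver actually
// measured. Illustrative only: if "rss" is not in the measured list (as on
// cgroups v2), no nomad.client.allocs.memory.rss gauge is ever emitted.
func setMemoryGauges(labels []metrics.Label, measured []string, values map[string]float32) {
    for _, name := range measured {
        metrics.SetGaugeWithLabels([]string{"client", "allocs", "memory", name}, values[name], labels)
    }
}

func main() {
    // An in-memory sink so the example runs standalone.
    sink := metrics.NewInmemSink(10*time.Second, time.Minute)
    metrics.NewGlobal(metrics.DefaultConfig("nomad"), sink)

    labels := []metrics.Label{{Name: "job", Value: "example"}, {Name: "task", Value: "cache"}}
    // The cgroups v2 case from the stats shown below: usage exists, rss does not.
    setMemoryGauges(labels, []string{"cache", "swap", "usage"},
        map[string]float32{"cache": 0, "swap": 0, "usage": 1253376})
}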

So given all that context :grinning: , where do you feel there's a gap in how the memory is being reported? Are you seeing a discrepancy between what Docker reports and what Nomad reports? A gap between what nomad alloc status reports and the UI? Or is it that you just want to see the cache value reported somewhere?

cberescu commented 1 year ago

Hi @tgross ,

Thanks for explaining it a little more.

What I would love to see is the real memory used, without the cache. I think Docker's s.MemoryStats.Usage has the cache included.

So I can see 2 ways of handling it (I am referring to the UI here).

What do you think?

tgross commented 1 year ago

Ok, I see what you're saying. On cgroups v2 we don't get the RSS value, so we only get Docker's Usage, which includes the cache. That's frustrating and a weird API choice on their part. For example, on a task that's using just over 1MiB of RSS I get the following from the API:

$ curl -s "localhost:4646/v1/client/allocation/b1b3bde3-af12-d884-edf4-5ddbcb4688bc/stats" | jq .ResourceUsage.MemoryStats
{
  "Cache": 0,
  "KernelMaxUsage": 0,
  "KernelUsage": 0,
  "MappedFile": 0,
  "MaxUsage": 0,
  "Measured": [
    "Cache",
    "Swap",
    "Usage"
  ],
  "RSS": 0,
  "Swap": 0,
  "Usage": 1253376
}

You can see from the list of keys under "Measured" that we're unable to get the RSS field value because it's not exposed by Docker. The UI is doing its best to show you something under those circumstances, so it falls back to showing the Usage.
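
For illustration, that fallback amounts to something like the following (a sketch, not the Nomad UI's actual code), using the fields from the stats payload above:

package main

import "fmt"

// Trimmed-down copy of the fields from the /v1/client/allocation/:id/stats
// payload shown above.
type MemoryStats struct {
    RSS      uint64
    Usage    uint64
    Measured []string
}

// displayedMemory is a sketch of the fallback described above: show RSS when
// it was measured, otherwise fall back to Usage, which under cgroups v2
// includes page cache.
func displayedMemory(ms MemoryStats) uint64 {
    for _, m := range ms.Measured {
        if m == "RSS" {
            return ms.RSS
        }
    }
    return ms.Usage
}

func main() {
    // The cgroups v2 response above: RSS is not in Measured, so Usage wins.
    fmt.Println(displayedMemory(MemoryStats{
        Usage:    1253376,
        Measured: []string{"Cache", "Swap", "Usage"},
    }))
}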

I'm going to re-title this issue for clarity and mark it for roadmapping. Thanks for reporting this, @cberescu !

Alexsandr-Random commented 11 months ago

Hi there! I found this issue before creating my own: Nomad doesn't collect the nomad.client.allocs.memory.rss metric for Prometheus when cgroups v2 is used. When cgroups is set to v1, everything is scraped correctly. I've checked this on 3 different independent Nomad clusters, so I'd say this looks like a bug. @tgross, do I need to create a separate issue just for that problem? And when can we approximately expect a fix?

seanamos commented 11 months ago

This can be quite deceptive and at worst can hide real memory issues/leaks if you rely on Nomad's reported memory metrics.

Since the memory usage is consistently over-reported, people begin to ignore 99% memory usage as "normal". However, in the case of a memory leak, real memory usage will be creeping up and you will be blind to it, unless you also gather metrics directly from docker.

tgross commented 11 months ago

Nomad doesn't collect the nomad.client.allocs.memory.rss metric for Prometheus when cgroups v2 is used. When cgroups is set to v1, everything is scraped correctly. I've checked this on 3 different independent Nomad clusters, so I'd say this looks like a bug. @tgross, do I need to create a separate issue just for that problem? And when can we approximately expect a fix?

  1. No, that's this issue.
  2. We don't commit publicly to timelines.

unless you also gather metrics directly from docker.

... and then do math on those metrics because Docker doesn't expose RSS either. So you can do that same math on metrics you get from Nomad. Neither of which is a good solution of course, which is why this is on the roadmap and not a "wontfix" :grinning:

tgross commented 11 months ago

Just for clarity, the code part of patching this is trivial, just a couple bits of arithmetic and bounds checking. But the only thing worse than missing metrics is giving wrong metrics.

The kernel documentation for cgroups v2 doesn't explicitly say which components make up memory.current (the value Docker returns as Usage). Notice in those docs that memory.rss is no longer a value we can get directly from the kernel under cgroups v2. We need to make sure we subtract all of the components that aren't RSS in order to calculate RSS correctly; otherwise we'd report a wrong value.

Because it's not in the docs, that likely means doing a little spelunking in the kernel code, and that's the work that needs to be done to make this happen.
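
In the meantime, a rough way to approximate this yourself is to read the cgroup v2 files directly and subtract inactive_file (what Docker's docs treat as "cache" on v2) from memory.current. This is only a sketch with a hypothetical cgroup path, and it carries exactly the caveat above: whether inactive_file alone is the right set of components to subtract is the open question.

package main

import (
    "bufio"
    "fmt"
    "os"
    "path/filepath"
    "strconv"
    "strings"
)

// approxRSS reads memory.current and memory.stat from a cgroup v2 directory
// and returns usage minus inactive_file. This mirrors Docker's notion of
// "cache" on cgroup v2, but it is only an approximation of RSS; deciding the
// full set of components to subtract is the open question discussed above.
func approxRSS(cgroupDir string) (uint64, error) {
    raw, err := os.ReadFile(filepath.Join(cgroupDir, "memory.current"))
    if err != nil {
        return 0, err
    }
    usage, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
    if err != nil {
        return 0, err
    }

    f, err := os.Open(filepath.Join(cgroupDir, "memory.stat"))
    if err != nil {
        return 0, err
    }
    defer f.Close()

    var inactiveFile uint64
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        fields := strings.Fields(scanner.Text())
        if len(fields) == 2 && fields[0] == "inactive_file" {
            inactiveFile, _ = strconv.ParseUint(fields[1], 10, 64)
        }
    }
    if err := scanner.Err(); err != nil {
        return 0, err
    }

    // Bounds check so a stale or racy read never underflows.
    if inactiveFile > usage {
        return 0, nil
    }
    return usage - inactiveFile, nil
}

func main() {
    // Hypothetical path; on a Nomad client the task's cgroup lives somewhere
    // under /sys/fs/cgroup depending on the driver and Nomad version.
    rss, err := approxRSS("/sys/fs/cgroup/nomad.slice/example.scope")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println(rss)
}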

seanamos commented 11 months ago

What we ended up doing, instead of relying on Nomad's memory metrics, is relying on Datadog's docker.mem.in_use metric, which conveniently also gets tagged automatically with the relevant Nomad tags (job, task, etc.). More importantly, it appears to be accurate.

Out of interest, since the DD agent is open source, I went digging to see how they collect it. They must have run into the same issue, since they don't fetch the memory stats from docker, but rather from the cgroup.

tgross commented 11 months ago

Out of interest, since the DD agent is open source, I went digging to see how they collect it. They must have run into the same issue, since they don't fetch the memory stats from docker, but rather from the cgroup.

That raises another possibility for fixing this, which is to get it done upstream in Docker. But I suspect we have this same problem with metrics for the exec driver and others, so it'd be nice to fix it here for sure.