cberescu opened 1 year ago
Hi @cberescu! Let me set a bit of context here...
Nomad's reporting of memory usage comes from the task driver, which is mapping the data we get back from Docker itself. You can see how that's done in stats_posix.go#L24-L38:
ms := &cstructs.MemoryStats{
    RSS:        s.MemoryStats.Stats.Rss,
    Cache:      s.MemoryStats.Stats.Cache,
    Swap:       s.MemoryStats.Stats.Swap,
    MappedFile: s.MemoryStats.Stats.MappedFile,
    Usage:      s.MemoryStats.Usage,
    MaxUsage:   s.MemoryStats.MaxUsage,
    Measured:   measuredMems,
}
The docker stats command docs say this:
On Linux, the Docker CLI reports memory usage by subtracting cache usage from the total memory usage. The API does not perform such a calculation but rather provides the total memory usage and the amount from the cache so that clients can use the data as needed. The cache usage is defined as the value of total_inactive_file field in the memory.stat file on cgroup v1 hosts.
On Docker 19.03 and older, the cache usage was defined as the value of cache field. On cgroup v2 hosts, the cache usage is defined as the value of inactive_file field.
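For illustration, the calculation those docs describe amounts to something like this (a sketch, not Docker or Nomad source; the map keys mirror the memory.stat fields named in the quote, and the function name is mine):

```go
// usageWithoutCache sketches the Docker CLI-style calculation described above:
// total usage minus the "cache" component, whose definition depends on the
// cgroup version of the host.
func usageWithoutCache(usage uint64, stats map[string]uint64, cgroupV2 bool) uint64 {
	var cache uint64
	if cgroupV2 {
		cache = stats["inactive_file"] // cgroup v2 hosts
	} else {
		cache = stats["total_inactive_file"] // cgroup v1 hosts, Docker newer than 19.03
	}
	if cache > usage {
		return 0 // guard against underflow
	}
	return usage - cache
}
```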
We then map these to metrics in the client's setGaugeForMemory method. The values that get reported in places like the UI charts come from the Read Allocation Statistics API, and we use the RSS to display memory usage, because cache memory isn't exclusive to the allocation and can be reclaimed by the OS at any time.
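As a rough illustration of that mapping (this is not the actual setGaugeForMemory implementation; the metric keys and labels here are simplified assumptions, using the go-metrics library):

```go
package metricsdemo

import metrics "github.com/hashicorp/go-metrics"

// setMemoryGauges sketches how memory stats might be turned into gauges.
// Metric names and labels are simplified assumptions, not copied from Nomad.
func setMemoryGauges(allocID, task string, rss, cache, usage uint64) {
	labels := []metrics.Label{
		{Name: "alloc_id", Value: allocID},
		{Name: "task", Value: task},
	}
	metrics.SetGaugeWithLabels([]string{"client", "allocs", "memory", "rss"}, float32(rss), labels)
	metrics.SetGaugeWithLabels([]string{"client", "allocs", "memory", "cache"}, float32(cache), labels)
	metrics.SetGaugeWithLabels([]string{"client", "allocs", "memory", "usage"}, float32(usage), labels)
}
```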
So given all that context :grinning:, where do you feel there's a gap in how the memory is being reported? Are you seeing a discrepancy between what Docker reports and what Nomad reports? A gap between what nomad alloc status reports and the UI? Or is it that you just want to see the cache value reported somewhere?
Hi @tgross,
Thanks for explaining it a little more.
What I would love to see is the real memory used without the cache. I think Docker's s.MemoryStats.Usage includes the cache.
So I can see two ways to handle this (I am referring to the UI here). What do you think?
Ok, I see what you're saying. On cgroups v2 we don't get the RSS value, so we only get Docker's Usage, which includes the cache. That's frustrating and a weird API choice on their part. For example, on a task that's using just over 1MiB of RSS I get the following from the API:
$ curl -s "localhost:4646/v1/client/allocation/b1b3bde3-af12-d884-edf4-5ddbcb4688bc/stats" | jq .ResourceUsage.MemoryStats
{
  "Cache": 0,
  "KernelMaxUsage": 0,
  "KernelUsage": 0,
  "MappedFile": 0,
  "MaxUsage": 0,
  "Measured": [
    "Cache",
    "Swap",
    "Usage"
  ],
  "RSS": 0,
  "Swap": 0,
  "Usage": 1253376
}
You can see from the list of keys under "Measured" that we're unable to get the RSS field value because it's not exposed by Docker. The UI is doing its best to show you something under those circumstances, so it falls back to showing the Usage.
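To spell that fallback out, a consumer of this API could do something like the following (a sketch; the struct just mirrors the JSON above and the logic is my reading of the UI's behaviour, not its actual code):

```go
// MemoryStats mirrors the JSON shown above; it is not Nomad's actual type.
type MemoryStats struct {
	RSS      uint64
	Usage    uint64
	Measured []string
}

// displayedMemory returns RSS when it was actually measured, and otherwise
// falls back to Usage, which is what ends up happening on cgroups v2 hosts.
func displayedMemory(m MemoryStats) uint64 {
	for _, field := range m.Measured {
		if field == "RSS" {
			return m.RSS
		}
	}
	return m.Usage
}
```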
I'm going to re-title this issue for clarity and mark it for roadmapping. Thanks for reporting this, @cberescu!
Hi there! Found this issue before creating my own. Nomad doesn't collect the nomad.client.allocs.memory.rss metric for Prometheus when cgroups v2 is used; when cgroups is set to v1, everything is scraped correctly. I did this research on 3 different independent Nomad clusters, so I can say this looks like a bug. @tgross, do I need to create a separate issue just for that problem? And when, approximately, can we expect a fix?
This can be quite deceptive and at worst can hide real memory issues/leaks if you rely on Nomad's reported memory metrics.
Since the memory usage is consistently over-reported, people begin to ignore 99% memory usage as "normal". However, in the case of a memory leak, real memory usage will be creeping up and you will be blind to it, unless you also gather metrics directly from docker.
> Nomad doesn't collect the nomad.client.allocs.memory.rss metric for Prometheus when cgroups v2 is used; when cgroups is set to v1, everything is scraped correctly. I did this research on 3 different independent Nomad clusters, so I can say this looks like a bug. @tgross, do I need to create a separate issue just for that problem? And when, approximately, can we expect a fix?
> unless you also gather metrics directly from docker.
... and then do math on those metrics because Docker doesn't expose RSS either. So you can do that same math on metrics you get from Nomad. Neither of which is a good solution of course, which is why this is on the roadmap and not a "wontfix" :grinning:
Just for clarity, the code part of patching this is trivial, just a couple bits of arithmetic and bounds checking. But the only thing worse than missing metrics is giving wrong metrics.
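To show what I mean by trivial, the patch would boil down to something like this (a sketch, not the actual fix):

```go
// approxRSS does the "trivial" arithmetic: subtract the non-RSS components
// from total usage, with a bounds check so we never underflow. The hard part,
// discussed next, is knowing exactly which components belong in nonRSS under
// cgroups v2.
func approxRSS(usage uint64, nonRSS ...uint64) uint64 {
	var sum uint64
	for _, v := range nonRSS {
		sum += v
	}
	if sum > usage {
		return 0
	}
	return usage - sum
}
```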
The kernel documentation for cgroups v2 doesn't explicitly say which components make up memory.current (the value which Docker returns as Usage). Notice in those docs that memory.rss is not a value we can get directly from the kernel anymore under cgroups v2. We need to make sure we're subtracting all the components that aren't RSS in order to calculate RSS correctly. Otherwise we'd return the wrong RSS value.
Because it's not in the docs, that likely means doing a little spelunking in the kernel code, and that's the work that needs to be done to make this happen.
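For anyone who wants to poke at this directly in the meantime, the raw numbers live in the cgroup itself. Here's a sketch that reads a cgroup v2 memory.stat and pulls out the anon counter, the closest analogue to RSS there; the path and the choice of field are assumptions, and verifying them is exactly the spelunking mentioned above:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readAnon scans a cgroup v2 memory.stat file and returns the "anon" counter.
// Whether anon alone is the right definition of RSS is exactly the open
// question above, so treat this as an experiment, not a fix.
func readAnon(statPath string) (uint64, error) {
	f, err := os.Open(statPath)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 2 && fields[0] == "anon" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("anon not found in %s", statPath)
}

func main() {
	// Path is hypothetical; substitute the container's actual cgroup path.
	v, err := readAnon("/sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.stat")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("anon bytes:", v)
}
```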
What we ended up doing is, instead of relying on Nomad's memory metrics, we now rely on Datadog's docker.mem.in_use metric, which conveniently also gets automatically tagged with the relevant Nomad tags (job, task, etc.). More importantly, it appears accurate.
Out of interest, since the DD agent is open source, I went digging to see how they collect it. They must have run into the same issue, since they don't fetch the memory stats from docker, but rather from the cgroup.
> Out of interest, since the DD agent is open source, I went digging to see how they collect it. They must have run into the same issue, since they don't fetch the memory stats from docker, but rather from the cgroup.
Which raises another possibility to fix this, which is to get it done upstream in Docker. But I suspect we have this same problem with metrics for exec drivers, etc., so it'd be nice to fix it here for sure.
Nomad version
Nomad v1.4.3 (f464aca721d222ae9c1f3df643b3c3aaa20e2da7)
Operating system and Environment details
NAME="Ubuntu" VERSION="20.04.5 LTS (Focal Fossa)"
Issue
The memory use reported for Docker containers is not shown correctly. I think the issue comes from: https://github.com/moby/moby/issues/10824#issuecomment-292778896
Reproduction steps
Create a Docker job with an app that creates and reads a lot of files, like an image cache app.
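For example, a minimal stand-in app could look like this (hypothetical; any workload that touches lots of file data will do):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical stand-in for the image cache app: it writes and re-reads many
// files so the kernel page cache grows while the process's own RSS stays
// small, which makes the over-reporting easy to see.
func main() {
	dir := os.TempDir()
	buf := make([]byte, 1<<20) // 1 MiB per file
	for i := 0; i < 512; i++ {
		path := filepath.Join(dir, fmt.Sprintf("cachefile-%d", i))
		if err := os.WriteFile(path, buf, 0o644); err != nil {
			panic(err)
		}
		if _, err := os.ReadFile(path); err != nil {
			panic(err)
		}
	}
	select {} // keep the container running so the stats can be observed
}
```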
Expected Result
Show correct memory utilization
Actual Result
It shows memory including the cache.