Hi @shantanugadgil!
Most of the time when folks report high/climbing memory usage on the servers, it's because of evaluations piling up from large batch workloads that aren't getting GC'd. But if that were the case here I'd expect to see the memory increase across all the servers (not just the leader), and `nomad system gc` would help.
I would definitely expect the leader to use moderately more memory than the other servers, because it does quite a bit more work. It has long-running processes that start up at leader transition, so I'd expect a spike right away. But as you've noted, the memory isn't being yielded back to the OS a while after the leader transition, either.
So all that suggests that we've got a leak. Given that you can reproduce this reliably, can you generate the following profiles for me from the leader and one of the servers that has stepped down from leadership for a little while (at least a few minutes)?
- `/debug/pprof/heap`
- `/debug/pprof/goroutine`
If you don't have `enable_debug` on, you can use the Agent Runtime Profiles API instead; just make sure you're passing the right `server_id` param.
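(If it's easier to script than curl, here's a minimal Go sketch of pulling those two profiles through the agent API. It assumes the default local HTTP address, no TLS, and no ACL token; the server name is just a placeholder.)

```go
// profile_grab.go: minimal sketch for pulling pprof profiles from a Nomad
// agent's HTTP API. Assumes http://localhost:4646, no TLS, no ACL token.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

// fetchProfile downloads one profile and writes it to "<serverID>.<profile>.prof".
func fetchProfile(agentAddr, profile, serverID string) error {
	// e.g. GET http://localhost:4646/v1/agent/pprof/heap?server_id=nomadserver-1.us-west-2
	u := fmt.Sprintf("%s/v1/agent/pprof/%s?server_id=%s",
		agentAddr, profile, url.QueryEscape(serverID))

	resp, err := http.Get(u)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("%s: %s: %s", profile, resp.Status, string(body))
	}

	out, err := os.Create(fmt.Sprintf("%s.%s.prof", serverID, profile))
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	// Run this for each server of interest; the server_id values should match
	// the names shown by `nomad server members`.
	for _, p := range []string{"heap", "goroutine"} {
		if err := fetchProfile("http://localhost:4646", p, "nomadserver-1.us-west-2"); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
}
```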
You can tar/zip those up and send them to nomad-oss-debug@hashicorp.com. I'll do an analysis on those and report back here. I'll also see if I can repro here, but it's possible there's an environment-specific or feature-specific leak so it'd be best if we can get those profiles. Thanks!
Until I am able to get the memory dump, I'll post a graph covering a month from a different (bigger) PROD environment.
NOTE: the node names are the same in this other region, but as you can see they are much bigger machines :cry: :money_with_wings:
@tgross I got time to look at the Agent Runtime Profiles API. I think this may not work as-is.
We have not specified `enable_debug` in the configuration file (so it would be unset). Also, we do not use ACLs.
After a bit of confusion, and then squinting hard at the docs, I think `enable_debug` would be needed at the least on the servers?
You don't need `enable_debug` for the Agent Runtime Profiles API (that's why it's normally ACL-protected; `enable_debug` enables the native golang pprof endpoint w/o protection), but if you don't have ACLs enabled (:grimacing:) it's just not protected. You should be able to hit the endpoint described there and pass the appropriate server ID parameter.
> You don't need `enable_debug` for the Agent Runtime Profiles API (that's why it's normally ACL-protected; `enable_debug` enables the native golang pprof endpoint w/o protection), but if you don't have ACLs enabled (:grimacing:) it's just not protected. You should be able to hit the endpoint described there and pass the appropriate server ID parameter.
Sorry, but I am confused.
We do not have ACLs, i.e. there is no `acl` block at all. The `enable_debug` setting is also NOT mentioned anywhere in the configs.
[root]@[nomadserver-1]
$ grep -r debug /etc/nomad.d/
[root]@[nomadserver-1]
$ grep -r acl /etc/nomad.d/
Do you mean that I can somehow get profiling information without doing anything extra here?
When I try the following, I get a `Permission denied` error.
ref: https://developer.hashicorp.com/nomad/api-docs/agent#sample-request-9
[ec2-user]@[nomadserver-1]
$ nomad server members
Name Address Port Status Leader Raft Version Build Datacenter Region
nomadserver-1.us-west-2 aa.bb.cc.ddd 4648 alive false 3 1.6.1 nomadserver us-west-2
nomadserver-2.us-west-2 aa.bb.cc.ddd 4648 alive false 3 1.6.1 nomadserver us-west-2
nomadserver-3.us-west-2 aa.bb.cc.ddd 4648 alive true 3 1.6.1 nomadserver us-west-2
[ec2-user]@[nomadserver-1]
$ curl -O http://localhost:4646/v1/agent/pprof/goroutine?server_id=nomadserver-1.us-west-2
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 17 100 17 0 0 22636 0 --:--:-- --:--:-- --:--:-- 17000
[ec2-user]@[nomadserver-1]
$ cat goroutine
Permission denied
Also, although off-topic ... I think the "/profile" and "/trace" example commands are incorrect ... shouldn't it be `/pprof/profile` and `/pprof/trace`?
In case some extra configuration is needed, what would need to be done? I don't foresee us enabling ACLs. Maybe we could set `enable_debug` in a lower environment temporarily?
Ah I think you're right. I was misreading the chart at https://developer.hashicorp.com/nomad/api-docs/agent#default-behavior where it very clearly has as the first row:
| Endpoint | `enable_debug` | ACLs | Available? |
|---|---|---|---|
| /v1/agent/pprof | unset | n/a | no |
Sorry about that. So yes, you'll need to set `enable_debug` to make this work.
> Also, although off-topic ... I think the "/profile" and "/trace" example commands are incorrect ... shouldn't it be /pprof/profile and /pprof/trace?
:facepalm: Oops, yes! I'll try to get a PR to fix that shortly. (Edit: https://github.com/hashicorp/nomad/pull/18142)
I have added `enable_debug = true` in the configuration of the three servers and restarted the agents.
I have confirmed that the previously failing commands now succeed and generate the expected pprof data.
As I have restarted the agents, the symptom will take about 14 days to manifest itself.
> As I have restarted the agents, the symptom will take about 14 days to manifest itself.
Fortunately we don't need to wait until the memory usage is super high, only that it's increased. If you were to take the profiles now and then again, say, 24h from now, we should definitely see a difference between the two (because your graphs show an increase of several GB a day). That should be enough to debug what's happening here.
> You can tar/zip those up and send them to nomad-oss-debug@hashicorp.com. I'll do an analysis on those and report back here. I'll also see if I can repro here, but it's possible there's an environment-specific or feature-specific leak so it'd be best if we can get those profiles. Thanks!
Done, sent some logs.
Received. As you noted in the email, these are just the goroutines though, so still incomplete. The leader (server-1) has almost the same number of goroutines in the Aug 5 profile as in the Aug 7 profile (635 vs 641), and that small variation is just servicing RPC requests.
The total number feels unusually high for a server, but what immediately jumps out at me is that these servers have goroutines in the client allocrunner code paths even though they're running server workloads. So these are both servers and clients! We strongly recommend against this because your workloads can starve the server.
So there's no goroutine leak here. I'm eagerly awaiting those heap profiles to see what we find there.
I didn't think to ask this before because I expected Nomad to be the only major application on the host, but have you confirmed it's actually the Nomad process that's using up the memory and not something else on the box?
(Post-edit remark: I originally had nonsense here about the `artifact` block, but I had pointed `go tool pprof` to the wrong binary and that gave me misleading results.)
By the way, if you want to see what I'm seeing with these profiles, you can use the analysis tools in the Go toolchain as follows: `go tool pprof -trim=false -web /path/to/your/nomad/binary /path/to/2023-08-07-14-08-40.nomadserver-2.goroutine`. That'll bring up a web page that breaks down which goroutines are doing what. If you omit `-trim=false`, it'll end up hiding most of the client code path nodes, which is misleading here.
A command like `go tool pprof -base=/path/to/previous/profile -web /path/to/nomad/binary /path/to/new/profile` will show you the difference between two goroutine profiles. So for example the server-1 profile diff looks like this (cropped for legibility):
And that shows that the diff is just RPC handlers, as expected.
> The total number feels unusually high for a server, but what immediately jumps out at me is that these servers have goroutines in the client allocrunner code paths even though they're running server workloads. So these are both servers and clients! We strongly recommend against this because your workloads can starve the server.
The Nomad servers are in a separate `datacenter`. There is one-and-only-one job running on the servers, which does "free memory checking" to force-restart `nomad.service`.
Making the servers also "clients" is a very recent change in our setup (about 2 (two) months ago). Server memory climbing like this is something we have observed on our setup since forever (since about version 1.3.1 I think, not exactly sure).
Although we have Slack alerts from DataDog, we need to guarantee that the machine does not become "OOM" in case no one reacts to the alert. Hence, the local "mem free checker" job.
> I didn't think to ask this before because I expected Nomad to be the only major application on the host, but have you confirmed it's actually the Nomad process that's using up the memory and not something else on the box?
Yes, Nomad is the only major application on this machine. In the next set of logs, I'll send `ps` output as well.
@shantanugadgil I've received the new set of profiles... but none of these are the heap profile that I asked you for last week. You've collected and sent the `goroutine`, the `cmdline`, the `profile` (which is CPU), and the `trace` (an execution trace). We need `heap` if we want to look at the memory usage of the heap.
(Aside, if you hit `/debug/pprof/` on an instance with `enable_debug=true`, you can see the Go stdlib's top-level page describing all these profiles.)
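(For what it's worth, what `enable_debug` exposes is the Go runtime's standard `net/http/pprof` handlers. The sketch below isn't Nomad's code, just the stock stdlib pattern, so you can see where profiles like `heap` and `goroutine` come from in any Go program; Nomad's own wiring differs a bit, but the profile names are the same.)

```go
// Not Nomad's code: the plain Go pattern behind a /debug/pprof/ page.
// Importing net/http/pprof registers /debug/pprof/ (index), /debug/pprof/heap,
// /debug/pprof/goroutine, /debug/pprof/profile, /debug/pprof/trace, etc.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers the /debug/pprof/* handlers
)

func main() {
	// Visit http://localhost:6060/debug/pprof/ for the index page that lists
	// all available profiles with short descriptions.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```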
Ok, got that set of new profiles, thanks @shantanugadgil! What I can tell from the heap and goroutine profiles above is mostly what we're *not* seeing. What does stand out is a surprising amount of heap rooted in `publishJobSummaryMetrics`, with nearly half of that in an inner function of Prometheus called `NewDesc`. I'll need to do some digging there.
The next step will be to take the profiles again for the current leader `nomadserver-1.us-west-2`. Wait at least 24h, but take them before it gets restarted. That'll give us enough of a window between the two to be able to see any differences and hopefully be able to put this to rest.
102 client nodes (excluding the header line, and the 3 servers)
$ nomad node status | wc -l
106
For "reasons" we have BOTH statsd
and prometheus
metrics enabled on all nodes, the same seems to have come into play for servers as well.
I just noticed that publish_allocation_metrics = false
in the config. Quite odd, this needs to be fixed.
_Once these experiments are done, I could disable the prometheus metrics on the servers. For now leaving the servers alone for the next profile collection._
$ cat /etc/nomad.d/telemetry.hcl
telemetry {
publish_allocation_metrics = false
publish_node_metrics = true
# reduce cardinality (?)
disable_hostname = true
collection_interval = "10s"
datadog_address = "127.0.0.1:8125"
prometheus_metrics = true
}
Ok, I've had a chance to examine the before and after profiles.
tl;dr you may want to try telemetry.disable_dispatched_job_summary_metrics=true
As you noted, the resident set size is increasing over time. If we look at the leader in the first bundle we see a resident size of 1.4G:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2508 root 20 0 3271320 1.4g 171432 S 0.0 4.7 255:34.73 /usr/sbin/nomad agent -config=/etc/nomad.d/
And after it's 1.6G, confirming that we have a ~200M growth:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2508 root 20 0 3610812 1.6g 172180 S 6.7 5.2 337:14.48 /usr/sbin/nomad agent -config=/etc/nomad.d/
If we look at the goroutine profiles with `go tool pprof`, I see only the expected minimal differences between them as yamux streams get created and destroyed. So nothing of interest there.
The major difference I can see is in the heap profiles. As I noted earlier, there's a surprising amount of memory being used in the go-metrics library, specifically in the Prometheus sink and the children of its `SetGaugeWithLabels` method. In the "before" profile, there's ~650MB of heap memory rooted in go-metrics.
In the "after" profile, there's ~787MB of heap memory rooted in go-metrics.
That's a difference of 137MB, and if I take a diff of the two profiles, I can see that it makes up the meaningful part of the diff.
If I go further and use the `-trim=false` flag to get the full diff, then I can see the remainder of the positive increase appears to be a few more Allocations in the state store and a little bit of heap supporting the couple of extra in-flight yamux connections. So the suspicion I had the other day that the Prometheus sink was involved seems to hold.
If we look at the non-leader servers, neither of them is showing such a large chunk of heap in go-metrics. Non-leader servers do emit metrics with `SetGaugeWithLabels`, as do the clients. That suggests this isn't a memory leak in the library. The leader does have a subset of metrics that it emits for the eval broker and for job summaries.
The job summary metrics (ref `leader.go#L1268`) look like they could end up being very high in cardinality, especially if you use a lot of dispatch jobs. There's a telemetry configuration option `disable_dispatched_job_summary_metrics` that looks like it's designed to avoid this:
> `disable_dispatched_job_summary_metrics` `(bool: false)` - Specifies if Nomad should ignore jobs dispatched from a parameterized job when publishing job summary statistics. Since each job has a small memory overhead for tracking summary statistics, it is sometimes desired to trade these statistics for more memory when dispatching high volumes of jobs.
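To make the cardinality point concrete, here's an illustrative sketch (not Nomad's actual code; the metric key and labels are made up for the example) of how a label carrying a per-child job ID turns every dispatch/periodic run into its own series in a labeled-gauge sink:

```go
// Illustrative only -- not Nomad's actual code. Shows why per-child-job labels
// explode cardinality in a labeled-gauge sink like the Prometheus one.
package main

import (
	"fmt"
	"time"

	metrics "github.com/armon/go-metrics"
)

func main() {
	// An in-memory sink stands in for the Prometheus sink so the sketch stays
	// self-contained; the labeling behavior is what matters here.
	sink := metrics.NewInmemSink(10*time.Second, time.Minute)
	metrics.NewGlobal(metrics.DefaultConfig("nomad"), sink)

	// A periodic job that fires every minute creates a new child job ID each
	// run (e.g. "mybatch/periodic-1692033600"). If the summary gauge is
	// labeled with that child ID, each run becomes a brand-new series that
	// the sink has to keep tracking.
	for i := 0; i < 3; i++ {
		childID := fmt.Sprintf("mybatch/periodic-%d", time.Now().Unix()+int64(60*i))
		metrics.SetGaugeWithLabels(
			[]string{"job_summary", "complete"},
			1,
			[]metrics.Label{{Name: "job", Value: childID}},
		)
	}
	// Labeling by the parent ID instead ("mybatch") would collapse all of
	// these into a single series.
}
```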
> Ok, I've had a chance to examine the before and after profiles.
>
> tl;dr you may want to try `telemetry.disable_dispatched_job_summary_metrics=true`
@tgross I'll try the above setting, but we do not use a lot of `dispatch` jobs (i.e. `parameterized`).
We do use a lot of `periodic` and `system` jobs.
`system` jobs are things like vector, node_exporter, etc., which run per ASG (we have quite a few ASGs).
`periodic` jobs are the actual workload jobs.
I might even go ahead and disable `prometheus` on the servers. It currently achieves nothing, and having that enabled on the servers seems like an oversight.
> but we do not use a lot of `dispatch` jobs (i.e. `parameterized`)
>
> we do use a lot of `periodic` and `system` jobs.
Oh, I'm realizing that the logic we have for `disable_dispatched_job_summary_metrics` doesn't apply to periodic jobs, but the same reason for it to exist applies. The number of distinct job names can explode rapidly for periodic jobs too, especially if they're scheduled frequently. We probably should be applying this logic not just when `job.Dispatched = true` but any time `job.ParentID != ""`. It's a small fix, but let me ask around with my colleagues to see if anyone knows why it wasn't done that way in the first place.
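Roughly the shape of the change I mean, as a standalone sketch rather than the real leader.go code (the struct here just stands in for the relevant job fields):

```go
// Paraphrased sketch, not the actual leader.go code. jobMeta stands in for
// the fields of the real job struct that matter for this check.
package main

import "fmt"

type jobMeta struct {
	ID         string
	ParentID   string // set for both dispatched and periodic children
	Dispatched bool   // set only for dispatched (parameterized) children
}

// Current behavior (roughly): only dispatched children are skipped.
func skipToday(j jobMeta, disabled bool) bool { return disabled && j.Dispatched }

// Proposed behavior: skip any child job, periodic children included.
func skipProposed(j jobMeta, disabled bool) bool { return disabled && j.ParentID != "" }

func main() {
	periodicChild := jobMeta{ID: "mybatch/periodic-1692033600", ParentID: "mybatch"}
	fmt.Println(skipToday(periodicChild, true))    // false: still publishes a per-run series
	fmt.Println(skipProposed(periodicChild, true)) // true: collapses under the parent
}
```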
I have gone ahead and made the following changes to the server configs and restarted the servers.
prometheus_metrics = false
disable_dispatched_job_summary_metrics = true
About `periodic` jobs ... yes, we have a lot of jobs which run every minute or every two minutes.
If there are any changes being planned to squish metrics, do consider how the failing/pending/queued periodic jobs can be tracked going forward.
> If there are any changes being planned to squish metrics, do consider how the failing/pending/queued periodic jobs can be tracked going forward.
Yeah I don't love that we throw out those metrics entirely when `disable_dispatched_job_summary_metrics = true`, and that would happen to periodic jobs.
We could conceivably merge all the summaries from all the dispatch/periodic instances with the same parent ID, which feels like what most people are going to want. But we'd need to keep a `JobSummary` for each dispatch/periodic parent for the whole time we're iterating over the job summaries to generate these stats, and then throw it away. Conceivably that could be 1000s of extra `JobSummary` objects, so I'd want to do a little benchmarking here.
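To sketch what that merge might look like, and where the transient per-parent memory comes from (the types below are hypothetical stand-ins, not the real `JobSummary`):

```go
// Hypothetical sketch of merging child-job summaries under their parent ID.
// The types are stand-ins, not Nomad's real JobSummary / TaskGroupSummary.
package main

import "fmt"

type summary struct {
	Running, Complete, Failed int
}

type childJob struct {
	ID       string
	ParentID string
	Sum      summary
}

func mergeByParent(children []childJob) map[string]summary {
	// One aggregate per parent lives in memory only for the duration of the
	// iteration -- this is the transient cost that would need benchmarking
	// when there are thousands of parents.
	agg := make(map[string]summary)
	for _, c := range children {
		key := c.ParentID
		if key == "" {
			key = c.ID // non-child jobs keep their own series
		}
		s := agg[key]
		s.Running += c.Sum.Running
		s.Complete += c.Sum.Complete
		s.Failed += c.Sum.Failed
		agg[key] = s
	}
	return agg
}

func main() {
	children := []childJob{
		{ID: "mybatch/periodic-1", ParentID: "mybatch", Sum: summary{Complete: 1}},
		{ID: "mybatch/periodic-2", ParentID: "mybatch", Sum: summary{Failed: 1}},
		{ID: "standalone", Sum: summary{Running: 1}},
	}
	for parent, s := range mergeByParent(children) {
		// One gauge set per parent instead of one per child run.
		fmt.Printf("%s: %+v\n", parent, s)
	}
}
```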
> We could conceivably merge all the summaries from all the dispatch/periodic instances with the same parent ID, which feels like what most people are going to want.
Yes, this sounds fair. Tracking via `parent_id` is fine for `periodic` jobs.
Though for `parameterized` jobs, I wonder if the parameters (meta) themselves could somehow be exposed as `tags` to determine which type of job is doing what.
... and I am not even considering `parameterized` + `periodic` jobs :woozy_face:
> though for `parameterized` jobs I wonder if the parameters (meta) themselves could somehow be exposed as `tags` to determine which type of job is doing what.
Not without blowing up the cardinality again, which is what we're trying to avoid here.
I'm going to rename this issue for clarity and try to carve out a little bit of time to experiment with some options soon. But just a heads up that I'm headed into a vacation until September, so I'll have to pick this back up then. :grinning:
I think it is the prometheus telemetry metrics setting and not the periodic job metrics.
For 5 (five) days now, I have the following setting in the Nomad server configuration and the graph is shown below.
The changes were made as per this comment: https://github.com/hashicorp/nomad/issues/18113#issuecomment-1679494136
telemetry {
publish_allocation_metrics = false
publish_node_metrics = true
# reduce cardinality (?)
disable_hostname = true
collection_interval = "10s"
datadog_address = "127.0.0.1:8125"
prometheus_metrics = false
disable_dispatched_job_summary_metrics = true
}
Note the steady increase during the time I was sending debug logs, and no memory increase since I switched off `prometheus_metrics`.
I don't think periodic jobs had anything to do with the problem. :man_shrugging:
Same observation on a "smaller prod" environment. Debug was NOT enabled here ... only `prometheus_metrics` was changed to `false`.
`disable_dispatched_job_summary_metrics` is NOT mentioned in the config files in the "smaller prod" env.
Thanks @shantanugadgil. I suspect the Prometheus collector doesn't age out the metrics we're collecting for those periodic jobs the way I'd expect.
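For reference, what I mean by aging out: I believe the go-metrics Prometheus sink only drops a labeled series after a configurable expiration window with no updates. Below is a hedged sketch of wiring that option up; the option and field names reflect my reading of the library, so they're worth double-checking against the go-metrics version Nomad actually vendors.

```go
// Hedged sketch of the go-metrics Prometheus sink expiration option.
// Double-check PrometheusOpts against the go-metrics version Nomad vendors.
package main

import (
	"fmt"
	"time"

	metrics "github.com/armon/go-metrics"
	prommetrics "github.com/armon/go-metrics/prometheus"
)

func main() {
	// Expiration controls how long a labeled series (e.g. a per-periodic-run
	// job_summary gauge) lingers in the sink after its last update. If series
	// are never aged out, gauges for completed periodic children accumulate
	// and the leader's heap grows with every run.
	sink, err := prommetrics.NewPrometheusSinkFrom(prommetrics.PrometheusOpts{
		Expiration: 60 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	if _, err := metrics.NewGlobal(metrics.DefaultConfig("nomad"), sink); err != nil {
		panic(err)
	}
	fmt.Println("prometheus sink configured with a 60s expiration")
}
```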
@tgross should the title be renamed to something about `prometheus` then, rather than periodic jobs? Just wondering.
Sure, done. I don't have a fix for this at the moment or the time to dig in further right now, but I'll mark this for roadmapping.
The memory leak seems to have come back in some form as part of version 1.7.2.
Sending logs to nomad-oss-debug@hashicorp.com
Nomad version
Nomad v1.6.1 BuildDate 2023-07-21T13:49:42Z Revision 515895c7690cdc72278018dc5dc58aca41204ccc
(this issue has always existed in all previous versions)
Operating system and Environment details
Amazon Linux 2
Issue
The server leader's memory keeps increasing with no sign of it flattening. We have to periodically restart the leader to counter this problem.
Reproduction steps
You should be able to observe a graph similar to the following in your environment.
Expected Result
Server leader's memory should stay under control.
We run `nomad system gc` and `nomad system reconcile summaries` every hour, but that does not help either.
Actual Result
Leader memory keeps increasing
Job file (if appropriate)
N/A
Nomad Server logs (if appropriate)
N/A
Nomad Client logs (if appropriate)
N/A
FWIW, there are many other memory-related issues, but they don't mention "server leader" specifically, hence creating a new issue.
Example: https://github.com/hashicorp/nomad/issues/17789#issuecomment-1638239984