hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

prometheus metrics for job summaries cause ever-increasing memory on leader #18113

Open shantanugadgil opened 1 year ago

shantanugadgil commented 1 year ago

Nomad version

Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c7690cdc72278018dc5dc58aca41204ccc

(this issue has always existed in all previous versions)

Operating system and Environment details

Amazon Linux 2

Issue

The server leader's memory keeps increasing with no sign of flattening out. We have to periodically restart the leader to counter this problem.

Reproduction steps

You should be able to observe a graph similar to the following in your environment.

  1. The leader changed within the cluster, but notice that the previous leader's memory never drops. The new leader's memory also starts climbing rapidly.
  2. The previous leader's memory never drops; it keeps increasing gradually.

[image: graph of server leader memory usage climbing over time]

Expected Result

Server leader's memory should stay under control.

We run nomad system gc and nomad system reconcile summaries every hour, but that does not help either.

Actual Result

Leader memory keeps increasing

Job file (if appropriate)

N/A

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

N/A

FWIW, there are many other memory-related issues, but they don't mention "server leader" specifically, hence this new issue.

Example: https://github.com/hashicorp/nomad/issues/17789#issuecomment-1638239984

tgross commented 1 year ago

Hi @shantanugadgil!

Most of the time when folks report high/climbing memory usage on the servers it's because of evaluations piling up from large batch workloads that aren't getting GC'd. But if that were the case here I'd expect to see the memory increase across all the servers (not just the leader) and nomad system gc would help.

I would definitely expect the leader to use moderately more memory than other servers, because it does quite a bit more work. It has long-running processes that start up at leader transition, so I'd expect there to be a spike right away. But as you've noted, the memory isn't being yielded back to the OS a while after the leader transition, either.

So all that suggests that we've got a leak. Given that you can reproduce this reliably, can you generate the following profiles for me from the leader and one of the servers that has stepped down from leadership for a little while (at least a few minutes)?

If you don't have enable_debug on you can use the Agent Runtime Profiles API instead, just make sure you're passing the right server_id param.

You can tar/zip those up and send them to nomad-oss-debug@hashicorp.com. I'll do an analysis on those and report back here. I'll also see if I can repro here, but it's possible there's an environment-specific or feature-specific leak so it'd be best if we can get those profiles. Thanks!

shantanugadgil commented 1 year ago

Until I am able to get the memory dump, I'll post a graph covering one month from a different (bigger) PROD environment.

NOTE: the node names are the same in this different region, but as you can see they are much bigger machines :cry: :money_with_wings:

[image: one-month memory graph from the larger PROD environment]

shantanugadgil commented 1 year ago

@tgross I got time to look at the agent runtime profiles API.

I think this may not work as-is.

We have not specified enable_debug in the configuration file (so it would be unset). Also, we do not use ACLs.

After a bit of confusion, and then squinting hard at the docs, I think enable_debug would be needed, at the least on the servers?

tgross commented 1 year ago

You don't need enable_debug for the Agent Runtime Profiles API (that's why it's normally ACL-protected, enable_debug enables the native golang pprof endpoint w/o protection), but if you don't have ACLs enabled (:grimacing:) it's just not protected. You should be able to hit the endpoint described there and pass the appropriate server ID parameter.

shantanugadgil commented 1 year ago

You don't need enable_debug for the Agent Runtime Profiles API (that's why it's normally ACL-protected, enable_debug enables the native golang pprof endpoint w/o protection), but if you don't have ACLs enabled (grimacing) it's just not protected. You should be able to hit the endpoint described there and pass the appropriate server ID parameter.

Sorry, but I am confused.

We do not have ACLs, i.e. there is no acl block at all. The enable_debug setting is also NOT mentioned anywhere in the configs.


[root]@[nomadserver-1]
$ grep -r debug /etc/nomad.d/

[root]@[nomadserver-1]
$ grep -r acl /etc/nomad.d/

Do you mean that I can somehow get profiling information without doing anything extra here?

When I try the following, I get a Permission denied error.

ref: https://developer.hashicorp.com/nomad/api-docs/agent#sample-request-9


[ec2-user]@[nomadserver-1]
$ nomad server members
Name                     Address       Port  Status  Leader  Raft Version  Build  Datacenter   Region
nomadserver-1.us-west-2  aa.bb.cc.ddd  4648  alive   false   3             1.6.1  nomadserver  us-west-2
nomadserver-2.us-west-2  aa.bb.cc.ddd  4648  alive   false   3             1.6.1  nomadserver  us-west-2
nomadserver-3.us-west-2  aa.bb.cc.ddd  4648  alive   true    3             1.6.1  nomadserver  us-west-2

[ec2-user]@[nomadserver-1]
$ curl -O http://localhost:4646/v1/agent/pprof/goroutine?server_id=nomadserver-1.us-west-2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    17  100    17    0     0  22636      0 --:--:-- --:--:-- --:--:-- 17000

[ec2-user]@[nomadserver-1]
$ cat goroutine 
Permission denied

Also, although off-topic ... I think the "/profile" and "/trace" example commands are incorrect ... shouldn't it be /pprof/profile and /pprof/trace?

In case some extra configuration is needed, what would need to be done?

I don't foresee us enabling ACLs. Maybe we could enable_debug in a lower environment temporarily?

tgross commented 1 year ago

Ah I think you're right. I was misreading the chart at https://developer.hashicorp.com/nomad/api-docs/agent#default-behavior where it very clearly has as the first row:

Endpoint          enable_debug   ACLs   Available?
/v1/agent/pprof   unset          n/a    no

Sorry about that. So yes, you'll need to set enable_debug to make this work.
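
For reference, a minimal sketch of how to do that (this assumes the /etc/nomad.d/ config directory shown above and a systemd-managed agent; the drop-in file name is just illustrative):

# enable_debug is a top-level agent configuration option
echo 'enable_debug = true' | sudo tee /etc/nomad.d/zz-debug.hcl
sudo systemctl restart nomad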

Also, although off-topic ... I think the "/profile" and "/trace" example commands are incorrect ... shouldn't it be /pprof/profile and /pprof/trace?

:facepalm: Oops, yes! I'll try to get a PR to fix that shortly. (Edit: https://github.com/hashicorp/nomad/pull/18142)

shantanugadgil commented 1 year ago

I have added enable_debug = true in the configuration of the three servers and restarted the agents.

I have confirmed that the previously failing commands now succeed, and generate expected pprof data.

As I have restarted the agents, the symptom will take about 14 days to manifest itself.

tgross commented 1 year ago

As I have restarted the agents, the symptom will take about 14 days to manifest itself.

Fortunately we don't need to wait until the memory usage is super high, only until it has increased. If you were to take the profiles now and then again, say, 24h from now, we should definitely see a difference between the two (because your graphs show an increase of several GB a day). That should be enough to debug what's happening here.
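
As a concrete sketch (assuming the default HTTP address, the server names from your nomad server members output, and illustrative file names), running something like this now and again ~24h later would give us a pair to diff:

# goroutine and heap profiles for a given server, collected via the agent API
curl -o "$(date +%F-%H-%M-%S).nomadserver-1.goroutine" \
  "http://localhost:4646/v1/agent/pprof/goroutine?server_id=nomadserver-1.us-west-2"
curl -o "$(date +%F-%H-%M-%S).nomadserver-1.heap" \
  "http://localhost:4646/v1/agent/pprof/heap?server_id=nomadserver-1.us-west-2"

Repeat for the other servers, then tar/zip the files up and send them over as before.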

shantanugadgil commented 1 year ago

You can tar/zip those up and send them to nomad-oss-debug@hashicorp.com. I'll do an analysis on those and report back here. I'll also see if I can repro here, but it's possible there's an environment-specific or feature-specific leak so it'd be best if we can get those profiles. Thanks!

Done, sent some logs.

tgross commented 1 year ago

Received. As you noted in the email, these are just the goroutines though, so still incomplete. The leader (server-1) has almost the same number of goroutines in the Aug 5 profile as the Aug 7 profile (635 vs 641), but that small variation is just servicing RPC requests.

The total number feels unusually high for a server, but... what immediately jumps out at me though is that these servers have goroutines in the client allocrunner code paths even though they're running server workloads. So these are both servers and clients! We strongly recommend against this because your workloads can starve the server.

So there's no goroutine leak here. I'm eagerly awaiting those heap profiles to see what we find there.

I didn't think to ask this before because I expected Nomad to be the only major application on the host, but have you confirmed it's actually the Nomad process that's using up the memory and not something else on the box?

(Post-edit remark: I originally had nonsense here about the artifact block but I had pointed go tool pprof to the wrong binary and that gave me misleading results.)

tgross commented 1 year ago

By the way if you want to see what I'm seeing with these profiles, you can use the analysis tools in the Go toolchain as follows: go tool pprof -trim=false -web /path/to/your/nomad/binary /path/to/2023-08-07-14-08-40.nomadserver-2.goroutine. That'll bring up a web page that breaks down what goroutines are doing what. If you omit -trim=false it'll end up hiding most of the client code path nodes, which is misleading here.

A command like go tool pprof -base=/path/to/previous/profile -web /path/to/nomad/binary /path/to/new/profile will show you the difference between two goroutine profiles. So for example the server-1 profile diff looks like this (cropped for legibility):

[image: pprof goroutine diff for server-1 (screenshot, 2023-08-07)]

And that shows that the diff is just RPC handlers, as expected.

shantanugadgil commented 1 year ago

The total number feels unusually high for a server, but... what immediately jumps out at me though is that these servers have goroutines in the client allocrunner code paths even though they're running server workloads. So these are both servers and clients! We strongly recommend against this because your workloads can starve the server.

The Nomad servers are in a separate datacenter. There is one-and-only-one job running on the servers which does "free memory checking" to force restart the nomad.service.

Making the servers also "clients" is a very recent change in our setup (about two months ago). Server memory climbing like this is something we have observed in our setup since forever (since about version 1.3.1, I think, but I'm not exactly sure).

Although we have Slack alerts from DataDog, we need to guarantee that the machine does not become "OOM" in case no one reacts to the alert. Hence, the local "mem free checker" job.

I didn't think to ask this before because I expected Nomad to be the only major application on the host, but have you confirmed it's actually the Nomad process that's using up the memory and not something else on the box?

Yes, Nomad is the only major application on this machine. In the next set of logs, I'll send ps output as well.

tgross commented 1 year ago

@shantanugadgil I've received the new set of profiles... but none of these are the heap profile that I asked you for last week. You've collected and sent the goroutine, the cmdline, the profile (which is CPU), and the trace (an execution trace). We need the heap profile if we want to look at heap memory usage.

(Aside, if you hit /debug/pprof/ on an instance with enable_debug=true, you can see the Go stdlib's top-level page for describing all these profiles.)
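
For example (assuming the default HTTP address), something like this should show that index and fetch the heap profile; note these stdlib endpoints only profile the local agent, whereas the /v1/agent/pprof/* endpoints accept a server_id param to target a specific server:

curl http://localhost:4646/debug/pprof/        # Go stdlib index of available profiles
curl -o local.heap http://localhost:4646/debug/pprof/heap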

tgross commented 1 year ago

Ok, got that set of new profiles, thanks @shantanugadgil! So, what I can tell from the heap and goroutine profiles above is what we're not seeing:

The next step will be to take the profiles again for the current leader nomadserver-1.us-west-2. Wait at least 24h, but take them before it gets restarted. That'll give us enough of a window between the two to be able to see any differences and hopefully put this to rest.

shantanugadgil commented 1 year ago

102 client nodes (excluding the header line and the 3 servers):

$ nomad node status | wc -l
106

For "reasons" we have BOTH statsd and prometheus metrics enabled on all nodes, the same seems to have come into play for servers as well.

I just noticed that publish_allocation_metrics = false is set in the config. Quite odd; this needs to be fixed.

_Once these experiments are done, I could disable the prometheus metrics on the servers. For now, I'm leaving the servers alone for the next profile collection._

$ cat /etc/nomad.d/telemetry.hcl 
telemetry {
  publish_allocation_metrics = false

  publish_node_metrics       = true

  # reduce cardinality (?)
  disable_hostname           = true

  collection_interval        = "10s"

  datadog_address            = "127.0.0.1:8125"
  prometheus_metrics         = true
}

tgross commented 1 year ago

Ok, I've had a chance to examine the before and after profiles.

tl;dr you may want to try telemetry.disable_dispatched_job_summary_metrics=true

As you noted, the resident set size is increasing over time. If we look at the leader in the first bundle we see a resident size of 1.4G:

PID  USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2508 root      20   0 3271320   1.4g 171432 S   0.0  4.7 255:34.73 /usr/sbin/nomad agent -config=/etc/nomad.d/

And after it's 1.6G, confirming that we have a ~200M growth:

PID  USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2508 root      20   0 3610812   1.6g 172180 S   6.7  5.2 337:14.48 /usr/sbin/nomad agent -config=/etc/nomad.d/

If we look at the goroutine profiles with go tool pprof, I see only the expected minimal differences between them as yamux streams get created and destroyed. So nothing of interest there.

The major difference I can see is in the heap profiles. As I noted earlier, there's a surprising amount of memory being used in the go-metrics library, specifically in the Prometheus sink and the children of its SetGaugeWithLabels method. In the "before" profile, there's ~650MB of heap memory rooted in go-metrics.

[image: "before" heap profile, ~650MB rooted in go-metrics]

In the "after" profile, there's ~787MB of heap memory rooted in go-metrics.

[image: "after" heap profile, ~787MB rooted in go-metrics]

That's a difference of 137MB, and if I take a diff of the two profiles, I can see that makes up the meaningful part of the diff.

[image: heap profile diff between "before" and "after"]

If I go further and use the -trim=false flag to get the full diff, then I can see the remainder of the positive increase appears to be a few more Allocations in the state store and a little bit of heap supporting the couple of extra in-flight yamux connections. So the suspicion I had the other day that the Prometheus sink was involved seems to hold.
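
For reference, this is the same workflow as the goroutine analysis above; the heap diff can be reproduced with something like the following (file names are illustrative, the binary path is from the top output above):

go tool pprof -trim=false -base=before.heap -web /usr/sbin/nomad after.heap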

If we look at the non-leader servers, neither of them is showing such a large chunk of heap in go-metrics. Non-leader servers do emit metrics with SetGaugeWithLabels, as do the clients. That suggests this isn't a memory leak in the library. The leader does have a subset of metrics that it emits for the eval broker and for job summaries.

The job summary metrics (ref leader.go#L1268) look like they could end up being very high cardinality, especially if you use a lot of dispatch jobs. There's a telemetry configuration option disable_dispatched_job_summary_metrics that looks like it's designed to avoid this:

disable_dispatched_job_summary_metrics (bool: false) - Specifies if Nomad should ignore jobs dispatched from a parameterized job when publishing job summary statistics. Since each job has a small memory overhead for tracking summary statistics, it is sometimes desired to trade these statistics for more memory when dispatching high volumes of jobs.

shantanugadgil commented 1 year ago

Ok, I've had a chance to examine the before and after profiles.

tl;dr you may want to try telemetry.disable_dispatched_job_summary_metrics=true

@tgross I'll try the above setting, but we do not use a lot of dispatch jobs (i.e. parameterized).

We do use a lot of periodic and system jobs.

System jobs like vector, node_exporter, etc., which run per ASG (we have quite a few ASGs). Periodic jobs are the actual workload jobs.

I might even go ahead and disable prometheus on the servers. It currently achieves nothing, and having that enabled on the servers seems like an oversight.

tgross commented 1 year ago

but we do not use a lot of dispatch jobs (i.e. parameterized)

we do use a lot of periodic and system jobs.

Oh, I'm realizing that the logic we have for disable_dispatched_job_summary_metrics doesn't apply to periodic jobs, but the same reason for it to exist applies: periodic jobs spawn child jobs with their own summaries just like dispatched jobs do. We probably should be applying this logic not just when job.Dispatched = true but any time job.ParentID != "". It's a small fix, but let me ask around with my colleagues to see if anyone knows why it wasn't done that way in the first place.

shantanugadgil commented 1 year ago

I have gone ahead and made the following changes to the server configs and restarted the servers.

  prometheus_metrics         = false

  disable_dispatched_job_summary_metrics = true

About periodic jobs ... yes, we have a lot of jobs that run every minute or every two minutes.

If there are any changes being planned to squish metrics do consider how the failing/pending/queued periodic jobs can be tracked going ahead.

tgross commented 1 year ago

If there are any changes being planned to squish metrics do consider how the failing/pending/queued periodic jobs can be tracked going ahead.

Yeah I don't love that we throw out those metrics entirely when disable_dispatched_job_summary_metrics = true, and that would happen to periodic jobs.

We could conceivably merge all the summaries from all the dispatch/periodic instances with the same parent ID, which feels like what most people are going to want. But we'd need to keep a JobSummary for each dispatch/periodic parent for the whole time we're iterating over the job summaries to generate these stats, and then throw it away. Conceivably that could be 1000s of extra JobSummary objects, so I'd want to do a little benchmarking here.

shantanugadgil commented 1 year ago

We could conceivably merge all the summaries from all the dispatch/periodic instances with the same parent ID, which feels like what most people are going to want.

Yes, this sounds fair. Tracking via parent_id is fine for periodic jobs.

though for parameterized jobs I wonder if the parameters (meta) themselves could somehow be exposed as tags to determine which type of job is doing what.

... and I am not even considering parameterized + periodic jobs :woozy_face:

tgross commented 1 year ago

though for parameterized jobs I wonder if the parameters (meta) themselves could somehow be exposed as tags to determine which type of job is doing what.

Not without blowing up the cardinality again, which is what we're trying to avoid here.

I'm going to rename this issue for clarity and try to carve out a little bit of time to experiment with some options soon. But just a heads up that I'm headed into a vacation until September, so I'll have to pick this back up then. :grinning:

shantanugadgil commented 1 year ago

I think it is the prometheus telemetry metrics setting that is the cause, not the periodic job metrics.

For 5 (five) days now, I have had the following settings in the Nomad server configuration, and the graph is shown below.

The changes were made as per this comment: https://github.com/hashicorp/nomad/issues/18113#issuecomment-1679494136

telemetry {
  publish_allocation_metrics = false

  publish_node_metrics       = true

  # reduce cardinality (?)
  disable_hostname           = true

  collection_interval        = "10s"

  datadog_address            = "127.0.0.1:8125"
  prometheus_metrics         = false

  disable_dispatched_job_summary_metrics = true
}

Note the steady increase during the time I was sending debug logs, and no memory increase since I switched off prometheus_metrics.

[image: leader memory graph, flat since prometheus_metrics was disabled]

I don't think periodic jobs had anything to do with the problem. :man_shrugging:

shantanugadgil commented 1 year ago

Same observation on a "smaller prod" environment. Debug was NOT enabled here ... only prometheus_metrics was changed to false.

disable_dispatched_job_summary_metrics is NOT mentioned in the config files in the "smaller prod" env.

[image: memory graph from the "smaller prod" environment]

tgross commented 1 year ago

Thanks @shantanugadgil. I suspect the Prometheus collector doesn't age out the metrics we're collecting for those periodic jobs the way I'd expect.

shantanugadgil commented 1 year ago

@tgross should the title be renamed to something about prometheus then, rather than periodic jobs? Just wondering.

tgross commented 1 year ago

Sure, done. I don't have a fix for this at the moment or the time to dig in further right now, but I'll mark this for roadmapping.

shantanugadgil commented 11 months ago

Some memory leak seems to have come back in some form as part of version 1.7.2.

[image: memory graph showing the increase again on 1.7.2]

Sending logs to nomad-oss-debug@hashicorp.com