hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

reduce high cardinality dimensions for allocation metrics #23373

Open jaloren opened 4 months ago

jaloren commented 4 months ago

Proposal

Nomad can produce fine-grained resource metrics for allocations. You need to opt in by setting the telemetry option publish_allocation_metrics. These metrics include the label "alloc_id", a UUID that changes frequently for numerous reasons (e.g., deploying a new version, rescheduling). This turns alloc_id into a high-cardinality dimension.
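
For context, opting in is a one-line change in the agent's telemetry block. A minimal sketch (the Prometheus option is shown only as a common pairing and isn't required):

```hcl
# Client agent configuration (illustrative): opt in to per-allocation metrics.
telemetry {
  publish_allocation_metrics = true
  prometheus_metrics         = true  # expose them at /v1/metrics?format=prometheus
}
```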

I propose one of the following solutions:

  1. provide an option to omit the alloc_id label
  2. provide an option to aggregate all allocations of a task group on a single node and produce a min, max, average
  3. replace alloc_id with an alloc_counter (bikeshed on the name), where the counter starts at 1 on a given node and increments for each additional allocation from the same task group

Use-cases

In systems like Prometheus, cardinality is one of the biggest drivers of performance and cost. See here for details. High-cardinality dimensions are the most likely culprits behind data loss (e.g., metrics dropped or truncated by the vendor due to rate limiting) and performance degradation. As a result, it is important to identify and eliminate high-cardinality dimensions where feasible.

tgross commented 4 months ago

Hi @jaloren! Yeah we were wary of the high cardinality of those metrics which is why we gated them behind an opt-in flag to begin with. They're also fairly expensive to generate because we have to iterate over large chunks of the state store.

Dropping the alloc_id label would help with cardinality but not with the expense of generating the metrics every n seconds, though maybe that's ok. I'm not sure about the specific aggregations you're talking about though... lots of our users have node counts in the 10k+ range, although I suppose allocations cycle out a lot more than nodes do. Is aggregating per-node all that meaningful, given that Nomad tries to spread allocs from the same job across nodes anyways? We definitely should do some thinking here about what this aggregation looks like.

In the meantime, I'm going to mark this issue as needing some further discussion and roadmapping.

jaloren commented 4 months ago

@tgross Ah, I wasn't considering large deployments; I can see how the node ID would turn into a high-cardinality dimension in that scenario. My reasoning is that, no matter what people intend, you do end up with environmental differences, and those differences can impact resource consumption, so it would be useful to know that one node has elevated memory growth for an allocation. But nodes should be cattle, so dropping it is probably fine.

My original suggestions were conservative and assumed the current design as a given, but I'd like to throw out a more radical idea if there's an overarching concern about performance and cardinality:

  1. move metric collection to the Nomad agent where the allocations are running.
  2. convert the metrics into logs emitted to a file, stdout, or a TCP port. There would be some annoying bits to figure out around log rotation.
  3. use Vector to map the logs to metrics and ship them off to whatever sink you want. Vector supports both a Prometheus remote write sink and an exporter interface.

That said, this is a significant change, and dropping the allocation, node, and instance IDs would be fine for my use case.

tgross commented 4 months ago

Apologies @jaloren, I've just realized that in my initial response I confused publish_allocation_metrics, which are published from the client, with the expensive work we do for job summary metrics, which are published from the server (and can be disabled with disable_dispatched_job_summary_metrics). So dropping the alloc_id field would actually be pretty useful on more moderate-sized clusters.
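
To make the distinction concrete, a sketch of a telemetry block touching both knobs (illustrative values only):

```hcl
telemetry {
  # Client-side: per-allocation metrics, currently labeled with alloc_id.
  publish_allocation_metrics = true

  # Server-side: skip job summary metrics for dispatched jobs.
  disable_dispatched_job_summary_metrics = true
}
```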

For the more radical architectural rethinking of metrics, we're somewhat settled on using hashicorp/go-metrics because of cross-product consistency. But we could hypothetically implement any kind of MetricSink interface we'd like and inject that as part of the agent configuration. Something I could imagine there is the ability to add a denylist configuration to the telemetry block so that users can selectively omit specific metrics or labels.
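
For what it's worth, here's a rough sketch of how a label denylist could be layered on as a wrapping go-metrics sink. This is purely hypothetical: LabelDenySink and the denylist plumbing don't exist in Nomad or go-metrics today; only the MetricSink interface and Label type come from hashicorp/go-metrics.

```go
package telemetry

import metrics "github.com/hashicorp/go-metrics"

// LabelDenySink is a hypothetical MetricSink wrapper that strips labels
// named in a denylist (e.g. alloc_id) before forwarding to the real sink.
type LabelDenySink struct {
	sink   metrics.MetricSink
	denied map[string]struct{}
}

var _ metrics.MetricSink = (*LabelDenySink)(nil)

func NewLabelDenySink(sink metrics.MetricSink, deniedLabels []string) *LabelDenySink {
	d := make(map[string]struct{}, len(deniedLabels))
	for _, name := range deniedLabels {
		d[name] = struct{}{}
	}
	return &LabelDenySink{sink: sink, denied: d}
}

// filter returns the labels with any denylisted names removed.
func (s *LabelDenySink) filter(labels []metrics.Label) []metrics.Label {
	out := make([]metrics.Label, 0, len(labels))
	for _, l := range labels {
		if _, drop := s.denied[l.Name]; !drop {
			out = append(out, l)
		}
	}
	return out
}

func (s *LabelDenySink) SetGauge(key []string, val float32) { s.sink.SetGauge(key, val) }
func (s *LabelDenySink) SetGaugeWithLabels(key []string, val float32, labels []metrics.Label) {
	s.sink.SetGaugeWithLabels(key, val, s.filter(labels))
}
func (s *LabelDenySink) EmitKey(key []string, val float32)     { s.sink.EmitKey(key, val) }
func (s *LabelDenySink) IncrCounter(key []string, val float32) { s.sink.IncrCounter(key, val) }
func (s *LabelDenySink) IncrCounterWithLabels(key []string, val float32, labels []metrics.Label) {
	s.sink.IncrCounterWithLabels(key, val, s.filter(labels))
}
func (s *LabelDenySink) AddSample(key []string, val float32) { s.sink.AddSample(key, val) }
func (s *LabelDenySink) AddSampleWithLabels(key []string, val float32, labels []metrics.Label) {
	s.sink.AddSampleWithLabels(key, val, s.filter(labels))
}
```

The filtering itself is cheap, so most of the effort for something like this would presumably be in plumbing a denylist option through the agent's telemetry configuration.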