jaloren opened this issue 4 months ago
Hi @jaloren! Yeah we were wary of the high cardinality of those metrics which is why we gated them behind an opt-in flag to begin with. They're also fairly expensive to generate because we have to iterate over large chunks of the state store.
Dropping the alloc_id label would help with cardinality but not with the expense of generating the metrics every n seconds; maybe that's ok, though. I'm not sure about the specific aggregations you're talking about: lots of our users have node counts in the 10k+ range, although I suppose allocations cycle out a lot more often than nodes do. Is aggregating per-node all that meaningful, given that Nomad tries to spread allocs from the same job across nodes anyway? We should definitely do some thinking here about what this aggregation looks like.
In the meantime, I'm going to mark this issue as needing some further discussion and roadmapping.
@tgross Ah, I wasn't considering large deployments; I can see how the node would turn into a high-cardinality label in that scenario. My reasoning was that, no matter what, you end up with environmental differences, and those differences can affect resource consumption, so it would be useful to know that one node shows elevated memory growth for an allocation. But nodes should be cattle, so dropping the label is probably fine.
My original suggestions were conservative and assumed the current design as a given, but I'd like to throw out a more radical idea if there's an overarching concern about performance and cardinality.
That said, that would be a significant change, and dropping the allocation ID, node ID, and instance ID would be fine for my use case.
Apologies @jaloren, I've just realized that in my initial response I confused `publish_allocation_metrics`, which are published from the client, with the expensive work we do for job summary metrics, which are published from the server (and which can be disabled with `disable_dispatched_job_summary_metrics`). So dropping the `alloc_id` field would actually be pretty useful on more moderate-sized clusters.
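For reference, both of the options mentioned above live in the agent's `telemetry` block; a minimal sketch (values are illustrative, not a recommendation):

```hcl
telemetry {
  # Client-side: emit per-allocation resource metrics (adds the alloc_id label).
  publish_allocation_metrics = true

  # Server-side: skip the expensive summary metrics for dispatched jobs.
  disable_dispatched_job_summary_metrics = true
}
```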
For the more radical architectural rethinking of metrics, we're somewhat settled on using `hashicorp/go-metrics` because of cross-product consistency. But we could hypothetically implement any kind of `MetricSink` interface we'd like and inject it as part of the agent configuration. Something I could imagine there is adding a denylist configuration to the `telemetry` block so that users can selectively omit specific metrics or labels.
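To make the denylist idea concrete, here is a minimal sketch of a wrapper sink that strips configured labels before forwarding. The real `MetricSink` interface in `hashicorp/go-metrics` has more methods (counters, samples, etc.); the `GaugeSink` interface, `denySink`, and `recordSink` names below are hypothetical and narrowed to a single method just to show the shape of the idea:

```go
package main

import "fmt"

// Label mirrors the Label type in hashicorp/go-metrics.
type Label struct {
	Name  string
	Value string
}

// GaugeSink is a hypothetical, narrowed slice of the go-metrics
// MetricSink interface: just enough to demonstrate label filtering.
type GaugeSink interface {
	SetGaugeWithLabels(key []string, val float32, labels []Label)
}

// denySink wraps another sink and drops any label whose name is in
// deny (e.g. deny["alloc_id"] = true) before forwarding.
type denySink struct {
	next GaugeSink
	deny map[string]bool
}

func (d *denySink) SetGaugeWithLabels(key []string, val float32, labels []Label) {
	kept := labels[:0:0] // fresh backing array; never mutate the caller's slice
	for _, l := range labels {
		if !d.deny[l.Name] {
			kept = append(kept, l)
		}
	}
	d.next.SetGaugeWithLabels(key, val, kept)
}

// recordSink records what it receives, standing in for a real backend sink.
type recordSink struct{ last []Label }

func (r *recordSink) SetGaugeWithLabels(key []string, val float32, labels []Label) {
	r.last = labels
}

func main() {
	rec := &recordSink{}
	sink := &denySink{next: rec, deny: map[string]bool{"alloc_id": true}}
	sink.SetGaugeWithLabels(
		[]string{"nomad", "client", "allocs", "memory", "rss"}, 1024,
		[]Label{{"alloc_id", "uuid-123"}, {"task", "web"}},
	)
	fmt.Println(rec.last) // alloc_id dropped, task label kept
}
```

The same pattern would work for the counter and sample methods; the denylist itself could be populated from a new `telemetry` block option.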
Proposal
Nomad can produce fine-grained resource metrics for allocations; you opt in by setting the telemetry option `publish_allocation_metrics`. These metrics include the label `alloc_id`, a UUID that changes frequently for numerous reasons (e.g., deploying a new version, rescheduling). This makes `alloc_id` a high-cardinality dimension.
I propose one of the following solutions:
Use-cases
In systems like Prometheus, cardinality is one of the biggest contributors to the performance and cost of the system. See here for details. High-cardinality dimensions are the most likely culprits behind data loss (e.g., metrics dropped or truncated by the vendor due to rate limiting) and performance degradation. As a result, it is important to identify and eliminate high-cardinality dimensions where feasible.
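As an interim mitigation on the scrape side, anyone ingesting these metrics with Prometheus can drop the label before storage using metric relabeling; a sketch of a scrape-config fragment (the job name is illustrative):

```yaml
scrape_configs:
  - job_name: nomad            # illustrative job name
    metric_relabel_configs:
      - action: labeldrop      # remove the high-cardinality label at ingestion
        regex: alloc_id
```

This keeps the per-allocation series from ever reaching the TSDB, at the cost of losing the ability to distinguish allocations entirely.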