Open alexandreLamarre opened 1 year ago
Consider removing some host metrics for more streamlined processing
not started, rolling over to sprint 4
K8s resource attribute labels contain valuable data for downstream use, but for metrics we should drop container metadata and resource uid labels during processing on the node OTel agents to reduce the memory footprint.
Did a ton of profiling of our OTel agents; it seems the memory problems with the collector are related to the kubeletstats receiver quickly filling up memory on startup and causing subsequent page faults when the large vectors are dropped (if the aggregator comes up last). This is kind of expected, so this won't be a problem.
Since the action part of the attributes processor doesn't allow matching keys against a regex, we may need to "externalize" the k8s.*.uid portions into our struct that executes the templates:
```yaml
k8sattributes:
  passthrough: false
  pod_association:
    - sources:
        - from: resource_attribute
          name: k8s.pod.ip
    - sources:
        - from: resource_attribute
          name: k8s.pod.uid
    - sources:
        - from: connection
```
and then add an attributes processor to explicitly drop all relevant `_uid` labels on metrics:
```yaml
attributes/k8smetrics:
  include:
    match_type: regexp
    metric_names: ["^k8s"]
    attributes:
      - key: k8s_pod_uid
      # ...
  actions:
    # ...
    - action: delete
      key: k8s_pod_uid
```
Alternatively, we could have a separate k8sattributes processor for metrics, but I think having the same shared context information for all observability data will also be important for AIOps services.
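If we go with the shared approach, the pipeline wiring would presumably look something like this minimal sketch (the receiver and exporter names are placeholders for whatever the node agents actually run):

```yaml
service:
  pipelines:
    metrics:
      receivers: [kubeletstats]                             # placeholder receiver
      processors: [k8sattributes, attributes/k8smetrics]
      exporters: [otlp]                                     # placeholder exporter
```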
Also looking through the node_exporter and kube-state-metrics documentation (exporters bundled with kube-prometheus-stack) to see which labels we can drop on metrics to further reduce load.
Edit: since attributes processors don't support regex, and these types of exporters necessarily have to be scraped with Prometheus, it makes sense to have these types of labels dropped by a relabel config on the OTEL Prometheus receiver.
Since we are probably locked into using that relabel config, we should expose an `additionalRelabelConfigs` field on our capability spec.
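As a rough sketch of what that relabel config could look like on the receiver side (the job name and label regex are illustrative, not what we would actually ship):

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kube-state-metrics        # illustrative job name
          metric_relabel_configs:
            # drop uid-style labels before the samples enter the pipeline
            - action: labeldrop
              regex: "uid|pod_uid"
```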
Specifically, for dropping labels on metrics we can't use attributes processors / metricstransform processors / base OTTL functions, because dropping all values for a particular label is an unsafe operation (will the metric still be valid? how are data points associated after the drop?). We must do so with aggregations, but the aforementioned processors can only do aggregations when they know the resulting label set, which in our generic case is not possible; see the sketch below.
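For context, a minimal metricstransform aggregation sketch looks like the following (metric and label names are illustrative); note that `label_set` has to enumerate the labels to keep, which is exactly the part we can't know generically:

```yaml
processors:
  metricstransform:
    transforms:
      - include: k8s.pod.cpu.time             # illustrative metric name
        action: update
        operations:
          # merge data points so only these labels remain; everything else is
          # aggregated away with an explicit, opinionated aggregation type
          - action: aggregate_labels
            label_set: [k8s.namespace.name, k8s.pod.name]
            aggregation_type: sum
```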
We will have to either create our own OTTL functions for the transform processor or include this logic in a metrics OTLP forwarder on the gateway, so I will circle back to this issue once we have the gateway OTLP implementation and can draw comparisons between the two methods.
Spent a while playing around with aggregations on protobuf metrics ... I think implementing it as a custom processor which has access to the optimized metrics data structures (and useful methods) will ultimately be better.
I also want to take a look at potentially introducing our own OTEL instrument / measurement proxy at the source to drop labels. It is much simpler than doing blind aggregations later in the metric's lifetime.
My experimentation with unsafe dropping of attributes shows it will only affect metric "outcomes" instead of producing errors further down the processing line ... I would still rather figure out a way to detect whether the dropped attributes affect the outcome of the time series and aggregate them away. In that case we also end up having to choose an opinionated metric aggregation (mean, min, max, sum, ...), which itself affects the outcome of the time series.
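For comparison, the "unsafe" drop with the stock transform processor would be something along these lines (a sketch assuming the `k8s.pod.uid` datapoint attribute; series that only differed by that attribute get silently merged rather than re-aggregated):

```yaml
processors:
  transform:
    metric_statements:
      - context: datapoint
        statements:
          # removes the attribute from every data point without re-aggregating
          - delete_key(attributes, "k8s.pod.uid")
```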
Since OTLP will likely be feature gated for the initial release, I will implement an OTel collector processor for the unsafe dropping mechanism I had for OTLP.
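Wiring that future processor into the node agent config would presumably look something like this (the processor name and its `keys` field are purely hypothetical placeholders for whatever the implementation ends up exposing):

```yaml
processors:
  # hypothetical custom processor implementing the unsafe label drop
  unsafelabeldrop:
    keys: [k8s.pod.uid, k8s.container.name]
service:
  pipelines:
    metrics:
      receivers: [kubeletstats]
      processors: [k8sattributes, unsafelabeldrop]
      exporters: [otlp]
```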
Related to OEP