rancher / opni

Multi Cluster Observability with AIOps
https://opni.io
Apache License 2.0

OTEL Metrics Collection : Batching & Label Processing (Optimization) #1163

Open alexandreLamarre opened 1 year ago

alexandreLamarre commented 1 year ago

Related to OEP

alexandreLamarre commented 1 year ago

Consider removing some host metrics for more streamlined processing
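For reference, a sketch of what trimming host metrics could look like on the node agent, assuming we use the standard hostmetrics receiver (the scraper names are the upstream ones; which to keep is still up for discussion):

```yaml
receivers:
  hostmetrics:
    collection_interval: 60s
    scrapers:
      # keep the high-signal scrapers
      cpu:
      memory:
      filesystem:
      network:
      # dropped for streamlined processing: disk, load, paging,
      # processes, process (the per-process scraper is the most expensive)
```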

alexandreLamarre commented 1 year ago

not started, rolling over to sprint 4

alexandreLamarre commented 1 year ago

K8s resource attribute labels contain valuable data for downstream use, but for metrics we should drop container metadata and resource uid labels during processing on the node otel agents, to reduce memory footprint.

alexandreLamarre commented 1 year ago

did a ton of profiling of our otel agents; it seems like the memory problems with the collector are related to the kubeletstats receiver quickly filling up memory on startup and causing subsequent page faults when the large vectors are dropped (if the aggregator comes up last). This is kind of expected, so it won't be a problem
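If the startup memory spike from kubeletstats ever does become a problem, one knob worth noting is restricting the metric groups it collects. A sketch against the upstream kubeletstats receiver; the chosen groups and endpoint are illustrative, not a settled recommendation:

```yaml
receivers:
  kubeletstats:
    collection_interval: 30s
    auth_type: serviceAccount
    endpoint: ${K8S_NODE_NAME}:10250
    # only collect node- and pod-level stats; skipping the container and
    # volume groups reduces the number of series buffered at startup
    metric_groups:
      - node
      - pod
```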

alexandreLamarre commented 1 year ago

since the action part of attribute processing doesn't allow matching keys against a regex, we may need to "externalize" the portions of k8s.*.uid into our struct that executes the templates

k8sattributes:
    passthrough: false
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: connection

to explicitly drop all relevant _uid labels on metrics.

  attributes/k8smetrics:
    include:
      match_type: regexp
      metric_names: ["^k8s"]
      attributes:
      - key: k8s_pod_uid
#...
    actions:
#...
      - action: delete
        key: k8s_pod_uid

alexandreLamarre commented 1 year ago

Alternatively, we could have a separate k8sattributes processor for metrics, but I think having the same shared context information for all observability data will also be important for the AIOps services
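For context, sharing one k8sattributes processor across signals would look roughly like this in the collector's service section (receiver and exporter names here are placeholders, not our actual pipeline definitions):

```yaml
service:
  pipelines:
    metrics:
      receivers: [kubeletstats]
      # same shared k8sattributes processor, plus metrics-only attribute drops
      processors: [k8sattributes, attributes/k8smetrics, batch]
      exporters: [otlp]
    logs:
      receivers: [filelog]
      processors: [k8sattributes, batch]
      exporters: [otlp]
```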

alexandreLamarre commented 1 year ago

also looking through the node_exporter & kube-state-metrics exporter documentation (exporters bundled with kube-prometheus-stack) to see which labels we can drop on metrics to further reduce load

Edit: since attribute processors don't support regex matching on keys, and these exporters necessarily have to be scraped with Prometheus, it makes sense to have these labels dropped by an OTEL Prometheus receiver relabel config.

Since we are probably locked into using that relabel config, we should expose an additionalRelabelConfigs field on our capability spec.
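A sketch of what an injected relabel config could look like under the OTEL prometheus receiver; the scrape config and the label regex are purely illustrative of what additionalRelabelConfigs might carry:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kube-state-metrics
          kubernetes_sd_configs:
            - role: endpoints
          metric_relabel_configs:
            # drop high-cardinality labels before they enter the pipeline;
            # labeldrop removes any label whose name matches the regex
            # from every scraped series
            - action: labeldrop
              regex: (uid|container_id)
```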

alexandreLamarre commented 1 year ago

Specifically for dropping labels on metrics, we can't use attribute processors / metricstransform / base OTTL functions, because dropping all values for a particular label is an unsafe operation (will the metric still be valid? how are data points associated after the drop?). We must do so with aggregations, but the aforementioned processors can only aggregate when they know the resulting label set — which in our generic case is not possible.
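For comparison, this is what aggregation looks like when the result label set is known, using the upstream metricstransform processor; the fixed label_set below is exactly the part we cannot supply generically:

```yaml
processors:
  metricstransform:
    transforms:
      - include: ^k8s\..*
        match_type: regexp
        action: update
        operations:
          # collapse series down to the listed labels; every other label
          # is dropped and overlapping data points are summed
          - action: aggregate_labels
            label_set: [k8s.namespace.name, k8s.pod.name]
            aggregation_type: sum
```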

We will have to either create our own OTTL functions for the transform processor or include this logic in a metrics OTLP forwarder on the gateway. We'll circle back to this issue once we have the gateway OTLP implementation and can draw comparisons between the two methods.

alexandreLamarre commented 1 year ago

Spent a while playing around with aggregations on protobuf metrics ... I think implementing it as a custom processor which has access to the optimized metrics data structures (and useful methods) will ultimately be better.

I also want to take a look at potentially introducing our own OTEL instrument / measurement proxy at the source to drop labels. It is much simpler than doing blind aggregations later in the metric's lifetime.

alexandreLamarre commented 1 year ago

My experimentation with unsafe dropping of attributes shows it only affects metric "outcomes" instead of producing errors further down the processing line ... I'd still rather figure out a way to detect whether drops affect the outcome of time series and aggregate them away. In that case we also end up having to choose an opinionated metric aggregation (mean, min, max, sum, ...), which likewise affects the outcome of the time series.

alexandreLamarre commented 1 year ago

Since OTLP will likely be feature-gated for the initial release, I will implement an otel collector processor for the unsafe dropping mechanism I had for OTLP.