newrelic-experimental / monitoring-kubernetes-with-opentelemetry

Apache License 2.0

Provide the possibility to implement drop rules for Logs and potentially Metrics #160

Open tvalchev2 opened 3 months ago

Summary

At the moment we use drop rules in New Relic to prevent unwanted logs from being ingested and incurring costs. Examples of dropped logs are healthcheck logs from the pods (or the applications in them) and success logs with status code 200 (for nginx requests, for example). However, these logs are still collected by OpenTelemetry and sent to the New Relic endpoint, which puts unnecessary load on the collectors/senders.

Another use case would be excluding metrics for certain pods/labels. For example, if a Kubernetes CronJob is configured to keep its last 5 runs, there are 5 completed job Pods in the namespace. These expose container metrics like pod_status_phase or similar, which are scraped from kube-state-metrics and also produce a very large amount of ingest that we currently drop manually in New Relic via drop rules. Filtering them earlier would also reduce the size of the kube-state-metrics scrape, which helps the controller and takes load off it.

Desired Behavior

One should be able to define drop/filtering rules so that logs matching a certain pattern are not sent to New Relic and do not have to be handled by the collector (at least not for sending them).

Possible Solution

In my opinion this should be possible with the filterprocessor, but we need to be able to define those drop rules in the Helm chart. For example, given a log line like this:

2024-08-06 11:17:48.550 INFO (qtp1597328335-63) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/health params={} status=0 QTime=0

the following processor could drop it:

processors:
  filter/drop_logs_by_body_regex:
    logs:
      log_record:
        - 'IsMatch(body, ".*path=/admin/info/health.*")'

This should drop the healthchecks for a Solr application running on the cluster (maybe my regex is wrong and some characters need to be escaped with backslashes, but I guess you get the point). This healthcheck runs every 5 seconds, so it produces a huge amount of logs and puts strain on the collector when it has to send something like 1000 such log messages every 5 seconds.
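One detail worth keeping in mind for the Helm chart side: defining a processor under `processors:` is not enough on its own, because the collector only runs processors that are also listed in a pipeline. A minimal sketch of the wiring is below. The receiver and exporter names (`filelog`, `otlphttp`) and the presence of a `batch` processor are assumptions for illustration, not taken from this chart.

```yaml
service:
  pipelines:
    logs:
      # Receiver name is an assumption; use whatever the chart configures.
      receivers: [filelog]
      # The filter has to be listed here to take effect. Placing it early
      # means later processors never see the dropped records.
      processors: [filter/drop_logs_by_body_regex, batch]
      # Exporter name is an assumption as well.
      exporters: [otlphttp]
```

Putting the filter ahead of batching keeps the dropped records out of the batches that get exported, which is where the load reduction described above would come from.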

Here is a second example, this time dropping based on an attribute:

processors:
  filter/drop_logs_by_label_values_regex:
    logs:
      log_record:
        - 'IsMatch(attributes["http.method"], "GET|POST")'
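The metrics use case from the summary (completed CronJob pods) could probably be handled with the same processor. The following is only a sketch: it assumes the metric arrives as `kube_pod_status_phase` with a `phase` data point attribute, which is how kube-state-metrics normally exposes it, but the names should be checked against what actually shows up in the collector.

```yaml
processors:
  filter/drop_completed_pod_phase_metrics:
    # Skip a data point (instead of failing) if a condition cannot be evaluated.
    error_mode: ignore
    metrics:
      datapoint:
        # Metric name and attribute key are assumptions; verify them first.
        # Drops only the data points that describe pods in the Succeeded phase.
        - 'metric.name == "kube_pod_status_phase" and attributes["phase"] == "Succeeded"'
```

If several conditions are listed, a data point is dropped as soon as any one of them matches, so each drop rule can go on its own line. Note that this filters after scraping, so it would reduce what is sent to New Relic but not the size of the kube-state-metrics scrape itself.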

Additional context