newrelic-experimental / monitoring-kubernetes-with-opentelemetry


The namespace filter for Events on the SNG Collector is not working #143

Closed tvalchev2 closed 4 months ago

tvalchev2 commented 6 months ago

Description

The namespace filter for Events is not working as intended. The logs created from Kubernetes Events are landing in the wrong NewRelic accounts, seemingly at random. Sometimes a single Event/Log is sent to multiple NewRelic accounts at the same time, and the namespace filtering does not always work.

For the time being, I have simply disabled the events in the values.yaml for all our clusters.
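For reference, a minimal sketch of that workaround, assuming the chart exposes a switch for the singleton/events collector in values.yaml (the key names here are hypothetical; the actual values of the nrotelk8s chart may name them differently):

```yaml
# Hypothetical values.yaml snippet - the actual key that controls the
# singleton (sng) collector / events pipeline in the nrotelk8s chart
# may be named differently.
singleton:
  enabled: false   # turn off the sng collector and thus the Kubernetes Event logs
```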

Steps to Reproduce

Install version 0.8.1+ of nrotelk8s with events enabled (the sng collector should be present on the cluster). Have at least 3-4 different "teams" alongside the "opsteam", since the filter sometimes works for some namespaces; the more namespaces you have, the more likely it is to fail. On a cluster with ~80 namespaces (50-60 teams or so), every NewRelic team account contained around 30-40 Event Logs from wrong namespaces.

Expected Behavior

The Kubernetes Event Logs are distributed and mapped to the proper corresponding NewRelic account. No events from another team's namespaces should be visible.

Relevant Logs / Console output

There are no errors or other output from the sng collector itself; the data is simply being sent to the wrong accounts. I have created some NewRelic links from the time when the 0.8.1 release was rolled out, covering multiple team/NR accounts, which I can provide in a troubleshooting meeting. I have also made a copy of the sng-collector ConfigMap, which I can likewise provide for troubleshooting purposes.

Your Environment

Additional context

I have done some troubleshooting, but I couldn't pinpoint the issue. There is potentially an error in the singleton template at line 199: https://github.com/newrelic-experimental/monitoring-kubernetes-with-opentelemetry/blob/d234f3494a857934437217bdc1c6b5985d3e0abc/helm/charts/collectors/templates/singleton-otelcollector.yaml#L199 where deployment should probably be switched to singleton. This is not causing the issue, however, since it lies in the else block that is only reached when one is not using the global NewRelic config and values, and in our use case this code is never executed.

Maybe the filter is looking up k8s.namespace.name in the wrong place? From what I understood from the opentelemetry k8seventsreceiver repo, the namespace does not end up as a resource attribute but as a log record attribute: attrs.PutStr(semconv.AttributeK8SNamespaceName, ev.InvolvedObject.Namespace) https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/14d7d4d897259e8d582421d11812cce760487e96/receiver/k8seventsreceiver/k8s_event_to_logdata.go#L74
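To make that distinction concrete, here is a minimal sketch (not the chart's actual configuration) using the filterprocessor's OTTL syntax; the namespace value `team1` is made up:

```yaml
processors:
  # Drops every log record for which the condition below is true.
  filter/team1:
    error_mode: ignore
    logs:
      log_record:
        # This matches what the k8seventsreceiver actually sets: a log
        # *record* attribute named k8s.namespace.name.
        - attributes["k8s.namespace.name"] != "team1"
        # By contrast, resource.attributes["k8s.namespace.name"] would be
        # empty on these logs and could never match as intended.
```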

Or maybe line 133: https://github.com/newrelic-experimental/monitoring-kubernetes-with-opentelemetry/blob/d234f3494a857934437217bdc1c6b5985d3e0abc/helm/charts/collectors/templates/singleton-otelcollector.yaml#L133 should be extended with namespaces, according to the documentation: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/k8seventsreceiver/README.md (see the Examples section). But then again, this is the configuration for the receiver. All events from all namespaces are being received; they are just mapped/dispatched to the wrong teams/NR accounts, so I suspect the problem lies somewhere in the filter processors.
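For completeness, this is roughly what that receiver-side restriction from the linked README looks like (the namespace names here are made up); as said, it would only limit which events are collected at all, not how they are dispatched to accounts:

```yaml
receivers:
  k8s_events:
    auth_type: serviceAccount
    # Only watch events from these namespaces; when this list is omitted,
    # events from all namespaces are collected.
    namespaces: [team1, team2]
```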

However, I am by no means a specialist in OpenTelemetry, so maybe I am wrong.

utr1903 commented 5 months ago

@tvalchev2 With #146, the events are again being sent only to the opsteam without any filter, until the filtering is fixed.

utr1903 commented 5 months ago

Probable explanation: the k8seventsreceiver enriches logs with log record attributes, not resource attributes. The filterprocessor, however, only considers resource.attributes and therefore does not filter properly.

-> Will be fixed in the next release!
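One possible shape of such a fix (a sketch only, not necessarily what the actual release does) is to promote the record-level attribute to a resource attribute with the transformprocessor before the existing filters run:

```yaml
processors:
  # Copy the namespace set by the k8seventsreceiver from the log record
  # attributes onto the resource, so that filters matching on
  # resource.attributes["k8s.namespace.name"] start working.
  transform/promote_namespace:
    log_statements:
      - context: log
        statements:
          - set(resource.attributes["k8s.namespace.name"], attributes["k8s.namespace.name"])
```

Alternatively, the filter conditions themselves could be changed to match attributes[...] instead of resource.attributes[...], as sketched earlier.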

utr1903 commented 4 months ago

Fixed with #156 and closing the issue. We can reopen if it still exists :)