signalfx / splunk-otel-collector-chart

Splunk OpenTelemetry Collector for Kubernetes

Disconnect between K8s and Host dashboards and OOTB Attributes/Dimensions #337

Closed: TylerHelmuth closed this issue 10 months ago

TylerHelmuth commented 2 years ago

We are experiencing frustration around the OOTB Splunk Distribution of the Collector and the K8s and Host dashboards. Those dashboards expect the old-style variable names such as kubernetes_cluster, kubernetes_node, kubernetes_pod_name, kubernetes_namespace, and host, but the default charts generate a collector that only correlates traces and metrics on the OTel-specific names. As a result, if we want any Global Data Links from traces to both Splunk IM and the K8s dashboards, we have to manually configure the collector to add the kubernetes_* and host attributes, or manually add the values as tags to the traces themselves.
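To illustrate what we mean by "manually configure the collector": a sketch of that kind of override using the collector's resource processor might look like the following. This is only an illustration of the approach, not our exact config, and the attribute list is just an example of the legacy names the dashboards expect.

```yaml
# Sketch only: copy OTel resource attributes onto the legacy dimension names
# the OOTB dashboards expect. The processor still has to be added to the
# relevant metrics/traces pipelines for it to take effect.
processors:
  resource/legacy-dimensions:
    attributes:
      - key: kubernetes_cluster
        from_attribute: k8s.cluster.name
        action: insert
      - key: kubernetes_node
        from_attribute: k8s.node.name
        action: insert
      - key: kubernetes_pod_name
        from_attribute: k8s.pod.name
        action: insert
      - key: kubernetes_namespace
        from_attribute: k8s.namespace.name
        action: insert
      - key: host
        from_attribute: host.name
        action: insert
```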

We would like the charts to include the Splunk-specific tags as well, or the dashboards to be updated to use the OTel semantic conventions.

aunshc commented 2 years ago

Thanks for voicing this @TylerHelmuth - Joe deBlaquiere and I are looking into this.

jadbSFx commented 2 years ago

Hi Tyler, you should not need to add additional dimensions/properties to the data you're sending in. The mapping service (see https://docs.splunk.com/Observability/gdi/migrate/mapping-service.html for an overview) should generally treat the OTel semantics as equivalent to the legacy semantics for the purposes of analytics. For instance, we treat k8s.cluster.name as equivalent to kubernetes_cluster in analytics jobs by essentially rewriting the metadata query to include an OR clause. In time we are migrating the OOTB dashboards, navigators, and detectors to use OTel semantics. Since these are equivalencies, legacy custom dashboards and detectors should work with the new OTel semantics; similarly, content using the new semantics works with data using legacy conventions.

Incidentally, for different reasons, the trace data should all be following OTel semantics (and translated on ingest if submitted with legacy conventions).

I think the one thing you point out that can be a bit of a challenge is Global Data Links. These Data Links are tied to specific metadata semantics. My recommendation (as a practical workaround) would be to create duplicate Global Data Links for both semantic conventions. If you define two Global Data Links, one for kubernetes_cluster and one for k8s.cluster.name, both pointing to the same destination, that should "just work"... however, that brings me to another challenge.

The second challenge you may encounter is that when you define the data link for k8s.cluster.name to target a dashboard with a Dashboard Variable for kubernetes_cluster, it won't set the Dashboard Variable to that value. Instead it will set a filter on the dashboard for k8s.cluster.name:{{source_value}}. This can result in a slightly different experience because of how preferences are set on the individual Dashboard Variables. Dashboard Variables have two preferences: one defines whether the filter applies to all charts or only those with explicit filters, and one determines whether to exclude data which doesn't have that metadata defined. A plain filter does not have similar options/restrictions.

One final place you will see the impact of the mapping service on presentation is in the chart data tables. If you have data from multiple semantic conventions, you will potentially see both sets of dimensions used in the data tables; however, if filters or aggregation functions are applied to the data, then the results of the analytics jobs will follow the conventions of the filters. So adding filter=filter('k8s.pod.uid', '*') to a SignalFlow expression will enforce the semantic convention for that specific metadata value, and the results for all MTS, regardless of original metadata convention, will include the dimension value in the k8s.pod.uid column. In any case, the Data Links available for any specific result column will only be those assigned to that specific convention. So selecting a result value for kubernetes_cluster will only show data links defined for that key name; it will not show data links for k8s.cluster.name.

I ended up building some examples to test these behaviors, so if you would like to walk through those and discuss please let me know (via email).

TylerHelmuth commented 2 years ago

What we've been focused on recently has definitely been the links to dashboards. The experience we've been having is that the collector correctly sets all the OTel spec tags like k8s.pod.name and host.name, but if you configure those values to link to one of Splunk's predefined dashboards, they don't work; they don't even fall back to adding extra filters instead of setting variables.

We love the move to OTel, and the experience with the Splunk Distribution of the OpenTelemetry Collector has been OTel-first since we've been using it, which is awesome, but it definitely feels like the UI has not caught up.

Since the Observability UI is what our IT org uses to troubleshoot, we want to provide the best experience possible. Right now it seems like that requires either updating the Observability UI to be OTel-first, or updating the collector to include everything the Observability UI depends on in addition to the OTel tags.
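As a concrete example of the second option, a chart-side override might look roughly like the sketch below. This assumes agent.config is still the values key that gets merged into the generated agent configuration (please correct me if that has changed), and only a couple of the legacy names are shown.

```yaml
# Sketch only: chart values adding a processor that duplicates OTel resource
# attributes under the legacy names. Assumes agent.config is merged into the
# generated agent config. Adding the processor to the service pipelines is
# still required and, if list values replace rather than merge, would mean
# repeating the chart's default processor list (not shown here).
agent:
  config:
    processors:
      resource/legacy-dimensions:
        attributes:
          - key: kubernetes_cluster
            from_attribute: k8s.cluster.name
            action: insert
          - key: host
            from_attribute: host.name
            action: insert
          # ...same idea for the remaining legacy names
```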

This in-between land we are in, where all the data is perfect and Splunk APM and Splunk IM play nice but the built-in dashboards (which sometimes have pretty significant advantages over Splunk IM) don't work when following Splunk's best practices, is pretty annoying.

github-actions[bot] commented 11 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.