Open harshraigroww opened 1 year ago
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Does the collector keep restarting?
no, collector is not restarting
I'm seeing the same issue when running more than one instance of the collector with the spanmetrics connector enabled. The generated metrics do not have a label unique to that instance of the collector, so any metrics generated for a given span end up in the same timeseries even though they are different counters from different collectors.
Could you share your configuration? @aptomaKetil
Sure @fatsheep9146:
```yaml
---
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlp/tempo:
    endpoint: tempo-distributor:4317
    compression: gzip
    balancer_name: round_robin
    tls:
      insecure: true
  otlphttp/mimir:
    endpoint: http://mimir-distributor:8080/otlp
    tls:
      insecure: true
    compression: gzip
connectors:
  spanmetrics:
    histogram:
      explicit: null
      exponential:
        max_size: 64
    dimensions:
      - name: http.route
      - name: http.method
      - name: db.system
      - name: service.namespace
    namespace: spanmetrics
processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name
      labels:
        - tag_name: __app
          key: app.kubernetes.io/name
          from: pod
        - tag_name: __app
          key: app
          from: pod
        - tag_name: __app
          key: k8s-app
          from: pod
        - tag_name: service.version
          key: app.kubernetes.io/version
          from: pod
    pod_association:
      - sources:
          - from: connection
  resource:
    attributes:
      - key: service.name
        from_attribute: __app
        action: upsert
      - key: service.instance.id
        from_attribute: k8s.pod.name
        action: upsert
      - key: service.namespace
        from_attribute: k8s.namespace.name
        action: upsert
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors:
        - memory_limiter
        - k8sattributes
        - resource
        - batch
      exporters: [otlp/tempo, spanmetrics]
    metrics:
      receivers: [otlp, spanmetrics]
      processors:
        - memory_limiter
        - k8sattributes
        - resource
        - batch
      exporters: [otlphttp/mimir]
    logs: null
```
@harshraigroww @aptomaKetil One possible cause of this I've seen in the wild is if you have jobs that can change host dynamically on restart but collectors running stable on the hosts. A job starting on host A sending metrics to collector A restarts and comes back up on host B. The collector on host B will now send lower numbers for spanmetrics, but the collector on host A will continue sending the higher number it had last for the job when it was running there. This will show up as huge differences in short spaces of time as the two collectors' results interleave.
You can fix this by upserting the collector hostname onto your spans then using that as a dimension for spanmetrics.
```yaml
attributes/collector_info:
  actions:
    - key: collector.hostname
      value: $HOSTNAME
      action: upsert
```
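The dimension half of that fix would then be set on the spanmetrics connector itself; a minimal sketch (only the dimensions list is shown, surrounding connector settings omitted):

```yaml
connectors:
  spanmetrics:
    dimensions:
      # The attribute upserted by the attributes/collector_info processor above,
      # so each collector instance produces its own timeseries.
      - name: collector.hostname
```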
We actually had this setup initially with the spanmetrics processor, but still ran into a similar problem with the connector. The issue as I see it is that internally it takes ALL resource entry labels into account, disregarding the connector configuration, and then sends the result down the line to the prometheus exporter with only the preferred label set, which causes metric collisions if the resource entry contains some changing value (we had this problem with PHP and the process id, as it starts a new process for every incoming request). Once we sanitized resource entries before the spanmetrics connector, the issue was resolved.
I experience the same behaviour: I have a collector as a DaemonSet and the calls_total metric goes up and down.
I've port-forwarded to the pod to see what values I get from the prometheus exporter, and indeed there is some fluctuation.
In my case it is the same pod on the same host, so I'm not sure if it's a label-set issue.
Edit:
Forgot to mention that if I reload configs and restart the pod, there is no fluctuation in calls_total for some time, and it appears later.
I am seeing the same behaviour with spans that are not at the root of the trace. For root spans all counter metrics are monotonic and going up.
What is the span.kind of your metric @harshraigroww?
@dan-corneanu I am experiencing this issue with all span_kind values
@harshraigroww what version of otel/opentelemetry-collector-contrib are you using? I have just updated my docker image to use the latest version from dockerhub, and at first glance it seems that I no longer have this problem. I'll keep investigating, though.
Collector-contrib version is 0.75.0. I will check with the latest version again
Updated to v0.83.0 and the issue still happens. This issue really bothers me because developers lose trust in the whole setup when they see a wrong request rate. And it's impossible to scale based on request rate now...
@harshraigroww have you found solution/workaround?
No @mshebeko-twist, I am currently using the span-metrics processor instead of the connector
It looks like there are two competing sets of metrics (under the same metric name and label) being emitted to the metrics storage. Perhaps there are multiple instances of the otel collector running spanmetrics connector?
@albertteoh, thanks for the advice! Unfortunately I validated this before the upgrade: I port-forwarded to the otel-collector's metrics endpoint that prometheus scrapes, and after refreshing a couple of times I saw fluctuation in the calls_total metric for the problematic service. At one moment I get a high value that indeed represents the number of calls, and at another a low value, which is the cause of this issue; calculating rate on this metric then results in a really high value...
I have the same setup in multiple environments, and every environment has multiple instrumented services written in different languages/frameworks. What's interesting is that every environment has this issue occur for different services, which points me to the fact that it's not the OTEL instrumentation that causes this issue but the connector itself.
P.S. @harshraigroww said that it works well for him using the processor. In my case both of them eventually produce this issue.
Are there multiple instances of otel-collector pods running behind a service? Even though prometheus is scraping from a single otel-collector port, the service could be load balancing across the otel-collector pods.
These metrics are all held in memory on the otel-collector instance; there's no federation across otel-collector instances.
I wouldn't expect the spanmetrics processor or connector to produce fluctuating metrics like that, especially when we can see a monotonically increasing pattern as from the screenshot above.
They are running on each Kubernetes node and exposed via nodePort. And for validation I've port-forwarded to a specific pod to isolate the problem. This is how I've confirmed the fluctuation of the value. So it's the same OTEL collector, monitoring the same pod, producing different results.
Okay, thanks for checking that.
Is it possible to create a local reproducer through docker containers? You could use https://github.com/jaegertracing/jaeger/tree/main/docker-compose/monitor as a template.
Hey, just an idea. Could we somehow make the spanmetrics connector tag its metrics with a unique UUID? This would allow us to detect whether the collector process itself gets restarted, or whether there are multiple processes sending metrics with the same dimensions. What do you think?
This is how I can reproduce the issue.
App middleware.tc10java17 (instrumented with javaagent) => otelcol (v0.86.0, as agent) => otelcol (v0.90.1, as gateway) => prometheus
The spanmetricsconnector is configured on the gateway collector.
The agent collector is configured as follows:
```yaml
(...)
processors:
  transform:
    error_mode: ignore
    trace_statements:
      - context: resource
        statements:
          - set(attributes["provider.observability"], "true")
(...)
  pipelines:
    traces:
      receivers:
        - otlp
      processors:
        - memory_limiter
        - transform
        - attributes
        - batch
      exporters:
        - otlp/gateway
```
The following steps are done:
- Change provider.observability from true to false and restart the agent collector.
- After the restart, the span metrics start going "up and down" (red circle in the screenshot).
- The issue is resolved as soon as the gateway collector is restarted (blue circle in the screenshot).
Another hint: when the flag resource_to_telemetry_conversion of the prometheusremotewrite exporter is enabled, the metrics behave correctly, as a new time series is created due to the changed attribute. The fact that the "expired" metric never vanishes seems to be another issue (https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/17306).
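For reference, that flag sits on the exporter configuration; a minimal sketch (the endpoint value here is a placeholder assumption, not from the setup above):

```yaml
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write  # placeholder endpoint
    resource_to_telemetry_conversion:
      enabled: true  # copy resource attributes onto each metric as labels
```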
As of version 0.92.0 there is a new configuration option resource_metrics_key_attributes (see https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/29711). Updating to 0.92.0 and configuring this option resolved the issue for me.
Out of curiosity, what value did you set for resource_metrics_key_attributes?
```yaml
resource_metrics_key_attributes:
  - service.name
  - telemetry.sdk.language
  - telemetry.sdk.name
```
It is also mentioned in the example in the spanmetrics connector's README
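For context, a minimal sketch of where the option sits in the collector config (nesting follows the connector README example mentioned above):

```yaml
connectors:
  spanmetrics:
    # Only the listed resource attributes are used to key internal metric
    # streams; other (possibly changing) resource attributes no longer
    # create competing counters under the same label set.
    resource_metrics_key_attributes:
      - service.name
      - telemetry.sdk.language
      - telemetry.sdk.name
```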
Thanks @rotscher, from what I see right now calls_total has stopped fluctuating after setting resource_metrics_key_attributes as you mentioned:

```yaml
resource_metrics_key_attributes:
  - service.name
  - telemetry.sdk.language
  - telemetry.sdk.name
```
You can see counters are behaving properly now:
I filed open-telemetry/opentelemetry.io/issues/4368 since this seems like a common point of confusion, and it can potentially happen with other components. We can consider adding a link to this docs section on the spanmetrics connector page when we fix that.
My recommendation would be to use the resource detection processor or the k8sattributes processor to add an appropriate label.
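A minimal sketch of the resource detection suggestion (the processor name suffix and detector choice are assumptions; the k8sattributes alternative is already shown in the config earlier in this thread):

```yaml
processors:
  resourcedetection/collector:
    detectors: [env, system]  # the system detector adds host.name, os.type
    override: false           # keep attributes already present on the data
```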
Could we somehow make the spanmetrics connector tag its metrics with a unique UUID?
If you want to do this, you can use the UUID() function in the transform processor if the above solution does not work for you. This runs the risk of producing a cardinality explosion if restarts happen frequently, so use it at your own risk :) I also think it's less useful than the above suggestion, since the UUID does not carry any meaning.
Hi, I am also having a similar issue: using span-metrics it keeps giving wrong metrics. This is the issue I raised: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/32043. I tried with a central gateway, and also tried sending the data by routing to a second layer using service-name load balancing, but it keeps giving wrong metrics.
Some help would be appreciated!
Hi @harshraigroww any update on this issue?
I am also facing the same issue: when we use the Tempo metrics generator, we see a graph with a monotonic increase, but with the spanmetrics connector the graph is not monotonic.
As @rotscher suggested, I added resource_metrics_key_attributes, but still no luck: the graph has ups and downs and is not monotonic.
Hi @harshraigroww @aptomaKetil, as suggested I upserted the collector hostname onto spans and used that as a dimension for spanmetrics, but the issue is not resolved. Also, as @rotscher suggested, I added resource_metrics_key_attributes, but the graph still has ups and downs and does not increase monotonically. We face the same issue here: when we use the Tempo metrics generator we see a monotonically increasing graph, but with the spanmetrics connector the graph is not monotonic.
Component(s)
connector/spanmetrics
What happened?
Description
I am using the spanmetrics connector to generate metrics from spans. The calls metric is a counter, so its value should always increase, but I can see the graph of the metrics generated by this connector going up and down.
Steps to Reproduce
Using the spanmetrics connector config, passing it as an exporter in the trace pipeline and receiving it in the metrics pipeline.
Expected Result
calls and duration_count are counter metrics, so their values should always increase.
Actual Result
When the graph is plotted using these metrics, it goes up and down.
Collector version
0.75.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04") Compiler(if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
Log output
No response
Additional context
No response