Open harshraigroww opened 1 year ago
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Does the collector keep restarting?
no, collector is not restarting
I'm seeing the same issue when running more than one instance of the collector with the spanmetrics connector enabled. The generated metrics do not have a label unique to that instance of the collector, so any metrics generated for a given span end up in the same timeseries even though they are different counters from different collectors.
Could you share your configuration? @aptomaKetil
Sure @fatsheep9146:
```yaml
---
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlp/tempo:
    endpoint: tempo-distributor:4317
    compression: gzip
    balancer_name: round_robin
    tls:
      insecure: true
  otlphttp/mimir:
    endpoint: http://mimir-distributor:8080/otlp
    tls:
      insecure: true
    compression: gzip
connectors:
  spanmetrics:
    histogram:
      explicit: null
      exponential:
        max_size: 64
    dimensions:
      - name: http.route
      - name: http.method
      - name: db.system
      - name: service.namespace
    namespace: spanmetrics
processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name
      labels:
        - tag_name: __app
          key: app.kubernetes.io/name
          from: pod
        - tag_name: __app
          key: app
          from: pod
        - tag_name: __app
          key: k8s-app
          from: pod
        - tag_name: service.version
          key: app.kubernetes.io/version
          from: pod
    pod_association:
      - sources:
          - from: connection
  resource:
    attributes:
      - key: service.name
        from_attribute: __app
        action: upsert
      - key: service.instance.id
        from_attribute: k8s.pod.name
        action: upsert
      - key: service.namespace
        from_attribute: k8s.namespace.name
        action: upsert
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors:
        - memory_limiter
        - k8sattributes
        - resource
        - batch
      exporters: [otlp/tempo, spanmetrics]
    metrics:
      receivers: [otlp, spanmetrics]
      processors:
        - memory_limiter
        - k8sattributes
        - resource
        - batch
      exporters: [otlphttp/mimir]
    logs: null
```
@harshraigroww @aptomaKetil One possible cause of this I've seen in the wild is if you have jobs that can change host dynamically on restart but collectors running stable on the hosts. A job starting on host A sending metrics to collector A restarts and comes back up on host B. The collector on host B will now send lower numbers for spanmetrics, but the collector on host A will continue sending the higher number it had last for the job when it was running there. This will show up as huge differences in short spaces of time as the two collectors' results interleave.
You can fix this by upserting the collector hostname onto your spans then using that as a dimension for spanmetrics.
```yaml
attributes/collector_info:
  actions:
    - key: collector.hostname
      value: $HOSTNAME
      action: upsert
```
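The dimension half of that fix would then be set on the spanmetrics connector itself; a minimal sketch (only the dimensions list is shown, surrounding connector settings omitted):

```yaml
connectors:
  spanmetrics:
    dimensions:
      # The attribute upserted by the attributes/collector_info processor above,
      # so each collector instance produces its own timeseries.
      - name: collector.hostname
```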
We actually had this setup initially with the spanmetrics processor, but still ran into a similar problem with the connector. The issue as I see it is that internally it takes ALL resource entry labels into account, disregarding the connector configuration, and then sends the result down the line to the prometheus exporter with only the preferred label set, which causes metric collisions if the resource entry contains some changing value (we had this problem with PHP and the process id, as it starts a new process for every incoming request). Once we sanitized resource entries before the spanmetrics connector, the issue was resolved.
I experience the same behaviour: I have a collector as a DaemonSet and the calls_total metric goes up and down.
I've port-forwarded to the pod to see what values I get from the prometheus exporter, and indeed there is some fluctuation.
In my case it is the same pod on the same host, so I'm not sure if it's a label-set issue.
Edit:
Forgot to mention that if I reload configs and restart the pod, there is no fluctuation in calls_total for some time, and it appears later.
I am seeing the same behaviour with spans that are not at the root of the trace. For root spans all counter metrics are monotonic and going up.
What is the span.kind of your metric @harshraigroww?
@dan-corneanu I am experiencing this issue with all span_kind values
@harshraigroww what version of otel/opentelemetry-collector-contrib are you using? I have just updated my docker image to use the latest version from dockerhub, and at first glance it seems that I no longer have this problem. I'll keep investigating, though.
Collector-contrib version is 0.75.0. I will check with the latest version again
Updated to v0.83.0 and the issue still happens. This issue really bothers me because developers lose trust in the whole setup when they see a wrong request rate. And it's impossible to scale based on request rate now...
@harshraigroww have you found solution/workaround?
No @mshebeko-twist, I am currently using the span-metrics processor instead of the connector
It looks like there are two competing sets of metrics (under the same metric name and label) being emitted to the metrics storage. Perhaps there are multiple instances of the otel collector running spanmetrics connector?
@albertteoh, thanks for the advice! Unfortunately I validated this before the upgrade: I port-forwarded to the otel-collector's metrics endpoint that prometheus scrapes, and after refreshing a couple of times I saw fluctuation in the calls_total metric for the problematic service. At one moment I get a high value that indeed represents the number of calls, and at another a low value, which is the cause of this issue; calculating rate on this metric then results in a really high value...
I have the same setup in multiple environments, and every environment has multiple instrumented services written in different languages/frameworks. What's interesting is that every environment has this issue occur for different services, which points me to the fact that it's not the OTEL instrumentation that causes this issue but the connector itself.
P.S. @harshraigroww said that it works well for him using the processor. In my case both of them eventually produce this issue.
Are there multiple instances of otel-collector pods running behind a service? Even though prometheus is scraping from a single otel-collector port, the service could be load balancing across the otel-collector pods.
These metrics are all held in memory on the otel-collector instance; there's no federation across otel-collector instances.
I wouldn't expect the spanmetrics processor or connector to produce fluctuating metrics like that, especially when we can see a monotonically increasing pattern as from the screenshot above.
They are running on each Kubernetes node and exposed via nodePort. And for validation I've port-forwarded to a specific pod to isolate the problem. This is how I've confirmed the fluctuation of the value. So it's the same OTEL collector, monitoring the same pod, producing different results.
Okay, thanks for checking that.
Is it possible to create a local reproducer through docker containers? You could use https://github.com/jaegertracing/jaeger/tree/main/docker-compose/monitor as a template.
Hey, just an idea. Could we somehow make the spanmetrics connector tag its metrics with a unique UUID? This would allow us to detect whether the collector process itself gets restarted, or whether there are multiple processes sending metrics with the same dimensions. What do you think?
This is how I can reproduce the issue.
App middleware.tc10java17 (instrumented with javaagent) => otelcol (v0.86.0, as agent) => otelcol (v0.90.1, as gateway) => prometheus
The spanmetricsconnector is configured on the gateway collector.
The agent collector is configured as follows:
```yaml
(...)
processors:
  transform:
    error_mode: ignore
    trace_statements:
      - context: resource
        statements:
          - set(attributes["provider.observability"], "true")
(...)
  pipelines:
    traces:
      receivers:
        - otlp
      processors:
        - memory_limiter
        - transform
        - attributes
        - batch
      exporters:
        - otlp/gateway
```
The following steps are done:
- Change provider.observability from true to false and restart the agent collector.
- After the restart, the span metrics start going "up and down" (red circle in the screenshot).
- The issue is resolved as soon as the gateway collector is restarted (blue circle in the screenshot).
Another hint: when the flag resource_to_telemetry_conversion of the prometheusremotewrite exporter is enabled, the metrics behave correctly, as a new time series is created due to the changed attribute. The fact that the "expired" metric never vanishes seems to be another issue (https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/17306).
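For reference, that flag sits on the exporter configuration; a minimal sketch (the endpoint value here is a placeholder assumption, not from the setup above):

```yaml
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write  # placeholder endpoint
    resource_to_telemetry_conversion:
      enabled: true  # copy resource attributes onto each metric as labels
```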
As of version 0.92.0 there is a new configuration option resource_metrics_key_attributes (see https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/29711). Updating to 0.92.0 and configuring this option resolved the issue for me.
Out of curiosity, what value did you set for resource_metrics_key_attributes?
```yaml
resource_metrics_key_attributes:
  - service.name
  - telemetry.sdk.language
  - telemetry.sdk.name
```
It is also mentioned in the example in the spanmetrics connector's README
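For context, a minimal sketch of where the option sits in the collector config (nesting follows the connector README example mentioned above):

```yaml
connectors:
  spanmetrics:
    # Only the listed resource attributes are used to key internal metric
    # streams; other (possibly changing) resource attributes no longer
    # create competing counters under the same label set.
    resource_metrics_key_attributes:
      - service.name
      - telemetry.sdk.language
      - telemetry.sdk.name
```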
Thanks @rotscher, from what I see right now calls_total has stopped fluctuating after setting resource_metrics_key_attributes as you mentioned:

```yaml
resource_metrics_key_attributes:
  - service.name
  - telemetry.sdk.language
  - telemetry.sdk.name
```
You can see counters are behaving properly now:
I filed open-telemetry/opentelemetry.io/issues/4368 since this seems like a common point of confusion, and it can potentially happen with other components. We can consider adding a link to this docs section on the spanmetrics connector page when we fix that.
My recommendation would be to use the resource detection processor or the k8sattributes processor to add an appropriate label.
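A minimal sketch of the resource detection suggestion (the processor name suffix and detector choice are assumptions; the k8sattributes alternative is already shown in the config earlier in this thread):

```yaml
processors:
  resourcedetection/collector:
    detectors: [env, system]  # the system detector adds host.name, os.type
    override: false           # keep attributes already present on the data
```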
Could we somehow make the spanmetrics connector tag its metrics with a unique UUID?
If you want to do this, you can use the UUID() function in the transform processor if the above solution does not work for you. This runs the risk of producing a cardinality explosion if restarts happen frequently, so use it at your own risk :) I also think it's less useful than the above suggestion, since the UUID does not carry any meaning.
Hi, I am also having a similar issue: using span-metrics it keeps giving wrong metrics. This is the issue I raised: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/32043. I tried with a central gateway, and also tried sending the data by routing to a second layer using service-name load balancing, but it keeps giving wrong metrics.
Some help would be appreciated!
Hi @harshraigroww any update on this issue?
I am also facing the same issue: when we use the Tempo metrics generator, we see a graph with a monotonic increase, but with the spanmetrics connector the graph is not monotonic.
As @rotscher suggested, I added resource_metrics_key_attributes, but still no luck: the graph has ups and downs and is not monotonic.
Hi @harshraigroww @aptomaKetil, as suggested I upserted the collector hostname onto spans and used that as a dimension for spanmetrics, but the issue is not resolved. Also, as @rotscher suggested, I added resource_metrics_key_attributes, but the graph still has ups and downs and does not increase monotonically. We face the same issue here: when we use the Tempo metrics generator we see a monotonically increasing graph, but with the spanmetrics connector the graph is not monotonic.
Component(s)
connector/spanmetrics
What happened?
Description
I am using the spanmetrics connector to generate metrics from spans. The calls metric is a counter, so its value should always increase, but I can see the graph of the metrics generated by this connector going up and down.
Steps to Reproduce
Using the spanmetrics connector config, passing it as an exporter in the trace pipeline and receiving it in the metrics pipeline.
Expected Result
calls and duration_count are counter metrics, so their values should always increase.
Actual Result
When the graph is plotted using these metrics, it goes up and down.
Collector version
0.75.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04") Compiler(if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
Log output
No response
Additional context
No response