duc12597 closed this issue 2 months ago.
Pinging code owners:
connector/spanmetrics: @portertech @Frapschen
See Adding Labels via Comments if you do not have permissions to add labels yourself.
I have a few questions that might help us track down this issue: Is there any chance your collector is restarting at these points? Are you running just one collector or many in a gateway mode?
I'm running the collector as a deployment, and have tried both 1 and 3 replicas. The collector did not restart; I had to terminate the pods to get metrics exporting again.
I see... honestly at this point I don't quite know what would cause it to eventually stop emitting metrics at all - that's the symptom that is really throwing me for a loop.
Are you still having these problems? Can you try increasing `resource_metrics_cache_size`? The thought is that this might prevent evictions, which in turn might prevent the resets.
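For reference, a sketch of raising the cache size on the connector (the value is illustrative, not a recommendation):

```yaml
connectors:
  spanmetrics:
    # A larger cache means fewer evictions of per-resource metric state,
    # which should mean fewer apparent counter resets. Value is illustrative.
    resource_metrics_cache_size: 100000
```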
Other things that might help us track down this problem - what is the count of the unique series within `calls_total` over time? Are there resets happening for a series that the TSDB has already gotten, or are there entirely new series?
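The unique-series count can be graphed with a query like:

```promql
# Number of distinct calls_total series currently present in the TSDB
count(calls_total)
```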
This is `count(calls_total)` at approximately the time the counter decreases:

Further observation shows that, of the 3 metrics receivers in my collector configuration, `kafka/metrics` and `prometheus` worked fine:

Only metrics from `spanmetrics` failed:
Thanks for your update. Did you try changing the cache size? I'm honestly a little stumped - any ideas @portertech @Frapschen ?
With the current config the connector will permanently cache every series it sees and send them all during each flush, even the ones where nothing has changed. So eventually the payload flushed to `prometheusremotewrite` gets so large that the remote write request times out (i.e. `context deadline exceeded` is a timeout), and the request likely gets rejected by the remote write target because of its size:

```
Permanent error: context deadline exceeded"}], "dropped_items": 58510}
```
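If oversized flushes are the problem, the exporter's deadline and queue settings can also be tuned. A hedged sketch against the `prometheusremotewrite` exporter (all values illustrative, not recommendations):

```yaml
exporters:
  prometheusremotewrite:
    endpoint: http://mimir-nginx/api/v1/push
    timeout: 30s                # raise the per-request deadline
    remote_write_queue:
      enabled: true
      queue_size: 10000         # buffer more samples between flushes
      num_consumers: 5          # parallel remote write requests
```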
Possible things that could help are:
- `metrics_expiration` on the connector, so that infrequently updated span metrics are removed. Then you have to deal with Prometheus counter resets.
- the `batch` processor and/or `prometheusremotewrite`'s built-in config

I set `metrics_expiration: 30m`, but the metrics still disappeared altogether. They returned after ~6 hours, but somehow the collectors did not restart.
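Putting the two suggestions above together, the relevant pieces of config might look like this (values illustrative):

```yaml
connectors:
  spanmetrics:
    metrics_expiration: 30m     # drop series not updated within this window

processors:
  batch:
    send_batch_size: 8192       # flush once this many items accumulate
    send_batch_max_size: 8192   # hard cap on batch size
    timeout: 5s                 # flush at least this often

service:
  pipelines:
    metrics:
      processors: [batch]       # batch before the remote write exporter
```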
@duc12597 Have you tried switching from the push model to the pull model? i.e. replace your `prometheusremotewrite` with `prometheusexporter`.
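For what it's worth, the pull-model swap would look roughly like this (the port is illustrative); Prometheus or Mimir would then scrape the collector instead of receiving remote writes:

```yaml
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889      # scrape target exposed by the collector

service:
  pipelines:
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]   # replaces prometheusremotewrite
```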
We will consider this option. As of now the collector has been running for 2 weeks without any errors, although there are still counter fluctuations. I'm not sure if it's thanks to any changes on our side. I will close this issue for now and will re-open it in the future if this problem resurfaces.
This is my complete collector manifest:
```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: 03-sink-metric-prometheus
spec:
  image: mirror.gcr.io/otel/opentelemetry-collector-contrib:0.102.0
  replicas: 5
  nodeSelector:
    mycompany.com/service: observability
    kubernetes.io/arch: amd64
  tolerations:
    - effect: NoSchedule
      key: mycompany.com/service
      value: observability
      operator: Equal
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: 03-sink-metric-prometheus
              scrape_interval: 10s
              static_configs:
                - targets: ['127.0.0.1:8888']
      kafka/traces:
        protocol_version: 3.3.1
        brokers:
          - b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
          - b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
        auth:
          tls:
            insecure: true
        topic: otlp_spans
        group_id: 03-sink-metric-prometheus
      kafka/metrics:
        protocol_version: 3.3.1
        brokers:
          - b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
          - b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
        auth:
          tls:
            insecure: true
        topic: otlp_metrics
        group_id: 03-sink-metric-prometheus
    processors:
      filter:
        error_mode: ignore
        metrics:
          datapoint:
            - 'IsMatch(attributes["http.target"], ".*.(css|js)")'
      transform:
        error_mode: ignore
        metric_statements:
          - context: datapoint
            statements:
              # reduce the cardinality of metrics with params
              - replace_pattern(attributes["http.target"], "/users/[0-9]{13}", "/users/{userId}")
    connectors:
      spanmetrics:
        dimensions:
          - name: http.method
          - name: http.target
          - name: http.status_code
          - name: host.name
          - name: myCustomLabel
        exclude_dimensions:
          - span.kind
          - span.name
          - status.code
        exemplars:
          enabled: true
        metrics_flush_interval: 15s
        metrics_expiration: 1h
        resource_metrics_key_attributes:
          - service.name
          - telemetry.sdk.language
          - telemetry.sdk.name
        resource_metrics_cache_size: 10000
    exporters:
      debug:
      prometheusremotewrite:
        endpoint: http://mimir-nginx/api/v1/push
        send_metadata: true
    service:
      telemetry:
        metrics:
          address: 127.0.0.1:8888
          level: detailed
      extensions:
        - sigv4auth
      pipelines:
        traces:
          receivers:
            - kafka/traces
          processors: []
          exporters:
            - spanmetrics
        metrics:
          receivers:
            - kafka/metrics
            - prometheus
            - spanmetrics
          processors:
            - filter
            - transform
          exporters:
            - debug
            - prometheusremotewrite
  env:
    - name: GOMEMLIMIT
      value: 1640MiB # 80% of resources.limits.memory
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
    limits:
      cpu: 500m
      memory: 2Gi
```
@duc12597 sorry for pinging you, there is a related issue for counter fluctuation, please see https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/34126#issuecomment-2270690510 to fix it.
If I understand correctly, this will add a UUID as a label for every metric generated by each collector pod. Will this explode the cardinality? Why does a UUID solve the fluctuation? Can you give an example config?
Thanks a ton.
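As I understand the linked workaround, the idea is that with several replicas each collector emits its own counter stream; without a distinguishing label those streams collide into one series in the TSDB and look like resets. A per-instance label keeps them separate, so cardinality only grows by the replica count, not explosively. A hedged sketch using the `resource` processor with the pod name instead of a UUID (the `collector.instance` key is hypothetical, and `POD_NAME` is assumed to be injected via the Kubernetes Downward API):

```yaml
processors:
  resource/collector-instance:
    attributes:
      - key: collector.instance   # hypothetical label name
        value: ${env:POD_NAME}    # assumes POD_NAME is injected via the Downward API
        action: upsert
```

The processor would then be added to the metrics pipeline ahead of the exporter.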
Component(s)
connector/spanmetrics
What happened?
Description
Our collector receives OTLP traces from Kafka, converts them into metrics, and exports them to a TSDB. After a certain period of collector uptime (24-48 hours), the generated `calls_total` counter suffers a significant drop in value. Eventually no more metrics are exported.

Steps to Reproduce
Follow the collector configuration below.

Expected Result
The `calls_total` counter is ever-increasing.

Actual Result
The `calls_total` counter drops, then disappears.

Collector version
v0.101.0
Environment information
Environment
AWS EKS 1.24
OpenTelemetry Collector configuration
Log output
Additional context