Closed: carlosmt86-hub closed this issue 8 months ago.
I can confirm I see the same issue on version 0.93.0. I don't scrape the metrics the same way, though; I use telegraf for that rather than otelcol itself. My metrics stop after a couple of minutes, tops.
This worked fine on version 0.91.0 (0.92.0 doesn't work behind an AWS ALB due to a bug in grpc-go, so I can't say whether that version was affected).
I get the same error from the metrics endpoint:
$ curl http://localhost:8888/metrics
An error has occurred while serving metrics:
collected metric "otelcol_exporter_queue_size" { label:{name:"exporter" value:"loadbalancing"} label:{name:"service_instance_id" value:"5fa99331-b105-4dc7-a62d-b850fa048393"} label:{name:"service_name" value:"otelcol-contrib"} label:{name:"service_version" value:"0.93.0"} gauge:{value:0}} was collected before with the same name and label values
I run this on AL2023 (an AWS EC2 arm instance) with this config:
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 5s
        sending_queue:
          queue_size: 10000
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otelsampler.etc
        timeout: 3s
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 30
  batch:
    send_batch_max_size: 12288
    timeout: 5s
extensions:
  health_check:
  zpages:
service:
  extensions: [zpages, health_check]
  pipelines:
    traces/otlp:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
    metrics:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
    logs:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
I do, however, have other collectors (which the collectors above send to, for testing out sampling) that run in a similar way, also on version 0.93.0, and their metrics still work fine. They have this config:
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  file/otlpdebug:
    path: /tmp/otlpdebug
    rotation:
      max_megabytes: 10
      max_days: 2
      max_backups: 4
      localtime: true
  otlp/apmserver:
    endpoint: "https://a.apm.etc:8200"
    retry_on_failure:
      max_elapsed_time: 1000s
    sending_queue:
      queue_size: 5000
    timeout: 10s
    headers:
      Authorization: "Bearer NA"
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 30
  batch:
    send_batch_size: 5120
    send_batch_max_size: 5120
    timeout: 5s
  tail_sampling:
    decision_wait: 10s
    num_traces: 20000
    expected_new_traces_per_sec: 1000
    policies: [
      {
        name: errors-policy,
        type: status_code,
        status_code: { status_codes: [ERROR] }
      },
      {
        name: latency-policy,
        type: latency,
        latency: { threshold_ms: 500 }
      },
      {
        name: randomized-policy,
        type: probabilistic,
        probabilistic: { sampling_percentage: 10 }
      },
      {
        # Always sample if the force_sample attribute is set to true
        name: force-sample-policy,
        type: boolean_attribute,
        boolean_attribute: { key: force_sample, value: true }
      },
    ]
extensions:
  health_check:
  zpages:
service:
  extensions: [zpages, health_check]
  pipelines:
    traces/otlp:
      receivers: [otlp]
      exporters: [file/otlpdebug, otlp/apmserver]
      processors: [memory_limiter, tail_sampling, batch]
    metrics:
      receivers: [otlp]
      exporters: [otlp/apmserver]
      processors: [memory_limiter, batch]
    logs:
      receivers: [otlp]
      exporters: [otlp/apmserver]
      processors: [memory_limiter, batch]
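Neither config above sets a service::telemetry block, so the internal metrics endpoint on port 8888 presumably comes from the Collector defaults. A minimal sketch of spelling that out explicitly (both values are assumed to be the defaults for this version, not taken from the issue):

service:
  telemetry:
    metrics:
      level: basic       # assumed default
      address: ":8888"   # assumed default; the endpoint curled above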
@carlosmt86-hub Which PR was this issue fixed with? I'm seeing a similar problem with version 0.92.
Component(s)
No response
What happened?
Description
After some minutes (30-45) the telemetry metrics stop working, and we see this error when doing curl -v 127.0.0.1:8888/metrics:
collected metric "otelcol_exporter_queue_size" { label:{name:"exporter" value:"datadog"} label:{name:"service_instance_id" value:"a2cdae11-0e8f-4b05-a31a-9de31429d8a7"} label:{name:"service_name" value:"otelcol-contrib"} label:{name:"service_version" value:"0.93.0"} gauge:{value:0}} was collected before with the same name and label values
In the logs I have:
warn internal/transaction.go:123 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1706557569579, "target_labels": "{name=\"up\", instance=\"127.0.0.1:8888\", job=\"otel-collector\"}"}
Steps to Reproduce
Enable telemetry and scrape it with the Prometheus receiver; after some minutes it stops working.
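A minimal sketch of the kind of Prometheus receiver scrape config implied by the log line in the description (job name and target are taken from that log; the scrape interval is an assumption):

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          scrape_interval: 10s   # assumed; not stated in the issue
          static_configs:
            - targets: ["127.0.0.1:8888"]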
Expected Result
Telemetry metrics continue working.
Actual Result
Telemetry metrics stop working after some minutes.
Collector version
v0.93.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
Log output
Additional context
No response