open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[exporter/prometheusremotewrite] Enabling WAL prevents metrics from being forwarded #15277

Open ImDevinC opened 1 year ago

ImDevinC commented 1 year ago

What happened?

Description

When using the prometheusremotewrite exporter with the WAL enabled, no metrics are sent from the collector to the remote write destination.

Steps to Reproduce

Using the config in the configuration section below, this error can be reproduced by sending metrics to the collector. Disabling the wal section causes all metrics to be sent properly.

Expected Result

Prometheus metrics should appear in the remote write destination.

Actual Result

No metrics were sent to the remote write destination.

Collector version

0.62.1

Environment information

Environment

AWS Bottlerocket running the otel/opentelemetry-collector-contrib:0.36.3 Docker image

OpenTelemetry Collector configuration

exporters:
  logging:
    loglevel: info
  prometheusremotewrite:
    endpoint: http://thanos-receive-distributor:19291/api/v1/receive
    remote_write_queue:
      enabled: true
      num_consumers: 1
      queue_size: 5000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: false
      initial_interval: 5s
      max_elapsed_time: 10s
      max_interval: 10s
    target_info:
      enabled: false
    timeout: 15s
    tls:
      insecure: true
    wal:
      buffer_size: 100
      directory: /data/prometheus/wal
      truncate_frequency: 45s
extensions:
  health_check: {}
  memory_ballast: {}
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679
processors:
  batch: {}
  batch/metrics:
    send_batch_max_size: 500
    send_batch_size: 500
    timeout: 180s
  memory_limiter:
    check_interval: 5s
    limit_mib: 4915
    spike_limit_mib: 1536
receivers:
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_compact:
        endpoint: 0.0.0.0:6831
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
      - job_name: opentelemetry-collector
        scrape_interval: 10s
        static_configs:
        - targets:
          - ${MY_POD_IP}:8888
  zipkin:
    endpoint: 0.0.0.0:9411
service:
  extensions:
  - health_check
  - pprof
  - zpages
  pipelines:
    logs:
      exporters:
      - logging
      processors:
      - memory_limiter
      - batch
      receivers:
      - otlp
    metrics:
      exporters:
      - prometheusremotewrite
      processors:
      - batch/metrics
      receivers:
      - otlp
    traces:
      exporters:
      - logging
      processors:
      - memory_limiter
      - batch
      receivers:
      - otlp
      - jaeger
      - zipkin
  telemetry:
    metrics:
      address: 0.0.0.0:8888

Log output

No response

Additional context

From debugging, this looks to be a deadlock between persistToWAL() and readPrompbFromWAL(), but I'm not 100% certain.

HudsonHumphries commented 1 year ago

+1 I am also having issues when using the WAL for the prometheusremotewrite exporter. The only way I could get it to export metrics was by setting buffer_size to 1, and exporting one metric at a time is not an option.
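
(For anyone else hitting this: the workaround above amounts to roughly the following in the exporter's wal settings; it drains a single entry at a time, so it is not viable for real traffic.)

```yaml
prometheusremotewrite:
  # ...other exporter settings...
  wal:
    directory: /data/prometheus/wal
    buffer_size: 1  # workaround only: exports one WAL entry at a time
```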

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

ImDevinC commented 1 year ago

We have moved off of the prometheusremotewrite exporter, and it looks like there's no action on this. Closing the ticket.

kovrus commented 1 year ago

@ImDevinC Can you reopen this issue? It has to be investigated and fixed anyway.

ckt114 commented 1 year ago

Any update on this? I'm seeing the same issue. As soon as I enable the WAL, no metrics are sent out.

gouthamve commented 1 year ago

This is a deadlock. From what I can see, the following is happening:

readPrompbFromWAL:

  1. Takes mutex
  2. Reads data
  3. If data is found, returns

The problem is when no data is found; in that case, it watches the file:

  1. Takes mutex
  2. Reads data
  3. If no data is found, watches the file for updates
  4. Blocks while still holding the mutex, so writes can't happen and new data never arrives.

Removing the file watcher fixes the issue.


However, it exposes another bug: we keep reading the same data and resending the same requests again and again. I think the WAL implementation needs a closer look.
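
To make the failure mode concrete, here is a minimal, self-contained Go sketch of that pattern (the wal type, persistToWAL/readFromWAL, and the channel standing in for the file watcher are illustrative only, not the actual exporter code): the reader waits for the watcher's signal while still holding the mutex, so the writer can never append and the signal never arrives. Running it, the Go runtime aborts with "fatal error: all goroutines are asleep - deadlock!".

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type wal struct {
	mu      sync.Mutex
	entries [][]byte
	newData chan struct{} // stands in for the file watcher
}

// persistToWAL models the write path: it needs the mutex to append.
func (w *wal) persistToWAL(entry []byte) {
	w.mu.Lock() // blocks forever once the reader below is parked holding the mutex
	defer w.mu.Unlock()
	w.entries = append(w.entries, entry)
	select {
	case w.newData <- struct{}{}: // notify a waiting reader, if any
	default:
	}
}

// readFromWAL models the read path: if nothing is buffered, it waits for the
// "watcher" to signal new data -- while still holding the mutex.
func (w *wal) readFromWAL() [][]byte {
	w.mu.Lock()
	defer w.mu.Unlock()
	if len(w.entries) == 0 {
		<-w.newData // deadlock: the writer needs w.mu before it can send this signal
	}
	out := w.entries
	w.entries = nil
	return out
}

func main() {
	w := &wal{newData: make(chan struct{})}
	go func() {
		time.Sleep(100 * time.Millisecond)
		w.persistToWAL([]byte("sample")) // never completes
	}()
	fmt.Println(w.readFromWAL()) // never returns
}
```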

kumar0204 commented 1 year ago

I am working on a setup as described below and have the same issue: OpenTelemetry metrics are not forwarded to Grafana via VictoriaMetrics when the WAL is enabled in the OpenTelemetry Collector configuration. However, when the WAL is disabled, the metrics appear on the Grafana dashboard.

Flow: App --> OTel Agent --> VictoriaMetrics --> Grafana. Use case: I want metrics to be persisted in the event of a failure. For example, if vminsert/VM goes down and comes back online after some downtime, the OTel agent should retry the failed metrics and post them to VM, and they should then be visible in Grafana.

Please advise if a better solution is available for my use case. Please, someone, help fix this issue.

frzifus commented 1 year ago

I can confirm the same. To be able to test it faster, I moved the relevant parts into a config file that works locally.

Details: Locally tested config with reported settings:

```yaml
---
exporters:
  logging:
    verbosity: detailed
  prometheusremotewrite:
    endpoint: http://127.0.0.1:9090/api/v1/write
    remote_write_queue:
      enabled: true
      num_consumers: 1
      queue_size: 5000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: false
      initial_interval: 5s
      max_elapsed_time: 10s
      max_interval: 10s
    target_info:
      enabled: false
    timeout: 15s
    tls:
      insecure: true
    wal:
      buffer_size: 100
      directory: ./wal
      truncate_frequency: 45s
extensions:
  health_check: {}
  memory_ballast: {}
  pprof:
    endpoint: :1888
processors:
  batch: {}
  batch/metrics:
    send_batch_max_size: 500
    send_batch_size: 500
    timeout: 180s
  memory_limiter:
    check_interval: 5s
    limit_mib: 4915
    spike_limit_mib: 1536
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
service:
  extensions: [health_check, pprof]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite]
  telemetry:
    metrics:
      address: 0.0.0.0:8888
```

Then I used telemetrygen to generate some data. The collector starts to hang and needs to be force-killed.

telemetrygen metrics --otlp-insecure --duration 45s --rate 500
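
(telemetrygen lives in this repo under cmd/telemetrygen; assuming a recent Go toolchain, it can be installed with `go install github.com/open-telemetry/opentelemetry-collector-contrib/cmd/telemetrygen@latest`.)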

But using this patch from @sh0rez, https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/20875, I start to receive metrics:

# HELP rwrecv_requests_total 
# TYPE rwrecv_requests_total counter
rwrecv_requests_total{code="200",method="GET",path="/metrics",remote="localhost"} 3
rwrecv_requests_total{code="200",method="POST",path="/api/v1/write",remote="localhost"} 29
# HELP rwrecv_samples_received_total 
# TYPE rwrecv_samples_received_total counter
rwrecv_samples_received_total{remote="localhost"} 7514

zakariais commented 1 year ago

> I am working on a setup as described below and have the same issue: OpenTelemetry metrics are not forwarded to Grafana via VictoriaMetrics when the WAL is enabled in the OpenTelemetry Collector configuration. However, when the WAL is disabled, the metrics appear on the Grafana dashboard.
>
> Flow: App --> OTel Agent --> VictoriaMetrics --> Grafana. Use case: I want metrics to be persisted in the event of a failure. For example, if vminsert/VM goes down and comes back online after some downtime, the OTel agent should retry the failed metrics and post them to VM, and they should then be visible in Grafana.
>
> Please advise if a better solution is available for my use case. Please, someone, help fix this issue.

@kumar0204 I'm looking to do the same thing: have OTel retry failed metrics in case the backend goes down. Did you find anything for this, like persistence or some other option with the remote write exporter?

frzifus commented 1 year ago

@zakariais is the filestorage extension what you are looking for?
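
For reference, a minimal sketch of that pattern (hypothetical directory and backend endpoint): the file_storage extension backs the standard exporterhelper sending_queue, so queued data survives restarts. Note this applies to exporters that expose sending_queue; whether it helps with the prometheusremotewrite exporter's own remote_write_queue is exactly the question below.

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/storage  # hypothetical path, must be writable by the collector

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp:
    endpoint: backend.example.com:4317  # hypothetical backend
    sending_queue:
      enabled: true
      storage: file_storage  # persist the queue via the extension above

service:
  extensions: [file_storage]
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp]
```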

zakariais commented 1 year ago

> @zakariais is the filestorage extension what you are looking for?

@frzifus does the filestorage extension work with the prometheusremotewrite exporter? I didn't see it mentioned in that exporter's README.

kumar0204 commented 1 year ago

> I am working on a setup as described below and have the same issue: OpenTelemetry metrics are not forwarded to Grafana via VictoriaMetrics when the WAL is enabled in the OpenTelemetry Collector configuration. However, when the WAL is disabled, the metrics appear on the Grafana dashboard. Flow: App --> OTel Agent --> VictoriaMetrics --> Grafana. Use case: I want metrics to be persisted in the event of a failure. For example, if vminsert/VM goes down and comes back online after some downtime, the OTel agent should retry the failed metrics and post them to VM, and they should then be visible in Grafana. Please advise if a better solution is available for my use case. Please, someone, help fix this issue.
>
> @kumar0204 I'm looking to do the same thing: have OTel retry failed metrics in case the backend goes down. Did you find anything for this, like persistence or some other option with the remote write exporter?

I have two types of persistence in our setup. The flow is: Service/Application --> OTel Agent (filestorage extension for persistence) --> OTel Collector/Gateway (prometheusremotewrite WAL for persistence) --> VictoriaMetrics (SRE backend) --> Grafana.

First use case: metrics are stored on the agent side using the filestorage extension, so if the gateway is down, metrics are replayed from the OTel agent. Second use case: if VictoriaMetrics/Prometheus is down, metrics are stored in the WAL and replayed from the gateway once the SRE backend is up and running again. The second use case has the WAL issue: when the WAL is enabled, metrics never reach Grafana. I hope that makes the issue clear.

github-actions[bot] commented 10 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions[bot] commented 8 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions[bot] commented 8 months ago

Pinging code owners for exporter/prometheusremotewrite: @Aneurysm9 @rapphil. See Adding Labels via Comments if you do not have permissions to add labels yourself.

cheskayang commented 7 months ago

I have a similar setup to @kumar0204 and am running into the exact same issue when enabling the WAL on prometheusremotewrite.

frzifus commented 7 months ago

There is actually already a fix that just has to be polished: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/20875

Do you want to work on that, @cheskayang?

cheskayang commented 6 months ago

@frzifus thanks for letting me know! I saw you opened a PR after this comment, but it's stale: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/29297

Do you still plan to ship the fix?

devyanigoil commented 4 months ago

@kumar0204 I have a similar setup as well. Were you able to solve the WAL issue?

diranged commented 2 months ago

Ping ... we'd really like to see this get fixed as well... :/

sh0rez commented 2 months ago

I've reopened and rebased https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/20875, which will fix this.

a-shoemaker commented 3 weeks ago

prometheusremotewrite with the WAL enabled just flat out doesn't work; I've never seen it work, anyway. It looks like there has been a PR out there to fix it for over a year. Curious what the plan is here: merge that, land a different fix, just remove the WAL, or just leave it out there, indifferently not working at all?