open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

prometheusremotewriteexporter logs spurious errors with WAL configured and no metrics to be sent #24399

Open gburek-fastly opened 1 year ago

gburek-fastly commented 1 year ago

Component(s)

exporter/prometheusremotewrite

What happened?

Description

When the prometheusremotewriteexporter is configured with a WAL directory and is receiving no metrics to export, it emits "error processing WAL entries" error logs even though there are, in fact, no entries to process. The exporter should fall back to waiting on fsnotify.NewWatcher(), but it does not.
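
For illustration, here is a minimal sketch of what such a fallback could look like, assuming the exporter's tidwall/wal log and the fsnotify package; readOrWait is a hypothetical helper, not the exporter's actual code:

package walwatch

import (
	"errors"

	"github.com/fsnotify/fsnotify"
	"github.com/tidwall/wal"
)

// readOrWait reads entry `index` from the WAL; if the entry does not exist
// yet, it blocks on an fsnotify watcher for the WAL directory and retries,
// instead of retrying on a timer and surfacing wal.ErrNotFound as an error.
func readOrWait(w *wal.Log, walDir string, index uint64) ([]byte, error) {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return nil, err
	}
	defer watcher.Close()
	if err := watcher.Add(walDir); err != nil {
		return nil, err
	}
	for {
		data, err := w.Read(index)
		if err == nil {
			return data, nil
		}
		if !errors.Is(err, wal.ErrNotFound) {
			return nil, err // a real failure, not just "nothing to read yet"
		}
		// Nothing to process yet: block until a WAL segment file changes
		// rather than sleeping, retrying, and eventually logging an error.
		select {
		case <-watcher.Events:
			// Something in the directory changed; retry the read.
		case werr := <-watcher.Errors:
			return nil, werr
		}
	}
}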

Steps to Reproduce

run otelcol-contrib with included config

Expected Result

no errors and no error logs

Actual Result

error logs emitted every 5s

Collector version

v0.81.0

Environment information

Environment

OS: macOS (colima); Docker: docker run --rm --mount type=bind,source=$(pwd),target=/etc/otelcol/ otel/opentelemetry-collector-contrib:latest --config /etc/otelcol/config.yaml

OpenTelemetry Collector configuration

exporters:
  prometheusremotewrite:
    endpoint: "http://host.docker.internal:9090/api/v1/push"
    wal:
      directory: /etc/otelcol/
extensions:
  health_check: {}
  memory_ballast:
    size_in_percentage: 40
processors:
  batch: {}
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
service:
  extensions:
  - health_check
  - memory_ballast
  pipelines:
    metrics:
      exporters:
      - prometheusremotewrite
      processors:
      - memory_limiter
      - batch
      receivers:
      - otlp

Log output

2023-07-19T23:00:48.424Z        info    service/service.go:158  Everything is ready. Begin running and processing data.
2023-07-19T23:00:52.535Z        error   prw.wal prometheusremotewriteexporter@v0.77.0/wal.go:182        error processing WAL entries    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "not found"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter.(*prweWAL).run.func1
        github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter@v0.77.0/wal.go:182
2023-07-19T23:00:56.659Z        error   prw.wal prometheusremotewriteexporter@v0.77.0/wal.go:182        error processing WAL entries    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "not found"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter.(*prweWAL).run.func1
        github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter@v0.77.0/wal.go:182
2023-07-19T23:01:00.792Z        error   prw.wal prometheusremotewriteexporter@v0.77.0/wal.go:182        error processing WAL entries    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "not found"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter.(*prweWAL).run.func1
        github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter@v0.77.0/wal.go:182

Additional context

It appears that the error check in *prweWAL.readPrompbFromWAL for wal.ErrNotFound is not catching it correctly, as the "error": "not found" text is propagating up to the "error processing WAL entries" error log.

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/prometheusremotewriteexporter/wal.go#L358

github-actions[bot] commented 1 year ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 10 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

crobert-1 commented 9 months ago

I was able to reproduce and debug this locally with the shared config.

The error is actually being caught as expected; it's just that after 12 retries the read gives up and returns the err variable, which is simply the last error seen.

After every read attempt there is a sleep call whose duration increases with each failure. This is where the ~4 second delay comes from: once all of the delays are exhausted, the loop starts over from the beginning, disregarding the default (or configured) wal.TruncateFrequency.
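
Roughly, the behavior described above looks like the following sketch (illustrative only, with made-up backoff constants; not the exporter's actual code): each attempt that hits wal.ErrNotFound backs off a little longer, and once the attempts are exhausted the last error, i.e. "not found", is returned to the caller that logs "error processing WAL entries".

package walretry

import (
	"errors"
	"time"

	"github.com/tidwall/wal"
)

const maxReadAttempts = 12 // the retry count mentioned above

func readWithRetries(w *wal.Log, index uint64) ([]byte, error) {
	var err error
	for attempt := 0; attempt < maxReadAttempts; attempt++ {
		var data []byte
		data, err = w.Read(index)
		if err == nil {
			return data, nil
		}
		if !errors.Is(err, wal.ErrNotFound) {
			return nil, err
		}
		// Back off a little longer after every failed attempt; with these
		// illustrative values the twelve attempts add up to roughly four
		// seconds, matching the gap between the log lines above.
		time.Sleep(time.Duration(attempt+1) * 50 * time.Millisecond)
	}
	// Every attempt saw ErrNotFound: the last error is handed back to the
	// caller, which logs it as "error processing WAL entries" even though
	// an empty WAL is a normal state.
	return nil, err
}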

I agree this is annoying log spam, but I don't have enough expertise here to confidently say what to do about it. In most cases I think it's a good idea to log an error, since you'd theoretically expect to be receiving data if you're running the collector. Four seconds without data in a regular use case seems like a problem, so I think logging is the right thing to do.

I'll defer to the code owners, but I think at most we might decrease the log level to WARN; I don't think we would want to ignore this kind of error entirely (unless I'm missing something).
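
For what it's worth, one possible shape of that WARN downgrade, sketched under the assumption that the run loop can tell wal.ErrNotFound apart from other failures (the function and message names here are illustrative, not a patch against the actual wal.go):

package walsketch

import (
	"errors"

	"github.com/tidwall/wal"
	"go.uber.org/zap"
)

// logWALProcessingError lowers the severity when the only "failure" is that
// the WAL had nothing to read, and keeps the error level for everything else.
func logWALProcessingError(logger *zap.Logger, err error) {
	if errors.Is(err, wal.ErrNotFound) {
		logger.Warn("WAL is empty, nothing to process", zap.Error(err))
		return
	}
	logger.Error("error processing WAL entries", zap.Error(err))
}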

Apologies for the delayed response!

github-actions[bot] commented 7 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

toughnoah commented 6 months ago

@crobert-1 However, when I configure the WAL, no metrics are sent at all, and all I can find is this error log. I think this is not merely a logging problem.

github-actions[bot] commented 4 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

blockloop commented 2 months ago

@crobert-1 However, when I configure the WAL, no metrics are sent at all, and all I can find is this error log. I think this is not merely a logging problem.

I can confirm the same problem. No metrics are sent and I see this "not found" log message.

github-actions[bot] commented 3 weeks ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.