open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

prometheusremotewrite emits noisy errors on empty data points #4972

Open nicks opened 3 years ago

nicks commented 3 years ago

Describe the bug: Here's the error message:

2020-12-01T00:03:54.421Z    error   exporterhelper/queued_retry.go:226  Exporting failed. The error is not retryable. Dropping data.    {"component_kind": "exporter", "component_type": "prometheusremotewrite", "component_name": "prometheusremotewrite", "error": "Permanent error: [Permanent error: nil data point. image_build_count is dropped; Permanent error: nil data point. image_build_duration_dist is dropped]", "dropped_items": 2}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
    /home/circleci/project/exporter/exporterhelper/queued_retry.go:226
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
    /home/circleci/project/exporter/exporterhelper/metricshelper.go:115
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
    /home/circleci/project/exporter/exporterhelper/queued_retry.go:128
github.com/jaegertracing/jaeger/pkg/queue.(*BoundedQueue).StartConsumers.func1
    /home/circleci/go/pkg/mod/github.com/jaegertracing/jaeger@v1.21.0/pkg/queue/bounded_queue.go:77

Here's the metric descriptor emitted by logging exporter for the same metric:

Metric #0
Descriptor:
     -> Name: image_build_count
     -> Description: Image build count
     -> Unit: ms
     -> DataType: IntSum
     -> IsMonotonic: true
     -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE

I don't know enough about the contracts here to know whether this is a bug in the OpenCensus code I'm using to send the metric, in the batcher, in the prometheusremotewrite exporter, or something else entirely.

What did you expect to see? No error messages

What did you see instead? An error message

What version did you use? Docker image: otel/opentelemetry-collector:0.15.0

What config did you use?

    extensions:
      health_check:
      pprof:
        endpoint: 0.0.0.0:1777
      zpages:
        endpoint: 0.0.0.0:55679

    receivers:
      opencensus:
        endpoint: "0.0.0.0:55678"

    processors:
      memory_limiter:
        check_interval: 5s
        limit_mib: 4000
        spike_limit_mib: 500
      batch:

    exporters:
      logging:
        loglevel: debug
      prometheusremotewrite:
        endpoint: "http://.../api/v1/prom/write?db=tilt"
        insecure: true

    service:
      extensions: [health_check, pprof, zpages]
      pipelines:
        metrics:
          receivers: [opencensus]
          exporters: [logging, prometheusremotewrite]

Environment OS: (e.g., "Ubuntu 20.04") Compiler (if manually compiled): (e.g., "go 14.2")


github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions[bot] commented 1 year ago

Pinging code owners for exporter/prometheusremotewrite: @Aneurysm9. See Adding Labels via Comments if you do not have permissions to add labels yourself.

kovrus commented 1 year ago

The part about the noisy error message should be fixed now; each metric type now has a condition that handles the case of empty data points.
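
For later readers: the guard described here boils down to checking, per metric type, that the data point slice is non-empty before converting the metric. Below is a minimal sketch against the pdata API; it is an illustration only, not the exporter's actual code, and the hasDataPoints name is made up.

    package main

    import (
        "fmt"

        "go.opentelemetry.io/collector/pdata/pmetric"
    )

    // hasDataPoints reports whether a metric carries at least one data point.
    // Illustrative only: the exporter-side guard described above amounts to a
    // per-type length check like this.
    func hasDataPoints(m pmetric.Metric) bool {
        switch m.Type() {
        case pmetric.MetricTypeGauge:
            return m.Gauge().DataPoints().Len() > 0
        case pmetric.MetricTypeSum:
            return m.Sum().DataPoints().Len() > 0
        case pmetric.MetricTypeHistogram:
            return m.Histogram().DataPoints().Len() > 0
        case pmetric.MetricTypeExponentialHistogram:
            return m.ExponentialHistogram().DataPoints().Len() > 0
        case pmetric.MetricTypeSummary:
            return m.Summary().DataPoints().Len() > 0
        default:
            return false
        }
    }

    func main() {
        // Rebuild the shape from the original report: a monotonic cumulative
        // sum metric that never received any data points.
        m := pmetric.NewMetric()
        m.SetName("image_build_count")
        sum := m.SetEmptySum()
        sum.SetIsMonotonic(true)
        sum.SetAggregationTemporality(pmetric.AggregationTemporalityCumulative)

        // With a guard like this, the metric is skipped instead of becoming a
        // "Permanent error: empty data points" log line.
        fmt.Println(hasDataPoints(m)) // false
    }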

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

kovrus commented 1 year ago

@nicks do you still have this issue?

nicks commented 1 year ago

nope, let's close it!

gogreen53 commented 1 year ago

I think there may have been a regression: I'm seeing this behavior again, and it makes the logs completely unusable for debugging. Version: opentelemetry-collector-contrib:0.75.0

2023-06-27T17:55:17.713Z    error   exporterhelper/queued_retry.go:401  Exporting failed. The error is not retryable. Dropping data.    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: empty data points. xxxx_download_size is dropped", "dropped_items": 16}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
    go.opentelemetry.io/collector/exporter@v0.75.0/exporterhelper/queued_retry.go:401
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
    go.opentelemetry.io/collector/exporter@v0.75.0/exporterhelper/metrics.go:136
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
    go.opentelemetry.io/collector/exporter@v0.75.0/exporterhelper/queued_retry.go:205
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
    go.opentelemetry.io/collector/exporter@v0.75.0/exporterhelper/internal/bounded_memory_queue.go:58

Boeller666 commented 1 year ago

Updated to the latest version of the opentelemetry-operator (0.33.0), but same here:

2023-07-05T08:38:26.666Z  error   exporterhelper/queued_retry.go:391  Exporting failed. The error is not retryable. Dropping data.    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: empty data points. XXX is dropped", "dropped_items": 38}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
    go.opentelemetry.io/collector/exporter@v0.80.0/exporterhelper/queued_retry.go:391
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
    go.opentelemetry.io/collector/exporter@v0.80.0/exporterhelper/metrics.go:125
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
    go.opentelemetry.io/collector/exporter@v0.80.0/exporterhelper/queued_retry.go:195
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
    go.opentelemetry.io/collector/exporter@v0.80.0/exporterhelper/internal/bounded_memory_queue.go:47

github-actions[bot] commented 10 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

cyberw commented 9 months ago

@crlnz's PR seemed to have a nice fix/workaround, but it was closed.

We’re still experiencing this issue (on v0.83, but we'll try to update to v0.89 and see if that helps). @Aneurysm9 @rapphil

2023-11-24T14:25:25.375+0100 error exporterhelper/queued_retry.go:391 Exporting failed. The error is not retryable. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite/mimir", "error": "Permanent error: empty data points. xxx.socket.io.push.size is dropped; empty data points. xxx.socket.io.push.size is dropped; empty data points. xxx.socket.io.push.elapsed is dropped; empty data points. xxx.socket.io.push.size is dropped; empty data points. xxx.socket.io.push.size is dropped; empty data points. xxx.socket.io.push.elapsed is dropped; empty data points. xxx.socket.io.push.size is dropped; empty data points. xxx.socket.io.push.size is dropped; empty data points. xxx.socket.io.push.elapsed is dropped; empty data points. xxx.socket.io.push.size is dropped; empty data points. xxx.socket.io.push.elapsed is dropped", "dropped_items": 12077}

crlnz commented 9 months ago

@cyberw We're currently in the process of completing the EasyCLA internally, so this will be re-opened eventually. It's taking a little longer because we need to poke around regarding this issue. If anyone that has already signed the EasyCLA would like to take ownership of these changes, please feel free to fork my changes and open a new PR.

github-actions[bot] commented 7 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

esuh-descript commented 6 months ago

We are still running into this issue on 0.95.0:

2024-02-21T23:08:53.528Z    error   exporterhelper/queued_retry.go:391  Exporting failed. The error is not retryable. Dropping data.    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: empty data points. [metric_name] is dropped", "dropped_items": 19}

alxbl commented 5 months ago

Hello, I've done some investigation on my end because this issue is still affecting us in 0.97 as well. I think the main problem is that there is an inconsistency in what receivers push into the pipeline and what (some) exporters expect.

In our case, the issue happens with the windowsperfcounters receiver, which pre-allocates all metrics in the OTLP object before attempting to scrape. If a scrape fails (usually because the counter could not be opened in the first place), the receiver neither removes the metric from the OTLP message nor adds any data points to it. The message is then pushed down the pipeline and processed, and when prometheusremotewrite finally receives it, it loops through the metrics and complains about every metric without data points.

I haven't checked other receivers, but any receiver that does something like this will cause prometheusremotewrite to log that error.
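
To make that pattern concrete, here is a minimal sketch of such a receiver-side code path using the pdata API. This is not the actual windowsperfcounters code; the scrapeCounter helper and the metric name are made up for illustration.

    package main

    import (
        "errors"
        "fmt"

        "go.opentelemetry.io/collector/pdata/pmetric"
    )

    // scrapeCounter stands in for opening and reading a perf counter; it is a
    // hypothetical helper that always fails, as in the scenario above.
    func scrapeCounter() (float64, error) {
        return 0, errors.New("counter could not be opened")
    }

    func main() {
        md := pmetric.NewMetrics()
        sm := md.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty()

        // The metric is pre-allocated before the scrape is attempted...
        m := sm.Metrics().AppendEmpty()
        m.SetName("some.perf.counter") // hypothetical metric name
        gauge := m.SetEmptyGauge()

        // ...but the scrape fails, so no data point is ever appended, and the
        // metric is not removed from the payload either.
        if value, err := scrapeCounter(); err == nil {
            gauge.DataPoints().AppendEmpty().SetDoubleValue(value)
        }

        // The empty metric still travels down the pipeline; an exporter that
        // treats zero data points as an error then logs once per scrape cycle.
        fmt.Println(m.Gauge().DataPoints().Len()) // 0
    }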

There are a few ways I can think of to fix this:

  1. Receivers should not be allowed to send metrics with empty data points down the pipeline (not sure how this plays with the spec, though)
  2. There could be a processor that drops empty metrics (maybe transform/filter can already do this?); a rough sketch of this idea follows below
  3. Exporters should silently ignore empty metrics that are not a direct result of their own manipulation of the data

I personally feel like option 1 is the best, as it reduces unnecessary processing/exporting work. Option 2 might be a decent temporary workaround, and option 3 seems like it could lead down a path of undetected data loss.
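
For what option 2 could look like in practice, here is a rough sketch (not an existing processor) that strips empty metrics from a pmetric.Metrics payload, reusing a per-type check like the hasDataPoints helper sketched in an earlier comment:

    // dropEmptyMetrics removes every metric that has no data points and prunes
    // scopes and resources that become empty as a result. Hypothetical helper;
    // hasDataPoints is the per-type length check sketched earlier in the thread.
    func dropEmptyMetrics(md pmetric.Metrics) {
        md.ResourceMetrics().RemoveIf(func(rm pmetric.ResourceMetrics) bool {
            rm.ScopeMetrics().RemoveIf(func(sm pmetric.ScopeMetrics) bool {
                sm.Metrics().RemoveIf(func(m pmetric.Metric) bool {
                    return !hasDataPoints(m)
                })
                return sm.Metrics().Len() == 0
            })
            return rm.ScopeMetrics().Len() == 0
        })
    }

Run inside a processor, something like this would silence the noise for every exporter at once, at the cost of an extra pass over the data and a visibility concern similar to option 3: dropped metrics become harder to notice.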

In the broader sense, there is also the problem of an OTLP client pushing a message with empty metrics. In that case it's not clear whether the OTLP receiver should reject the message as malformed or drop the empty metrics before pushing them into the pipeline. (I haven't checked, but if this is already specified, then receivers should probably implement a similar behavior.)

In my specific case with the Windows perf counters, the failure is already logged by the receiver (once) at startup, and then results in one error from prometheusremotewrite per scrape cycle.


My plan is to open a PR that fixes the scrape behavior of windowsperfcounters and link it to this issue, but it will not be a silver bullet. I think the community/maintainers will need to decide how we want to handle empty points in general for this to be fully fixed.

github-actions[bot] commented 3 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

alxbl commented 3 months ago

/label -Stale

The PR fixing windowsperfcounters (#32384) is pending review/merge, but this issue still needs to be addressed on a receiver-by-receiver basis, unless prometheusremotewrite decides that empty data points are not a log-worthy error (maybe a counter instead?).

github-actions[bot] commented 1 month ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.