open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Exporter prometheusremotewrite keeps sending data for 5m while receiver has only 1 data point #27893

Closed cbos closed 7 months ago

cbos commented 1 year ago

Component(s)

exporter/prometheusremotewrite

What happened?

Description

prometheusremotewrite keeps sending data while the receiver only provides data once in a while or has stopped delivering data.

Why is that a problem? Take the httpcheck receiver as an example. When the checked endpoint starts returning 500, the receiver only reports:

httpcheck.status{http.status_class:2xx, http.status_code:500,...} = 0
httpcheck.status{http.status_class:5xx, http.status_code:500,...} = 1

But what actually gets reported via prometheusremotewrite for 5 minutes is:

httpcheck.status{http.status_class:2xx, http.status_code:200,...} = 1
httpcheck.status{http.status_class:5xx, http.status_code:200,...} = 0
httpcheck.status{http.status_class:2xx, http.status_code:500,...} = 0
httpcheck.status{http.status_class:5xx, http.status_code:500,...} = 1

As soon as the endpoint is flaky, you will not see the switches between 200 and 500 either.
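
For context, these series come from the httpcheck receiver; a minimal configuration along these lines (the target URL and interval are just placeholders) produces one httpcheck.status series per status class:

receivers:
  httpcheck:
    targets:
      # placeholder endpoint, replace with the URL being checked
      - endpoint: http://example.com/health
        method: GET
    collection_interval: 30s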

Steps to Reproduce

It is easy to reproduce with the influxdb receiver (see the collector configuration below):

curl --request POST "http://localhost:8086/api/v2/write?precision=ns" \
  --header "Content-Type: text/plain; charset=utf-8" \
  --header "Accept: application/json" \
  --data-binary "
    airSensors,sensor_id=TLM0201 temperature=75.97038159354763 $(date +%s)000000000
    "

Expected Result

It is unexpected behaviour to keep sending. A single data point should only be sent once. If there is no other option, then at least make it configurable how long stale data keeps being sent.

Actual Result

Collector version

0.87.0

Environment information

OpenTelemetry Collector 0.87.0 docker container: otel/opentelemetry-collector-contrib:0.87.0

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
      http:

  influxdb:
    endpoint: 0.0.0.0:8086

processors:
  batch:

exporters:
  prometheusremotewrite/grafana_cloud_metrics:
    endpoint: "https://....grafana.net/api/prom/push"
    auth:
      authenticator: basicauth/grafana_cloud_prometheus

  prometheus:
    endpoint: "0.0.0.0:8889"
    send_timestamps: true
    metric_expiration: 23s
    resource_to_telemetry_conversion:
      enabled: true

  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200

service:
  pipelines:
    metrics:
      receivers: [otlp, influxdb]
      processors: []
      exporters: [prometheusremotewrite/grafana_cloud_metrics, prometheus, debug]

Log output

2023-10-20T19:23:25.717Z        info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2023-10-20T19:23:25.717Z        info    ResourceMetrics #0
Resource SchemaURL: 
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope  
Metric #0
Descriptor:
     -> Name: airSensors_temperature
     -> Description: 
     -> Unit: 
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> sensor_id: Str(TLM0201)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2023-10-20 19:23:25 +0000 UTC
Value: 75.970382
        {"kind": "exporter", "data_type": "metrics", "name": "debug"}

Additional context

No response

github-actions[bot] commented 1 year ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

cbos commented 1 year ago

I have looked a bit more into this problem. As I use Grafana Cloud, I thought this might be a problem with Mimir. So I did a new test with a local Prometheus and enabled remote write on that instance. But that gives the same behaviour.


I did some research on this and found this article: https://www.robustperception.io/staleness-and-promql/

If a Prometheus scrape detects that an instance is down, it marks all related time series as stale with a staleness marker. If a time series is not marked as stale and does not receive updates, Prometheus keeps returning the last value for 5 minutes. The Prometheus remote write spec covers this: https://prometheus.io/docs/concepts/remote_write_spec/#stale-markers

But how can that be applied?

The prometheusreceiver can probably implement this. But how does that work for the otlpreceiver? If a (Java) instrumented application crashes, it does not send metrics anymore. How fast will its series be marked as stale?

The same applies to the httpcheck receiver: old time series should be marked stale somehow. Or the prometheusremotewrite exporter should get a configuration option to mark time series stale if there are no updates after a configurable amount of time.

5 minutes is a long time to detect if an application has stopped.
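
To illustrate the idea, something like the sketch below is what I mean; the stale_after option is hypothetical and does not exist in the exporter today:

exporters:
  prometheusremotewrite:
    endpoint: "https://example.net/api/prom/push"
    # hypothetical option, not implemented: emit a Prometheus staleness
    # marker for any series that has not received a new data point
    # within this window
    stale_after: 30s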

dashpole commented 1 year ago

I believe the PRW exporter will not send points more than once unless it is retrying a failure. I strongly suspect what is happening is that prometheus displays a line for 5 minutes after it receives a point unless it receives a staleness marker. But since staleness markers are prometheus-specific, you won't get them when receiving data from non-prometheus sources.

The prometheusreceiver can implement this probably.

The prometheus receiver does implement this, and it should work correctly. It uses the OTLP data point flag for "no recorded value" to indicate that a series is stale. The PRW exporter should send a staleness marker when it sees that data point flag.
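
For reference, a minimal scrape-based pipeline looks roughly like this (job name, target, and endpoint are placeholders); when the scraped target goes down, the receiver sets the "no recorded value" flag and the PRW exporter turns it into a staleness marker:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "my-app"               # placeholder job name
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:8080"]  # placeholder scrape target

exporters:
  prometheusremotewrite:
    endpoint: "https://example.net/api/prom/push"  # placeholder endpoint

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]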

Overall, this is working as intended, although the current UX isn't ideal. There are two potential paths forward:

  1. Push exporters (e.g. OTLP, influx) start sending a version of staleness markers (e.g. by sending "no recorded value" points on shutdown).
  2. The prometheus server uses service discovery to determine which applications it expects data to be pushed from, and generates staleness markers when the "discovered entity" disappears.
jwcesign commented 1 year ago

/cc @Aneurysm9 @rapphil

jwcesign commented 1 year ago

I solved it by setting the Mimir parameter lookback_delta: 1s
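
For reference, assuming I have the Mimir configuration layout right, that parameter goes under the querier block (the equivalent CLI flag is -querier.lookback-delta):

# Grafana Mimir configuration (assumed layout)
querier:
  lookback_delta: 1s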

jmichalek132 commented 1 year ago

I don't think this should be marked as a bug, nor is it an issue with the remote write exporter. As mentioned in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/27893#issuecomment-1779605583, this is due to the other types of receivers not having staleness markers. When metrics are ingested via a receiver that does support them, the remote write exporter sends them to the Prometheus backend. What do you think @dashpole?

dashpole commented 12 months ago

Agreed. I consider this a feature request to add a notion of staleness to OTel, which is presumably blocked on such a thing existing in the specification.

github-actions[bot] commented 9 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 7 months ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.