cbos closed this issue 7 months ago.
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
I have looked a bit more into this problem. As I use Grafana Cloud, I thought this might be a problem with Mimir. So I did a new test with a local Prometheus and enabled remote write on that instance. But that gives the same behaviour.
I did some research on this and found this article: https://www.robustperception.io/staleness-and-promql/
If a Prometheus scrape detects that an instance is down, it marks all related time series stale with a stale marker. If a time series is not marked as stale and does not receive updates, Prometheus keeps returning the last value for 5 minutes. The Prometheus remote write spec covers this: https://prometheus.io/docs/concepts/remote_write_spec/#stale-markers
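For concreteness: the remote write spec defines the stale marker as a NaN with one specific bit pattern, so consumers have to compare the raw bits rather than the value. A minimal Go sketch (illustrative only, not collector code):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// The remote write spec defines the stale marker as the NaN with
	// exactly this bit pattern. An ordinary NaN check can't tell it
	// apart from other NaNs, so the raw bits are compared.
	staleNaN := math.Float64frombits(0x7ff0000000000002)
	fmt.Println(math.IsNaN(staleNaN))                             // true
	fmt.Println(math.Float64bits(staleNaN) == 0x7ff0000000000002) // true: marks the series stale
}
```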
But how can that be applied?
The prometheusreceiver can probably implement this. But how does that work for the otlpreceiver? If a (Java-)instrumented application crashes, it does not send metrics anymore. How fast will its series be marked as stale?
The same goes for the httpreceiver: old time series should be marked stale somehow. Alternatively, the prometheusremotewrite exporter could have a configuration option to mark time series stale if there are no updates after a configurable amount of time (see the sketch below).
5 minutes is a long time to detect that an application has stopped.
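To make that suggestion concrete, a sketch of what such an option could look like. The `stale_after` key is purely hypothetical; no such option exists in the exporter today:

```yaml
exporters:
  prometheusremotewrite:
    endpoint: https://mimir.example.com/api/v1/push
    # Hypothetical option, not implemented today: emit a staleness
    # marker for any series that has received no updates for this long.
    stale_after: 30s
```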
I believe the PRW exporter will not send points more than once unless it is retrying a failure. I strongly suspect what is happening is that prometheus displays a line for 5 minutes after it receives a point unless it receives a staleness marker. But since staleness markers are prometheus-specific, you won't get them when receiving data from non-prometheus sources.
The prometheusreceiver can probably implement this.
The prometheus receiver does implement this, and it should work correctly. It uses the OTLP data point flag for "no recorded value" to indicate that a series is stale. The PRW exporter should send a staleness marker when it sees that data point flag.
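As a rough illustration of that mapping (names here are mine, not the exporter's actual code), assuming the OTLP "no recorded value" flag is bit 1 as in the OTLP proto, and the stale-marker NaN from the remote write spec:

```go
package main

import (
	"fmt"
	"math"
)

// noRecordedValueMask corresponds to the OTLP DataPointFlags bit for
// "no recorded value" (value 1 in the proto).
const noRecordedValueMask uint32 = 1

// staleNaN is the stale-marker value from the remote write spec.
var staleNaN = math.Float64frombits(0x7ff0000000000002)

// sampleValue sketches what the PRW exporter conceptually does: if the
// data point carries the flag, emit the stale marker instead of the
// (meaningless) point value.
func sampleValue(flags uint32, value float64) float64 {
	if flags&noRecordedValueMask != 0 {
		return staleNaN
	}
	return value
}

func main() {
	fmt.Println(sampleValue(0, 42.0))                              // 42: ordinary point
	fmt.Println(math.IsNaN(sampleValue(noRecordedValueMask, 42.0))) // true: stale marker
}
```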
Overall, this is WAI, although the current UX isn't ideal. There are two potential paths forward: add a notion of staleness to the OTel specification so receivers and SDKs can set the "no recorded value" flag, or add an option to the PRW exporter to mark series stale after a configurable period without updates.
/cc @Aneurysm9 @rapphil
I solved it by setting the Mimir parameter lookback_delta: 1s.
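For reference, a sketch of where that setting sits, assuming it lives under Mimir's querier configuration block (the YAML counterpart of the -querier.lookback-delta flag, which mirrors Prometheus's --query.lookback-delta):

```yaml
querier:
  # How far back queries look for the most recent sample. The 5m
  # default is what makes a dead series linger; 1s makes it disappear
  # almost immediately, at the cost of gaps whenever the push interval
  # is longer than 1s.
  lookback_delta: 1s
```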
I don't think this should be marked as a bug, nor is it an issue with the remote write exporter. As mentioned in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/27893#issuecomment-1779605583, this is due to the other types of receivers not having staleness markers. When the metrics are ingested via a receiver that supports them, the remote write exporter sends them to the prometheus backend. What do you think @dashpole?
Agreed. I consider this a feature request to add a notion of staleness to OTel, which is presumably blocked on such a thing existing in the specification.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
This issue has been closed as inactive because it has been stale for 120 days with no activity.
Component(s)
exporter/prometheusremotewrite
What happened?
Description
prometheusremotewrite keeps sending data while the receiver only provides data once in a while or has stopped delivering data.
What is the problem with that:
The httpreceiver will first report one set of values. As soon as it runs into problems, the new data from the httpreceiver changes, but the data actually sent by prometheusremotewrite stays the same for 5 minutes. As soon as this is flaky, you will not see the switches either.
Steps to Reproduce
It is easy to reproduce with the influxdb receiver (as provided in the separate config).
Expected Result
It is unexpected behaviour that it keeps sending. A single datapoint should only be sent once. If there is no other option, then at least make it configurable how long stale data keeps being sent.
Actual Result
The prometheus endpoint shows this data for the period defined with metric_expiration. For the prometheus endpoint I can understand that you don't know if the endpoint has been scraped; prometheus has the setting send_timestamps: true, and with that you can see when the last value was updated, so a scraper can detect old/stale data (see the snippet below). prometheusremotewrite keeps sending the data for 5 minutes.
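For comparison, the two settings referred to above on the collector's prometheus exporter (values here are illustrative):

```yaml
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    # How long the endpoint keeps serving a series that receives no
    # updates; this is the expiration period referred to above.
    metric_expiration: 5m
    # Expose the original sample timestamps so a scraper can detect
    # old/stale data.
    send_timestamps: true
```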
Collector version
0.87.0
Environment information
OpenTelemetry Collector 0.87.0 docker container: otel/opentelemetry-collector-contrib:0.87.0
OpenTelemetry Collector configuration
Log output
Additional context
No response