vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Support Mimir high-availability deduplication in Prometheus remote write sink #20119

Open aitchjoe opened 6 months ago

aitchjoe commented 6 months ago

A note for the community

Use Cases

Our metric pipeline is:

+---------------------+
|  OpenShift Cluster  |
|  +---------------+  |        +-------------+        +-------------+
|  | Prometheus 1--|--|------->|    Vector   |        |             |
|  +---------------+  |        |  Aggregator |------->|    Mimir    |
|  | Prometheus 2--|--|------->|             |        |             |
|  +---------------+  |        +-------------+        +-------------+
+---------------------+
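For reference, the aggregator side of this topology corresponds to roughly the following Vector config, using a prometheus_remote_write source and sink. This is only a minimal sketch; the component names, listen address, and Mimir endpoint are placeholders, not our exact setup:

    # minimal sketch -- names, address, and endpoint are placeholders
    [sources.prom_rw_in]
    type    = "prometheus_remote_write"
    address = "0.0.0.0:9090"                     # both Prometheus instances remote write here

    [sinks.mimir_out]
    type     = "prometheus_remote_write"
    inputs   = ["prom_rw_in"]
    endpoint = "http://mimir-nginx/api/v1/push"  # Mimir push endpoint
    # batch.max_events defaults to 1000, so one request can mix samples from both replicas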

When Mimir enables high-availability deduplication, there are many err-mimir-sample-duplicate-timestamp errors in the mimir-distributor log:

    ts=2024-03-18T06:10:18.3960498Z caller=push.go:171 level=error user=o11y msg="push error" err="failed pushing to ingester mimir-ingester-0: user=o11y: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-03-18T06:10:06.211Z and is from series cluster_job_route:cortex_request_duration_seconds_bucket:sum_rate{cluster=\"mimir\", deployment_environment=\"test\", job=\"infra-o11y/query-scheduler\", k8s_cluster_name=\"paas-wh-02-lab\", k8s_namespace_name=\"infra-o11y\", le=\"25.0\", prometheus=\"openshift-user-workload-monitoring/user-workload\", route=\"ready\", service_namespace=\"o11y\"}"

But if Prometheus remote writes to Mimir directly, there is no error. After some debugging, we think it is caused by the batch config of the Prometheus remote write sink. Because the Vector aggregator receives data from two Prometheus instances, a single batch it sends to Mimir may mix data from both instances, and in Mimir's distributor.go:

    if d.limits.AcceptHASamples(userID) && len(req.Timeseries) > 0 {
        cluster, replica := findHALabels(d.limits.HAReplicaLabel(userID), d.limits.HAClusterLabel(userID), req.Timeseries[0].Labels)
        // Make a copy of these, since they may be retained as labels on our metrics, e.g. dedupedSamples.
        cluster, replica = copyString(cluster), copyString(replica)
        if span != nil {
            span.SetTag("cluster", cluster)
            span.SetTag("replica", replica)
        }
        removeReplica, err = d.checkSample(ctx, userID, cluster, replica)

That means Mimir only checks the first series in the batch (req.Timeseries[0].Labels) to decide whether to accept or drop the whole request, so after Mimir removes the replica label, err-mimir-sample-duplicate-timestamp happens. When we changed the Prometheus remote write sink's batch.max_events from the default 1000 to 1, the error was gone, which confirmed our guess.
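For completeness, that test amounted to roughly the following sink setting (the sink name is a placeholder). It confirms the diagnosis, but it is not a practical fix, because a batch size of 1 effectively disables batching:

    [sinks.mimir_out]
    type             = "prometheus_remote_write"
    # ...
    batch.max_events = 1   # default is 1000; 1 keeps replicas from mixing in a batch, at a large throughput cost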

Attempted Solutions

Proposal

Add a Prometheus HA config to the Prometheus remote write sink so that data from different clusters or replicas is not batched together.
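One possible shape for this, shown purely as a hypothetical sketch (a batch.partition_by_labels option does not exist in the sink today), would be to let the sink partition batches by Mimir's default HA labels, cluster and __replica__:

    # hypothetical sketch only -- this option does not exist in the current sink
    [sinks.mimir_out]
    type     = "prometheus_remote_write"
    inputs   = ["prom_rw_in"]
    endpoint = "http://mimir-nginx/api/v1/push"
    # never put series with different HA label values into the same remote write request
    batch.partition_by_labels = ["cluster", "__replica__"]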

References

No response

Version

vector 0.35.0

suslikas commented 5 months ago

I have a different use case, but I think my solution can solve your problem too: just modify the timestamp a little bit right after the aggregator.

"timestamp": parse_timestamp!(from_unix_timestamp!(to_unix_timestamp!(.original_timestamp, unit: "milliseconds") + random_int(1000, 9999), unit: "milliseconds"), "%+")
aitchjoe commented 5 months ago

> I have a different use case, but I think my solution can solve your problem too: just modify the timestamp a little bit right after the aggregator.

For Mimir HA deduplication, we need this metric to be dropped, even if we could change the timestamp to get it accepted.