vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Support Mimir high-availability deduplication in Prometheus remote write sink #20119

Open aitchjoe opened 6 months ago

aitchjoe commented 6 months ago

A note for the community

Use Cases

Our metric pipeline is:

+---------------------+
|  OpenShift Cluster  |
|  +---------------+  |        +-------------+        +-------------+
|  | Prometheus 1--|--|------->|    Vector   |        |             |
|  +---------------+  |        |  Aggregator |------->|    Mimir    |
|  | Prometheus 2--|--|------->|             |        |             |
|  +---------------+  |        +-------------+        +-------------+
+---------------------+
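For reference, the aggregator side of this topology corresponds to roughly the following Vector config, using a prometheus_remote_write source and sink. This is only a minimal sketch; the component names, listen address, and Mimir endpoint are placeholders, not our exact setup:

    # minimal sketch -- names, address, and endpoint are placeholders
    [sources.prom_rw_in]
    type    = "prometheus_remote_write"
    address = "0.0.0.0:9090"                     # both Prometheus instances remote write here

    [sinks.mimir_out]
    type     = "prometheus_remote_write"
    inputs   = ["prom_rw_in"]
    endpoint = "http://mimir-nginx/api/v1/push"  # Mimir push endpoint
    # batch.max_events defaults to 1000, so one request can mix samples from both replicas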

When Mimir enables high-availability deduplication, there are many err-mimir-sample-duplicate-timestamp errors in the mimir-distributor log:

    ts=2024-03-18T06:10:18.3960498Z caller=push.go:171 level=error user=o11y msg="push error" err="failed pushing to ingester mimir-ingester-0: user=o11y: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-03-18T06:10:06.211Z and is from series cluster_job_route:cortex_request_duration_seconds_bucket:sum_rate{cluster=\"mimir\", deployment_environment=\"test\", job=\"infra-o11y/query-scheduler\", k8s_cluster_name=\"paas-wh-02-lab\", k8s_namespace_name=\"infra-o11y\", le=\"25.0\", prometheus=\"openshift-user-workload-monitoring/user-workload\", route=\"ready\", service_namespace=\"o11y\"}"

But if Prometheus remote writes to Mimir directly, there is no error. After some debugging, we think it is caused by the batch config of the Prometheus remote write sink. Because the Vector aggregator receives data from two Prometheus instances, a single batch it sends to Mimir may mix data from both instances, and in Mimir's distributor.go:

    if d.limits.AcceptHASamples(userID) && len(req.Timeseries) > 0 {
        cluster, replica := findHALabels(d.limits.HAReplicaLabel(userID), d.limits.HAClusterLabel(userID), req.Timeseries[0].Labels)
        // Make a copy of these, since they may be retained as labels on our metrics, e.g. dedupedSamples.
        cluster, replica = copyString(cluster), copyString(replica)
        if span != nil {
            span.SetTag("cluster", cluster)
            span.SetTag("replica", replica)
        }
        removeReplica, err = d.checkSample(ctx, userID, cluster, replica)

That means Mimir only checks the first series in the batch (req.Timeseries[0].Labels) to decide whether to accept or drop the whole request, so after Mimir removes the replica label, err-mimir-sample-duplicate-timestamp happens. When we changed the Prometheus remote write sink's batch.max_events from the default 1000 to 1, the error was gone, which confirmed our guess.
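For completeness, that test amounted to roughly the following sink setting (the sink name is a placeholder). It confirms the diagnosis, but it is not a practical fix, because a batch size of 1 effectively disables batching:

    [sinks.mimir_out]
    type             = "prometheus_remote_write"
    # ...
    batch.max_events = 1   # default is 1000; 1 keeps replicas from mixing in a batch, at a large throughput cost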

Attempted Solutions

Proposal

Add a Prometheus HA config to the Prometheus remote write sink so that data from different clusters or replicas is not batched together.
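One possible shape for this, shown purely as a hypothetical sketch (a batch.partition_by_labels option does not exist in the sink today), would be to let the sink partition batches by Mimir's default HA labels, cluster and __replica__:

    # hypothetical sketch only -- this option does not exist in the current sink
    [sinks.mimir_out]
    type     = "prometheus_remote_write"
    inputs   = ["prom_rw_in"]
    endpoint = "http://mimir-nginx/api/v1/push"
    # never put series with different HA label values into the same remote write request
    batch.partition_by_labels = ["cluster", "__replica__"]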

References

No response

Version

vector 0.35.0

suslikas commented 5 months ago

I have a different use case, but I think my solution can solve your problem too: just modify the timestamp a little bit right after the aggregator.

"timestamp": parse_timestamp!(from_unix_timestamp!(to_unix_timestamp!(.original_timestamp, unit: "milliseconds") + random_int(1000, 9999), unit: "milliseconds"), "%+")
aitchjoe commented 5 months ago

> I have a different use case, but I think my solution can solve your problem too: just modify the timestamp a little bit right after the aggregator.

For Mimir HA deduplication, we need this metric to be dropped, even if we could change the timestamp to get it accepted.