grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

HA deduplication per-sample instead of per-batch #3199

dimitarvdimitrov opened this issue 1 year ago

This is a feature request to be able to deduplicate HA prometheus samples within a remote-write request.

Current state

Mimir supports receiving duplicated samples from multiple promethei and writing the samples of only one of them. Docs

When a prometheus sends a remote-write request, the distributor looks only at the labels of the first sample. The distributor assumes that all the samples in the remote-write request come from the same prometheus replica.

Feature request

Make the distributor look at each sample in the write request and apply the deduplication logic on each sample separately instead of applying it on all samples in the batch.
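
To make the difference concrete, here is a minimal Go sketch contrasting the two behaviours. All names (series, haTracker, dedupPerBatch, dedupPerSeries) are hypothetical simplifications, not Mimir's actual distributor code; the real HA tracker elects replicas via a KV store with failover timeouts.

```go
package main

import "fmt"

// series is a simplified stand-in for a remote-write timeseries.
type series struct {
	cluster string // value of the "cluster" HA label
	replica string // value of the "__replica__" HA label
	name    string
}

// haTracker remembers the elected replica per cluster. This sketch elects
// the first replica it sees and never fails over.
type haTracker struct{ elected map[string]string }

// accept reports whether samples from (cluster, replica) should be kept.
func (t *haTracker) accept(cluster, replica string) bool {
	if e, ok := t.elected[cluster]; ok {
		return e == replica
	}
	t.elected[cluster] = replica
	return true
}

// dedupPerBatch mimics the current behaviour: the decision for the whole
// request is based on the labels of the first series only.
func dedupPerBatch(t *haTracker, req []series) []series {
	if len(req) == 0 || !t.accept(req[0].cluster, req[0].replica) {
		return nil
	}
	return req
}

// dedupPerSeries is what this issue proposes: apply the election to every
// series independently.
func dedupPerSeries(t *haTracker, req []series) []series {
	kept := req[:0]
	for _, s := range req {
		if t.accept(s.cluster, s.replica) {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	// prom_a_1 and prom_b_1 are the elected replicas. The mixed batch below
	// starts with a sample from the non-elected prom_b_2.
	t := &haTracker{elected: map[string]string{"a": "prom_a_1", "b": "prom_b_1"}}
	mixed := []series{
		{cluster: "b", replica: "prom_b_2", name: "up"},
		{cluster: "a", replica: "prom_a_1", name: "up"},
	}
	fmt.Println(len(dedupPerBatch(t, mixed)))  // 0: the whole batch is dropped
	fmt.Println(len(dedupPerSeries(t, mixed))) // 1: prom_a_1's sample survives
}
```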

Use case

This can be useful when a user has set up prometheus federation with replication. At the top is Mimir. Right below it are two per-datacenter promethei (prom_1, prom_2). Below those are two pairs of HA promethei (prom_a_1, prom_a_2, prom_b_1, prom_b_2).

prom_a_1 and prom_a_2 scrape the same targets and prom_b_1, prom_b_2 scrape the same targets. All prom_a_1, prom_a_2, prom_b_1, and prom_b_2 remote-write to both prom_1 and prom_2. Only prom_a_* and prom_b_* add HA labels (__replica__ and cluster) to metrics.

Now the remote-write batches of prom_1 and prom_2 to Mimir can contain samples from any of the four bottom promethei. So in this case the batch can contain mixed samples from an elected replica (e.g. prom_a_1) and a secondary replica (e.g. prom_b_1). If the first sample in a batch is from prom_b_1 then the samples from prom_a_1 will also be discarded even though prom_a_1 is the elected replica.

In this case the user wants Mimir to choose one replica from prom_a_1|prom_a_2 and one from prom_b_1|prom_b_2 and to consistently accept samples from each, regardless of how the batches are assembled.

Other notes

I have some sort of a POC locally. I will open a PR with some benchmarks soon.

dimitarvdimitrov commented 1 year ago

I have my changes on the dimitar/ha-dedup-on-every-sample-bench branch. The current version does a lot of copying; I think we can at least halve that and also use the memory pools in mimirpb. Current benchmarks are below.

Interesting bits are that:

  • allocations are up "only" 200%; if reduced to <100%, we can consider including this as an opt-in

benchmark                                                                     old ns/op     new ns/op     delta
BenchmarkDistributor_Push/max_label_name_length_limit_reached-10              5798168       5553204       -4.22%
BenchmarkDistributor_Push/max_label_value_length_limit_reached-10             3388351       3225461       -4.81%
BenchmarkDistributor_Push/timestamp_too_new-10                                594279        567984        -4.42%
BenchmarkDistributor_Push/HA_dedup;_all_samples_same_replica-10               898983        888472        -1.17%
BenchmarkDistributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10     920753        598234        -35.03%
BenchmarkDistributor_Push/all_samples_successfully_pushed-10                  797174        763151        -4.27%
BenchmarkDistributor_Push/ingestion_rate_limit_reached-10                     544183        517112        -4.97%
BenchmarkDistributor_Push/too_many_labels_limit_reached-10                    994895        976238        -1.88%

benchmark                                                                     old allocs     new allocs     delta
BenchmarkDistributor_Push/max_label_name_length_limit_reached-10              2142           2142           +0.00%
BenchmarkDistributor_Push/max_label_value_length_limit_reached-10             2141           2141           +0.00%
BenchmarkDistributor_Push/timestamp_too_new-10                                2086           2084           -0.10%
BenchmarkDistributor_Push/HA_dedup;_all_samples_same_replica-10               46             61             +32.61%
BenchmarkDistributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10     48             147            +206.25%
BenchmarkDistributor_Push/all_samples_successfully_pushed-10                  43             42             -2.33%
BenchmarkDistributor_Push/ingestion_rate_limit_reached-10                     46             46             +0.00%
BenchmarkDistributor_Push/too_many_labels_limit_reached-10                    2186           2186           +0.00%

benchmark                                                                     old bytes     new bytes     delta
BenchmarkDistributor_Push/max_label_name_length_limit_reached-10              120904        120783        -0.10%
BenchmarkDistributor_Push/max_label_value_length_limit_reached-10             105392        105326        -0.06%
BenchmarkDistributor_Push/timestamp_too_new-10                                90103         89523         -0.64%
BenchmarkDistributor_Push/HA_dedup;_all_samples_same_replica-10               150586        220569        +46.47%
BenchmarkDistributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10     151540        131187        -13.43%
BenchmarkDistributor_Push/all_samples_successfully_pushed-10                  151064        150370        -0.46%
BenchmarkDistributor_Push/ingestion_rate_limit_reached-10                     25307         25298         -0.04%
BenchmarkDistributor_Push/too_many_labels_limit_reached-10                    80614         80575         -0.05%
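
About the copying and the mimirpb memory pools mentioned above: the snippet below is a generic illustration of the buffer-reuse pattern using the standard library's sync.Pool, not mimirpb's actual API. It reuses the hypothetical series type from the sketch in the issue description.

```go
import "sync"

// seriesSlicePool reuses series buffers across requests. mimirpb's pools are
// more elaborate; this only shows the pattern. A pointer to the slice is
// pooled to avoid an extra allocation on every Put.
var seriesSlicePool = sync.Pool{
	New: func() any { s := make([]series, 0, 1024); return &s },
}

func getSeriesSlice() *[]series { return seriesSlicePool.Get().(*[]series) }

func putSeriesSlice(s *[]series) {
	*s = (*s)[:0] // reset length, keep capacity for the next request
	seriesSlicePool.Put(s)
}
```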
56quarters commented 1 year ago

> allocations are up "only" 200%; if reduced to <100%, we can consider including this as an opt-in

I would much rather either make this The Behavior or not do it at all. Changing the way HA dedupe works based on a configuration setting will make issues even harder to figure out when using it (which is already not easy).

pracucci commented 1 year ago

I agree on having no new configuration option. Keep in mind our number one goal is to make Mimir easier to use. Less configuration, not more.

pracucci commented 1 year ago

> HA_dedup;_all_samples_same_replica-10

I haven't looked at the code, but I wasn't expecting this case to be slower. Have you considered an optimization for the happy path, where we first analyze the series in the request and, if they all belong to the same cluster/replica, just keep the current logic as is?
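
In terms of the hypothetical sketch from the issue description, that happy-path optimization could look roughly like this (singleReplicaRequest and dedup are made-up helper names):

```go
// singleReplicaRequest reports whether every series in the request carries
// the same cluster/replica pair as the first one.
func singleReplicaRequest(req []series) bool {
	if len(req) == 0 {
		return true
	}
	for _, s := range req[1:] {
		if s.cluster != req[0].cluster || s.replica != req[0].replica {
			return false
		}
	}
	return true
}

// dedup keeps the cheap per-batch path for the common case and falls back to
// per-series deduplication only when the request actually mixes replicas.
func dedup(t *haTracker, req []series) []series {
	if singleReplicaRequest(req) {
		return dedupPerBatch(t, req)
	}
	return dedupPerSeries(t, req)
}
```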

dimitarvdimitrov commented 1 year ago

I got some updated benchmarks. Does this look good enough performance-wise? I think the 125% increase in allocations is rather misleading: we still do very few allocations compared to our unhappy-path benchmarks, allocations are still below 100 in absolute terms, and in terms of actual bytes allocated there's no regression.

I still haven't investigated the 10% slowdown in the timestamp_too_new benchmark, but it may well be a fluke.

If this looks OK, I will start refactoring the code to make it more digestible. Otherwise, I will try to squeeze a bit more out of it.

benchmark                                                                     old ns/op     new ns/op     delta
BenchmarkDistributor_Push/timestamp_too_new-10                                4766185       5293119       +11.06%
BenchmarkDistributor_Push/HA_dedup;_all_samples_same_replica-10               848159        956263        +12.75%
BenchmarkDistributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10     868131        785052        -9.57%
BenchmarkDistributor_Push/all_samples_successfully_pushed-10                  803742        798753        -0.62%
BenchmarkDistributor_Push/ingestion_rate_limit_reached-10                     600571        599008        -0.26%
BenchmarkDistributor_Push/too_many_labels_limit_reached-10                    5433408       5281949       -2.79%
BenchmarkDistributor_Push/max_label_name_length_limit_reached-10              10327507      10173125      -1.49%
BenchmarkDistributor_Push/max_label_value_length_limit_reached-10             8083825       8490655       +5.03%

benchmark                                                                     old allocs     new allocs     delta
BenchmarkDistributor_Push/timestamp_too_new-10                                2085           2086           +0.05%
BenchmarkDistributor_Push/HA_dedup;_all_samples_same_replica-10               44             44             +0.00%
BenchmarkDistributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10     44             99             +125.00%
BenchmarkDistributor_Push/all_samples_successfully_pushed-10                  41             42             +2.44%
BenchmarkDistributor_Push/ingestion_rate_limit_reached-10                     46             47             +2.17%
BenchmarkDistributor_Push/too_many_labels_limit_reached-10                    2187           2188           +0.05%
BenchmarkDistributor_Push/max_label_name_length_limit_reached-10              2159           2158           -0.05%
BenchmarkDistributor_Push/max_label_value_length_limit_reached-10             2158           2159           +0.05%

benchmark                                                                     old bytes     new bytes     delta
BenchmarkDistributor_Push/timestamp_too_new-10                                90285         90085         -0.22%
BenchmarkDistributor_Push/HA_dedup;_all_samples_same_replica-10               151575        150887        -0.45%
BenchmarkDistributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10     151367        85778         -43.33%
BenchmarkDistributor_Push/all_samples_successfully_pushed-10                  151247        150567        -0.45%
BenchmarkDistributor_Push/ingestion_rate_limit_reached-10                     25411         25288         -0.48%
BenchmarkDistributor_Push/too_many_labels_limit_reached-10                    80912         81183         +0.33%
BenchmarkDistributor_Push/max_label_name_length_limit_reached-10              124883        124374        -0.41%
BenchmarkDistributor_Push/max_label_value_length_limit_reached-10             109394        109551        +0.14%
dimitarvdimitrov commented 1 year ago

Another proposal, from @colega, is to reject the remote-write request if it contains multiple HA replicas. This would make detecting the problem easier too, but would still require some effort from users to change their setup. However, it would be a behaviour change for the HA tracker, so we should verify whether, at least in Grafana Cloud, there are customers relying on the current behaviour.
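
In terms of the earlier hypothetical sketch, the rejecting variant could look like this (errMixedReplicas is a made-up error, not an existing Mimir error; it reuses singleReplicaRequest from the sketch above):

```go
import "errors"

// errMixedReplicas is a made-up error for this sketch.
var errMixedReplicas = errors.New("remote-write request contains samples from multiple HA replicas")

// rejectMixedReplicas fails the whole request instead of deduplicating it,
// surfacing misconfigured senders to the user immediately.
func rejectMixedReplicas(req []series) error {
	if !singleReplicaRequest(req) {
		return errMixedReplicas
	}
	return nil
}
```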

dimitarvdimitrov commented 12 months ago

I rebased the ha-dedup-on-every-sample-bench branch and added some special cases for push requests with a single HA replica. Below are the benchmark results I got. There is only a 2.7% time degradation for the current common case (HA_dedup;_all_samples_same_replica). I think this is acceptable, and I suspect the impact will be negligible once protobuf parsing is taken into account. I am considering cleaning up the code and opening a PR next.

Details

```
goos: darwin
goarch: arm64
pkg: github.com/grafana/mimir/pkg/distributor
                                                                  │ before.txt  │             after.txt              │
                                                                  │   sec/op    │   sec/op     vs base               │
Distributor_Push/max_label_name_length_limit_reached-10             2.883m ± 2%   2.829m ± 1%   -1.87% (p=0.010 n=8)
Distributor_Push/timestamp_too_new-10                               456.8µ ± 1%   463.3µ ± 1%   +1.42% (p=0.002 n=8)
Distributor_Push/all_samples_go_to_metric_relabel_configs-10        1.471m ± 2%   1.471m ± 0%        ~ (p=1.000 n=8)
Distributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10    774.2µ ± 1%   564.8µ ± 1%  -27.04% (p=0.000 n=8)
Distributor_Push/all_samples_successfully_pushed-10                 706.7µ ± 1%   709.5µ ± 2%        ~ (p=0.704 n=8)
Distributor_Push/ingestion_rate_limit_reached-10                    341.2µ ± 0%   327.1µ ± 1%   -4.14% (p=0.000 n=8)
Distributor_Push/too_many_labels_limit_reached-10                   527.9µ ± 3%   535.4µ ± 2%   +1.43% (p=0.010 n=8)
Distributor_Push/max_label_value_length_limit_reached-10            475.3µ ± 1%   482.3µ ± 3%        ~ (p=0.061 n=8)
Distributor_Push/HA_dedup;_all_samples_same_replica-10              764.3µ ± 1%   784.9µ ± 1%   +2.70% (p=0.000 n=8)
geomean                                                             743.4µ        718.8µ        -3.31%

                                                                  │ before.txt   │             after.txt               │
                                                                  │     B/op     │     B/op      vs base               │
Distributor_Push/max_label_name_length_limit_reached-10             166.9Ki ± 0%   167.0Ki ± 0%        ~ (p=0.223 n=8)
Distributor_Push/timestamp_too_new-10                               141.2Ki ± 0%   140.9Ki ± 0%   -0.17% (p=0.000 n=8)
Distributor_Push/all_samples_go_to_metric_relabel_configs-10        1.306Mi ± 0%   1.306Mi ± 0%   -0.06% (p=0.036 n=8)
Distributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10   154.77Ki ± 0%   84.47Ki ± 1%  -45.42% (p=0.000 n=8)
Distributor_Push/all_samples_successfully_pushed-10                 155.4Ki ± 0%   155.2Ki ± 1%        ~ (p=0.430 n=8)
Distributor_Push/ingestion_rate_limit_reached-10                    5.097Ki ± 1%   5.101Ki ± 1%        ~ (p=0.860 n=8)
Distributor_Push/too_many_labels_limit_reached-10                   132.3Ki ± 0%   132.3Ki ± 0%        ~ (p=0.703 n=8)
Distributor_Push/max_label_value_length_limit_reached-10            153.6Ki ± 0%   153.6Ki ± 0%        ~ (p=0.860 n=8)
Distributor_Push/HA_dedup;_all_samples_same_replica-10              154.5Ki ± 0%   154.8Ki ± 1%        ~ (p=0.062 n=8)
geomean                                                             132.0Ki        123.4Ki        -6.51%

                                                                  │ before.txt  │             after.txt               │
                                                                  │  allocs/op  │  allocs/op    vs base               │
Distributor_Push/max_label_name_length_limit_reached-10             3.164k ± 0%    3.165k ± 0%   +0.03% (p=0.007 n=8)
Distributor_Push/timestamp_too_new-10                               3.127k ± 0%    3.127k ± 0%        ~ (p=1.000 n=8) ¹
Distributor_Push/all_samples_go_to_metric_relabel_configs-10        7.244k ± 0%    7.245k ± 0%        ~ (p=0.069 n=8)
Distributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10     58.50 ± 1%    118.00 ± 2%  +101.71% (p=0.000 n=8)
Distributor_Push/all_samples_successfully_pushed-10                  56.50 ± 3%     58.00 ± 3%        ~ (p=0.122 n=8)
Distributor_Push/ingestion_rate_limit_reached-10                     55.00 ± 0%     55.00 ± 0%        ~ (p=1.000 n=8) ¹
Distributor_Push/too_many_labels_limit_reached-10                   3.234k ± 0%    3.234k ± 0%        ~ (p=1.000 n=8) ¹
Distributor_Push/max_label_value_length_limit_reached-10            3.172k ± 0%    3.172k ± 0%        ~ (p=1.000 n=8) ¹
Distributor_Push/HA_dedup;_all_samples_same_replica-10               58.00 ± 2%     58.50 ± 3%        ~ (p=0.471 n=8)
geomean                                                              582.8          632.5        +8.53%
¹ all samples are equal
```

bboreham commented 8 months ago

> Only prombot* add HA labels

This is the only place "prom_bot" appears in the description. What does it refer to?

dimitarvdimitrov commented 8 months ago

Sorry, I meant to write something else:

> Only prom_a_* and prom_b_* add HA labels (__replica__ and cluster) to metrics.

dimitarvdimitrov commented 7 months ago

I won't be able to continue working on this in the foreseeable future. The code for it is on the ha-dedup-on-every-sample-bench branch, in case someone wants to pick it up.