dimitarvdimitrov opened 1 year ago
I have my changes on the `dimitar/ha-dedup-on-every-sample-bench` branch. The current version does a lot of copying. I think we can at least halve it and also use the memory pools in mimirpb (a generic sketch of the pooling idea is at the end of this comment). Current benchmarks are below; the interesting bits are called out after the tables.
```
benchmark                                                                  old ns/op   new ns/op   delta
BenchmarkDistributor_Push/max_label_name_length_limit_reached-10             5798168     5553204   -4.22%
BenchmarkDistributor_Push/max_label_value_length_limit_reached-10            3388351     3225461   -4.81%
BenchmarkDistributor_Push/timestamp_too_new-10                                594279      567984   -4.42%
BenchmarkDistributor_Push/HA_dedup;_all_samples_same_replica-10               898983      888472   -1.17%
BenchmarkDistributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10     920753      598234   -35.03%
BenchmarkDistributor_Push/all_samples_successfully_pushed-10                  797174      763151   -4.27%
BenchmarkDistributor_Push/ingestion_rate_limit_reached-10                     544183      517112   -4.97%
BenchmarkDistributor_Push/too_many_labels_limit_reached-10                    994895      976238   -1.88%

benchmark                                                                  old allocs  new allocs  delta
BenchmarkDistributor_Push/max_label_name_length_limit_reached-10                 2142        2142  +0.00%
BenchmarkDistributor_Push/max_label_value_length_limit_reached-10                2141        2141  +0.00%
BenchmarkDistributor_Push/timestamp_too_new-10                                   2086        2084  -0.10%
BenchmarkDistributor_Push/HA_dedup;_all_samples_same_replica-10                    46          61  +32.61%
BenchmarkDistributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10          48         147  +206.25%
BenchmarkDistributor_Push/all_samples_successfully_pushed-10                       43          42  -2.33%
BenchmarkDistributor_Push/ingestion_rate_limit_reached-10                          46          46  +0.00%
BenchmarkDistributor_Push/too_many_labels_limit_reached-10                       2186        2186  +0.00%

benchmark                                                                  old bytes   new bytes   delta
BenchmarkDistributor_Push/max_label_name_length_limit_reached-10              120904      120783   -0.10%
BenchmarkDistributor_Push/max_label_value_length_limit_reached-10             105392      105326   -0.06%
BenchmarkDistributor_Push/timestamp_too_new-10                                 90103       89523   -0.64%
BenchmarkDistributor_Push/HA_dedup;_all_samples_same_replica-10               150586      220569   +46.47%
BenchmarkDistributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10     151540      131187   -13.43%
BenchmarkDistributor_Push/all_samples_successfully_pushed-10                  151064      150370   -0.46%
BenchmarkDistributor_Push/ingestion_rate_limit_reached-10                      25307       25298   -0.04%
BenchmarkDistributor_Push/too_many_labels_limit_reached-10                     80614       80575   -0.05%
```
- allocations are up "only" 200%; if reduced to <100%, we can consider including this as an opt-in
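For reference, the "memory pools" idea mentioned at the top is essentially slice reuse across requests. A minimal, generic sketch with `sync.Pool`; the actual pools in mimirpb differ in detail, and `TimeSeries`/`Label` here are stand-in types, not mimirpb's definitions:

```go
package distributor

import "sync"

// Stand-in types for illustration; not mimirpb's actual definitions.
type Label struct{ Name, Value string }
type TimeSeries struct{ Labels []Label }

// tsSlicePool hands out reusable slices so each push doesn't allocate a
// fresh buffer. Storing a *[]TimeSeries avoids re-boxing the slice header
// on every Put.
var tsSlicePool = sync.Pool{
	New: func() any {
		s := make([]TimeSeries, 0, 1024)
		return &s
	},
}

// getSeriesBuffer returns an empty slice with reusable capacity.
func getSeriesBuffer() *[]TimeSeries { return tsSlicePool.Get().(*[]TimeSeries) }

// putSeriesBuffer resets the slice and hands it back for reuse.
func putSeriesBuffer(s *[]TimeSeries) {
	*s = (*s)[:0] // reset length, keep capacity
	tsSlicePool.Put(s)
}
```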
I would much rather either make this The Behavior or not do it at all. Changing the way HA dedup works based on a configuration setting will make issues when using it even harder to figure out (which is already not easy).
I agree on having no new configuration option. Keep in mind our number one goal is to make Mimir easier to use. Less configuration, not more.
> `HA_dedup;_all_samples_same_replica-10`
I haven't looked at the code, but I wasn't expecting this case to be slower. Have you considered an optimization for the happy path, where we first analyse the series in the request and, if they all belong to the same cluster/replica, just keep the current logic as is?
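A rough sketch of what that check could look like (illustrative only: `haPair` and `haPairOf` are invented names, and `TimeSeries` is the stand-in type from the pooling sketch above, not Mimir's actual API):

```go
// haPair identifies the HA cluster/replica a series claims to come from.
type haPair struct{ cluster, replica string }

// haPairOf extracts the HA labels from a series; empty values mean the
// series carries no HA labels at all.
func haPairOf(ts TimeSeries) haPair {
	var p haPair
	for _, l := range ts.Labels {
		switch l.Name {
		case "cluster":
			p.cluster = l.Value
		case "__replica__":
			p.replica = l.Value
		}
	}
	return p
}

// allSameReplica reports whether every series in the batch claims the same
// cluster/replica. If it does, the existing "decide once per batch" dedup
// logic remains correct and the slower per-series path can be skipped.
func allSameReplica(series []TimeSeries) (haPair, bool) {
	if len(series) == 0 {
		return haPair{}, true
	}
	first := haPairOf(series[0])
	for _, ts := range series[1:] {
		if haPairOf(ts) != first {
			return haPair{}, false
		}
	}
	return first, true
}
```

The scan is a single pass over the labels, so homogeneous batches (the common case) pay one cheap check instead of the full per-series machinery.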
I got some updated benchmarks. Does this look good enough performance-wise? I think the 125% allocations increase is rather misleading: we still do very few allocations compared to our unhappy-path benchmarks, allocations stay below 100 in absolute terms, and in terms of actual bytes allocated there's no regression.

I still haven't investigated the ~10% slowdown for the `timestamp_too_new` benchmark, but it may well be a fluke.

If this looks OK, I will start refactoring the code to make it more digestible. Otherwise, I will try to squeeze a bit more out of it.
```
benchmark                                                                  old ns/op   new ns/op   delta
BenchmarkDistributor_Push/timestamp_too_new-10                               4766185     5293119   +11.06%
BenchmarkDistributor_Push/HA_dedup;_all_samples_same_replica-10               848159      956263   +12.75%
BenchmarkDistributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10     868131      785052   -9.57%
BenchmarkDistributor_Push/all_samples_successfully_pushed-10                  803742      798753   -0.62%
BenchmarkDistributor_Push/ingestion_rate_limit_reached-10                     600571      599008   -0.26%
BenchmarkDistributor_Push/too_many_labels_limit_reached-10                   5433408     5281949   -2.79%
BenchmarkDistributor_Push/max_label_name_length_limit_reached-10            10327507    10173125   -1.49%
BenchmarkDistributor_Push/max_label_value_length_limit_reached-10            8083825     8490655   +5.03%

benchmark                                                                  old allocs  new allocs  delta
BenchmarkDistributor_Push/timestamp_too_new-10                                   2085        2086  +0.05%
BenchmarkDistributor_Push/HA_dedup;_all_samples_same_replica-10                    44          44  +0.00%
BenchmarkDistributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10          44          99  +125.00%
BenchmarkDistributor_Push/all_samples_successfully_pushed-10                       41          42  +2.44%
BenchmarkDistributor_Push/ingestion_rate_limit_reached-10                          46          47  +2.17%
BenchmarkDistributor_Push/too_many_labels_limit_reached-10                       2187        2188  +0.05%
BenchmarkDistributor_Push/max_label_name_length_limit_reached-10                 2159        2158  -0.05%
BenchmarkDistributor_Push/max_label_value_length_limit_reached-10                2158        2159  +0.05%

benchmark                                                                  old bytes   new bytes   delta
BenchmarkDistributor_Push/timestamp_too_new-10                                 90285       90085   -0.22%
BenchmarkDistributor_Push/HA_dedup;_all_samples_same_replica-10               151575      150887   -0.45%
BenchmarkDistributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10     151367       85778   -43.33%
BenchmarkDistributor_Push/all_samples_successfully_pushed-10                  151247      150567   -0.45%
BenchmarkDistributor_Push/ingestion_rate_limit_reached-10                      25411       25288   -0.48%
BenchmarkDistributor_Push/too_many_labels_limit_reached-10                     80912       81183   +0.33%
BenchmarkDistributor_Push/max_label_name_length_limit_reached-10              124883      124374   -0.41%
BenchmarkDistributor_Push/max_label_value_length_limit_reached-10             109394      109551   +0.14%
```
Another proposal, from @colega, is to reject the request if it contains multiple HA replicas (a sketch follows). This would make detecting the problem easier too, but would still require some effort from users to change their setup. However, it would be a behaviour change for the HA tracker; we should verify whether, at least in Grafana Cloud, there are customers relying on the current behaviour.
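A sketch of that alternative, reusing the hypothetical `haPairOf` helper from the earlier sketch (again illustrative, not actual Mimir code):

```go
import "fmt"

// validateSingleReplica fails the whole request when it mixes HA replicas,
// instead of silently dropping part of it.
func validateSingleReplica(series []TimeSeries) error {
	if len(series) == 0 {
		return nil
	}
	first := haPairOf(series[0])
	for i, ts := range series[1:] {
		if p := haPairOf(ts); p != first {
			// Naming both pairs in the error makes the misconfigured setup
			// easy to spot on the client side.
			return fmt.Errorf("request mixes HA replicas: series 0 has %+v, series %d has %+v", first, i+1, p)
		}
	}
	return nil
}
```

Rejecting keeps the distributor's per-request work flat and turns the misconfiguration into a visible client error rather than silent sample drops.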
I rebased the `ha-dedup-on-every-sample-bench` branch and added some special cases for push requests with a single HA replica. Below are the benchmark results I got. There is only a 2.7% time degradation for the current common case (`HA_dedup;_all_samples_same_replica`). I think this is acceptable, and I suspect the impact will be negligible once protobuf parsing is taken into account. I am considering cleaning up the code and opening a PR next.
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/mimir/pkg/distributor
                                                                   │ before.txt  │              after.txt              │
                                                                   │   sec/op    │    sec/op     vs base               │
Distributor_Push/max_label_name_length_limit_reached-10              2.883m ± 2%   2.829m ± 1%   -1.87% (p=0.010 n=8)
Distributor_Push/timestamp_too_new-10                                456.8µ ± 1%   463.3µ ± 1%   +1.42% (p=0.002 n=8)
Distributor_Push/all_samples_go_to_metric_relabel_configs-10         1.471m ± 2%   1.471m ± 0%        ~ (p=1.000 n=8)
Distributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10     774.2µ ± 1%   564.8µ ± 1%  -27.04% (p=0.000 n=8)
Distributor_Push/all_samples_successfully_pushed-10                  706.7µ ± 1%   709.5µ ± 2%        ~ (p=0.704 n=8)
Distributor_Push/ingestion_rate_limit_reached-10                     341.2µ ± 0%   327.1µ ± 1%   -4.14% (p=0.000 n=8)
Distributor_Push/too_many_labels_limit_reached-10                    527.9µ ± 3%   535.4µ ± 2%   +1.43% (p=0.010 n=8)
Distributor_Push/max_label_value_length_limit_reached-10             475.3µ ± 1%   482.3µ ± 3%        ~ (p=0.061 n=8)
Distributor_Push/HA_dedup;_all_samples_same_replica-10               764.3µ ± 1%   784.9µ ± 1%   +2.70% (p=0.000 n=8)
geomean                                                              743.4µ        718.8µ        -3.31%

                                                                   │  before.txt  │              after.txt               │
                                                                   │     B/op     │     B/op       vs base               │
Distributor_Push/max_label_name_length_limit_reached-10              166.9Ki ± 0%   167.0Ki ± 0%        ~ (p=0.223 n=8)
Distributor_Push/timestamp_too_new-10                                141.2Ki ± 0%   140.9Ki ± 0%   -0.17% (p=0.000 n=8)
Distributor_Push/all_samples_go_to_metric_relabel_configs-10         1.306Mi ± 0%   1.306Mi ± 0%   -0.06% (p=0.036 n=8)
Distributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10    154.77Ki ± 0%   84.47Ki ± 1%  -45.42% (p=0.000 n=8)
Distributor_Push/all_samples_successfully_pushed-10                  155.4Ki ± 0%   155.2Ki ± 1%        ~ (p=0.430 n=8)
Distributor_Push/ingestion_rate_limit_reached-10                     5.097Ki ± 1%   5.101Ki ± 1%        ~ (p=0.860 n=8)
Distributor_Push/too_many_labels_limit_reached-10                    132.3Ki ± 0%   132.3Ki ± 0%        ~ (p=0.703 n=8)
Distributor_Push/max_label_value_length_limit_reached-10             153.6Ki ± 0%   153.6Ki ± 0%        ~ (p=0.860 n=8)
Distributor_Push/HA_dedup;_all_samples_same_replica-10               154.5Ki ± 0%   154.8Ki ± 1%        ~ (p=0.062 n=8)
geomean                                                              132.0Ki        123.4Ki        -6.51%

                                                                   │ before.txt  │              after.txt               │
                                                                   │  allocs/op  │  allocs/op     vs base               │
Distributor_Push/max_label_name_length_limit_reached-10              3.164k ± 0%   3.165k ± 0%    +0.03% (p=0.007 n=8)
Distributor_Push/timestamp_too_new-10                                3.127k ± 0%   3.127k ± 0%         ~ (p=1.000 n=8) ¹
Distributor_Push/all_samples_go_to_metric_relabel_configs-10         7.244k ± 0%   7.245k ± 0%         ~ (p=0.069 n=8)
Distributor_Push/HA_dedup;_4_clusters_8_replicas_evenly_split-10      58.50 ± 1%    118.00 ± 2%  +101.71% (p=0.000 n=8)
Distributor_Push/all_samples_successfully_pushed-10                   56.50 ± 3%     58.00 ± 3%         ~ (p=0.122 n=8)
Distributor_Push/ingestion_rate_limit_reached-10                      55.00 ± 0%     55.00 ± 0%         ~ (p=1.000 n=8) ¹
Distributor_Push/too_many_labels_limit_reached-10                    3.234k ± 0%   3.234k ± 0%         ~ (p=1.000 n=8) ¹
Distributor_Push/max_label_value_length_limit_reached-10             3.172k ± 0%   3.172k ± 0%         ~ (p=1.000 n=8) ¹
Distributor_Push/HA_dedup;_all_samples_same_replica-10                58.00 ± 2%    58.50 ± 3%         ~ (p=0.471 n=8)
geomean                                                               582.8         632.5          +8.53%
¹ all samples are equal
```
> Only prombot* add HA labels
This is the only place "prom_bot" appears in the description. What does it refer to?
Sorry, I meant to write something else:

> Only `prom_a_*` and `prom_b_*` add HA labels (`__replica__` and `cluster`) to metrics.
I won't be able to continue working on this in the foreseeable future. The code is on the `ha-dedup-on-every-sample-bench` branch in case someone wants to pick it up.
This is a feature request to be able to deduplicate HA Prometheus samples within a remote-write request.
Current state
Mimir supports receiving duplicated samples from multiple promethei and actually writing the samples of only one of them (Docs).

When a Prometheus sends a remote-write request, the distributor looks at the HA labels of the first sample only. The distributor assumes that all the samples in the remote-write request are from the same Prometheus replica.
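In rough pseudocode (stand-in types and the hypothetical `haPairOf` helper from the sketches above; `checkReplica` stands in for the HA tracker lookup, this is not the actual pkg/distributor code):

```go
// dedupBatch sketches the current behaviour: the cluster/__replica__ pair is
// read from the first series only, and the accept/reject decision applies to
// the entire request.
func dedupBatch(series []TimeSeries, checkReplica func(cluster, replica string) bool) bool {
	if len(series) == 0 {
		return true
	}
	pair := haPairOf(series[0])
	// One HA tracker lookup decides the fate of the whole batch.
	return checkReplica(pair.cluster, pair.replica)
}
```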
Feature request
Make the distributor look at each sample in the write request and apply the deduplication logic to each sample separately, instead of applying it to all samples in the batch.
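A sketch of what this could look like under the same stand-in types, caching the decision per cluster/replica pair so the number of HA tracker lookups is bounded by the number of distinct pairs rather than by the number of series:

```go
// dedupPerSeries keeps only the series whose replica is currently elected,
// deciding per series instead of once per batch. It filters in place to
// avoid allocating a second slice.
func dedupPerSeries(series []TimeSeries, checkReplica func(cluster, replica string) bool) []TimeSeries {
	decisions := make(map[haPair]bool)
	kept := series[:0]
	for _, ts := range series {
		pair := haPairOf(ts)
		accept, seen := decisions[pair]
		if !seen {
			accept = checkReplica(pair.cluster, pair.replica)
			decisions[pair] = accept
		}
		if accept {
			kept = append(kept, ts)
		}
	}
	return kept
}
```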
Use case
This can be useful when a user has set up Prometheus federation with replication. At the top is Mimir. Right below it are two per-datacenter promethei (`prom_1`, `prom_2`). Below those are two pairs of HA promethei (`prom_a_1`, `prom_a_2`, `prom_b_1`, `prom_b_2`). `prom_a_1` and `prom_a_2` scrape the same targets, and `prom_b_1`, `prom_b_2` scrape the same targets. All of `prom_a_1`, `prom_a_2`, `prom_b_1`, and `prom_b_2` remote-write to both `prom_1` and `prom_2`. Only `prom_a_*` and `prom_b_*` add HA labels (`__replica__` and `cluster`) to metrics.

Now the remote-write batches of `prom_1` and `prom_2` to Mimir can contain samples from any of the four bottom promethei. So in this case a batch can contain mixed samples from an elected replica (e.g. `prom_a_1`) and a secondary replica (e.g. `prom_b_1`). If the first sample in a batch is from `prom_b_1`, then the samples from `prom_a_1` will also be discarded, even though `prom_a_1` is the elected replica.

In this case the user wants Mimir to choose one replica from `prom_a_1`|`prom_a_2` and one from `prom_b_1`|`prom_b_2` and consistently accept samples from each, regardless of how the batches are assembled.

Other notes
I have some sort of a POC locally. I will open a PR with some benchmarks soon.