This issue is just me documenting my findings from trying to remove the expensive sort that the querier does here. Some manual tests I did in a staging environment showed that the sort alone takes 50% of the entire time needed to fetch and deduplicate all series.
That change alone is not sufficient to completely remove a very expensive re-sorting of the data which is mandatory for deduplication to work correctly. We need to re-sort series because the dedup process relies on series being sorted without taking into account their replica labels, so that series that are replicas of each other end up together in the series set. Such sorting is not easy to achieve in stores without buffering, because the TSDB has no notion of replica labels, and can only return series sorted with replica labels as part of the comparison function.
After stripping or removing the replica label, series would end up in unsorted order and a=1, z=1 will not be properly deduplicated.
The checks added in https://github.com/thanos-io/thanos/pull/5296 can still create false positive when a store sends data for multiple replicas, but the data is already sorted in the same order as needed for deduplication. A concrete case would be a layered setup where one querier connects to the proxy of another querier. The proxy in this case would properly sort data for dedup, but multi-replica detection would trigger a global sort in the root querier. There can also be cases where store-gateway sends multi-replica data sorted in order appropriate for dedup, such as:
There are two potential solutions that come to mind. We can pursue either of them or come up with something else.
Replace the multi-replica detection with a check where we assert that series coming from each store are sorted correctly for dedup. This can be done by keeping track of the last received series in the asyncResponseSet. We can then compare the current series with the previous, without taking into account replica labels. In this case we still need to wait for all series to be received before starting deduplication. The reason for this is that until we have seen the last series, we cannot know whether data is properly sorted for dedup.
If we merge https://github.com/thanos-io/thanos/pull/5692, replica labels can be propagated to stores in the storepb.SeriesRequest. This will allows us to sort series in a manner that is ready for dedup directly in individual stores. However, this will be challening to do without buffering, since we cannot iterate through series from TSDB in an arbitrary order. TSDB has its own sorting order which is not aware of replica labels. If we manage to achieve this, all detections from the querier can be removed and we can rely on stores sending data in dedup-ready order.
Note
We have been aiming at removing buffering in the seriesServer completely and streaming series on-the-fly into the engine. While this seems like removing a large bottleneck, in practice the benefits could be fairly marginal. PromQL evaluation is done step-by-step, which means we need to access all series immediatelly for the first step. As a result, the current engine expands all series as soon as it receives them, so removing the additional buffering only saves us one linear pass through all series.
This issue is just me documenting my findings from trying to remove the expensive sort that the querier does here. Some manual tests I did in a staging environment showed that the sort alone takes 50% of the entire time needed to fetch and deduplicate all series.
This issue also proposes some additional improvements that can be done as follow ups of https://github.com/thanos-io/thanos/pull/5692, or even directly in that PR.
Problem
Thanks to https://github.com/thanos-io/thanos/pull/5296, we now have a mechanism to merge series coming from multiple stores on the fly.
That change alone is not sufficient to completely remove a very expensive re-sorting of the data which is mandatory for deduplication to work correctly. We need to re-sort series because the dedup process relies on series being sorted without taking into account their replica labels, so that series that are replicas of each other end up together in the series set. Such sorting is not easy to achieve in stores without buffering, because the TSDB has no notion of replica labels, and can only return series sorted with replica labels as part of the comparison function.
PR https://github.com/thanos-io/thanos/pull/5692 adds detections which allow us to avoid this sort in certain cases. The sort is not necessary if:
Condition 2. is mandatory because data sent in this order from a single store is not ready for dedup:
After stripping or removing the
replica
label, series would end up in unsorted order anda=1, z=1
will not be properly deduplicated.The checks added in https://github.com/thanos-io/thanos/pull/5296 can still create false positive when a store sends data for multiple replicas, but the data is already sorted in the same order as needed for deduplication. A concrete case would be a layered setup where one querier connects to the proxy of another querier. The proxy in this case would properly sort data for dedup, but multi-replica detection would trigger a global sort in the root querier. There can also be cases where store-gateway sends multi-replica data sorted in order appropriate for dedup, such as:
Potential solutions
There are two potential solutions that come to mind. We can pursue either of them or come up with something else.
Replace the multi-replica detection with a check where we assert that series coming from each store are sorted correctly for dedup. This can be done by keeping track of the last received series in the asyncResponseSet. We can then compare the current series with the previous, without taking into account replica labels. In this case we still need to wait for all series to be received before starting deduplication. The reason for this is that until we have seen the last series, we cannot know whether data is properly sorted for dedup.
If we merge https://github.com/thanos-io/thanos/pull/5692, replica labels can be propagated to stores in the storepb.SeriesRequest. This will allows us to sort series in a manner that is ready for dedup directly in individual stores. However, this will be challening to do without buffering, since we cannot iterate through series from TSDB in an arbitrary order. TSDB has its own sorting order which is not aware of replica labels. If we manage to achieve this, all detections from the querier can be removed and we can rely on stores sending data in dedup-ready order.
Note
We have been aiming at removing buffering in the seriesServer completely and streaming series on-the-fly into the engine. While this seems like removing a large bottleneck, in practice the benefits could be fairly marginal. PromQL evaluation is done step-by-step, which means we need to access all series immediatelly for the first step. As a result, the current engine expands all series as soon as it receives them, so removing the additional buffering only saves us one linear pass through all series.