thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Receive: Group replicated requests by endpoint #5807

Open matej-g opened 1 year ago

matej-g commented 1 year ago

Related to issues / work in https://github.com/thanos-io/thanos/pull/5791, https://github.com/thanos-io/thanos/issues/5784#issuecomment-1279147288, https://github.com/thanos-io/thanos/pull/5604

After https://github.com/thanos-io/thanos/pull/5604 was introduced, we started to group replicated requests by every endpoint-replica combination, instead of just by endpoint (which was acceptable before the introduction of the new hashing mechanism). With these changes, we may now end up with up to r * n requests, where r is the replication factor and n is the number of nodes on which the series land (n <= total number of nodes). Previously this number was equal to r, as we were sending one replication request per node.
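
For illustration, here is a minimal sketch of why this grouping is bounded by r * n; the types and the hash are hypothetical placeholders, not the actual Thanos receive code:

```go
package main

import "fmt"

type series struct{ labels string }

type endpointReplica struct {
	endpoint string
	replica  int
}

// getEndpoint stands in for the hashring lookup (e.g. the ketama-based GetN);
// it is a placeholder, not the real algorithm.
func getEndpoint(s series, replica int, nodes []string) string {
	h := 0
	for _, c := range s.labels {
		h = h*31 + int(c)
	}
	return nodes[(h+replica)%len(nodes)]
}

func main() {
	nodes := []string{"receive-0", "receive-1", "receive-2"}
	replicationFactor := 3
	input := []series{{"a"}, {"b"}, {"c"}, {"d"}}

	// Current behaviour: one batch (and hence one forward request) per
	// (endpoint, replica) combination.
	batches := map[endpointReplica][]series{}
	for _, s := range input {
		for r := 0; r < replicationFactor; r++ {
			er := endpointReplica{endpoint: getEndpoint(s, r, nodes), replica: r}
			batches[er] = append(batches[er], s)
		}
	}
	fmt.Printf("outgoing requests: %d (bounded by r*n = %d)\n",
		len(batches), replicationFactor*len(nodes))
}
```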

We could still simplify this by batching by endpoint instead, in which case we would end up with a number of requests <= number of nodes. This would have the advantage of sending fewer, larger forward requests instead of a larger number of smaller ones. Quorum confirmation would then depend on how many requests (i.e. nodes) we can afford to lose while still guaranteeing quorum.
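
A corresponding sketch of the proposed endpoint-only grouping, again with hypothetical types and hard-coded assignments rather than the real hashring, including the per-series quorum check this would imply:

```go
package main

import "fmt"

type seriesID string

func main() {
	replicationFactor := 3
	quorum := replicationFactor/2 + 1

	// replicaEndpoints[series][replica] = endpoint chosen by the hashring
	// (hard-coded here; in reality this comes from the hashring lookup).
	replicaEndpoints := map[seriesID][]string{
		"a": {"receive-0", "receive-1", "receive-2"},
		"b": {"receive-1", "receive-2", "receive-0"},
	}

	// Group by endpoint only: one forward request per endpoint, carrying every
	// replica of every series destined for that endpoint.
	perEndpoint := map[string][]seriesID{}
	for s, eps := range replicaEndpoints {
		for _, ep := range eps {
			perEndpoint[ep] = append(perEndpoint[ep], s)
		}
	}
	fmt.Println("outgoing requests:", len(perEndpoint)) // <= number of nodes

	// If an endpoint request fails, quorum must still be evaluated per series,
	// counting how many of its replica writes succeeded.
	failed := map[string]bool{"receive-2": true}
	for s, eps := range replicaEndpoints {
		ok := 0
		for _, ep := range eps {
			if !failed[ep] {
				ok++
			}
		}
		fmt.Printf("series %s: %d/%d replicas written, quorum met: %v\n",
			s, ok, replicationFactor, ok >= quorum)
	}
}
```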

This change would need to be introduced gradually, though, if we'd like to keep it backwards compatible. This is because the logic that batches the receive requests requires the replica number to work properly (which would not be available if we grouped by endpoint only). However, since the change in https://github.com/thanos-io/thanos/pull/5604, this batching can be skipped altogether for replicated requests, since they already arrive 'pre-batched' by the replication method.
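
In other words, the receiving side could short-circuit for already-replicated requests, roughly along these lines (a hypothetical sketch, not the actual handler code):

```go
package receive

// WriteRequest is a stand-in for the incoming forward request; the real proto
// message and field names differ.
type WriteRequest struct {
	Replicated bool     // set by the sender once replication has been performed
	Series     []string // placeholder for the actual time series payload
}

// handle shows the shortcut: an already-replicated request was batched by the
// sender's replication step, so no replica number is needed locally.
func handle(req WriteRequest, writeLocal func([]string) error, distribute func([]string) error) error {
	if req.Replicated {
		return writeLocal(req.Series)
	}
	// First hop: shard and replicate to the responsible endpoints.
	return distribute(req.Series)
}
```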

The caveat is that if a user has a mix of old and new version receivers (which is the typical case during a rollout, where receivers are rolled out one node at a time), the change described above could cause different versions of the receiver to handle requests differently. However, the change would be safe if we instruct users to upgrade only between two consecutive versions (e.g. from 0.30.0 to 0.31.0) and not to skip a version. For this, we would need to: 1) skip batching for already replicated requests in one release; 2) start grouping requests by endpoint only in the subsequent release; 3) inform users not to skip versions.

bwplotka commented 1 year ago

I would prefer this solution, TBH. The current algorithm is quite convoluted. I would propose introducing replication into the Get API of the sharding, so we can get multiple nodes for series X covering multiple replicas, OR re-batch things (group them per endpoint) once the distribution change happens:

[image attachment]
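
A rough sketch of what a replication-aware Get API could look like; names and signatures are illustrative only, not the existing Thanos hashring interface:

```go
package hashring

// Series stands in for the routed time series (e.g. prompb.TimeSeries).
type Series struct {
	Labels string
}

// Hashring here exposes a replica-aware lookup instead of one call per
// (series, replica) pair.
type Hashring interface {
	// GetReplicas returns replicationFactor endpoints for the given tenant and
	// series, one per replica, in replica order.
	GetReplicas(tenant string, s Series, replicationFactor uint64) ([]string, error)
}

// GroupByEndpoint batches series per endpoint using the replica-aware lookup,
// so replication and grouping both happen at the very beginning of batching.
func GroupByEndpoint(h Hashring, tenant string, series []Series, rf uint64) (map[string][]Series, error) {
	out := make(map[string][]Series)
	for _, s := range series {
		endpoints, err := h.GetReplicas(tenant, s, rf)
		if err != nil {
			return nil, err
		}
		for _, ep := range endpoints {
			out[ep] = append(out[ep], s)
		}
	}
	return out, nil
}
```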

> This is because the logic that batches the receive requests requires the replica number to work properly (which would not be available if we grouped by endpoint only).

Why not move this logic into the sharding itself, at the very beginning of batching? 🤔

matej-g commented 1 year ago

> Why not move this logic into the sharding itself, at the very beginning of batching? 🤔

We could do either of the suggested options (push sharding into the Get method of the hashring and/or do the sharding at the beginning of batching); however, I don't think it would help us get past the potential compatibility issue (series distribution) between versions.

bwplotka commented 1 year ago

I'm not sure the one-time pain of migration is worth a complex and potentially incorrect algorithm (since it's hard to ensure its correctness), but maybe there is a way to at least mitigate the consequences of the migration. Some resharding on a Thanos upgrade is not too bad; perhaps some logic for switching to the new code only when all nodes in the quorum are upgraded would also help?
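
As a very rough illustration of that last idea (purely hypothetical; no such capability-discovery mechanism exists today):

```go
package receive

// PeerCapabilities would have to be discovered somehow (e.g. via a status
// endpoint or a field in the forward response); this discovery mechanism is an
// assumption, not an existing API.
type PeerCapabilities struct {
	SupportsEndpointGrouping bool
}

// useEndpointGrouping decides whether the sender can safely switch to the new
// endpoint-only grouping: only when every known peer supports it.
func useEndpointGrouping(peers map[string]PeerCapabilities) bool {
	if len(peers) == 0 {
		return false
	}
	for _, c := range peers {
		if !c.SupportsEndpointGrouping {
			return false
		}
	}
	return true
}
```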