Open dimitarvdimitrov opened 1 year ago
- the querier sends a TriggerBlockLoad([]ULID) map[ULID]bool (naming suggestions welcome) RPC to each of the 3 store-gateways concurrently
Maybe we could move this up a level and make one call to each store-gateway that will be involved in the query (instead of making a request to 3 of them for each block)?
Naming: I've heard "pre-flight checks" used for something similar in the past.
- to be less aggressive the store-gateway can enqueue the blocks which we are lazy-loading. This can also be controlled via blocks-storage.bucket-store.meta-sync-concurrency
Did you mean -blocks-storage.bucket-store.index-header.lazy-loading-concurrency?
- the querier sends a TriggerBlockLoad([]ULID) map[ULID]bool (naming suggestions welcome) RPC to each of the 3 store-gateways concurrently
Maybe we could move this up a level to make a call to all store-gateways that will be involved in the query (instead of making a request to 3 of them for each block)?
Today only one store-gateway per block is involved in the query. I think we'd need to involve all 3 replicas for every block in the query. That way we can choose the most ready store-gateway.
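To make "choose the most ready store-gateway" concrete, here is a rough querier-side sketch in Go. Everything in it except the TriggerBlockLoad signature is an assumption made for illustration: the BlockLoader interface, the ULID stand-in type, the fakeGateway helper, and the fallback to replica 0 when no replica is ready are all hypothetical and not taken from the Mimir codebase.

```go
package main

import (
	"fmt"
	"sync"
)

// ULID stands in for a block ID; the real type is not shown in this thread.
type ULID string

// BlockLoader is a hypothetical client interface for the proposed RPC.
type BlockLoader interface {
	// TriggerBlockLoad reports, per block, whether its index header is
	// already loaded.
	TriggerBlockLoad(blocks []ULID) map[ULID]bool
}

// pickReplicas asks every replica concurrently and returns, for each block,
// the index of the replica to query: the first one reporting the block as
// loaded, or replica 0 (an assumed fallback) if none has it loaded yet.
func pickReplicas(replicas []BlockLoader, blocks []ULID) map[ULID]int {
	responses := make([]map[ULID]bool, len(replicas))

	var wg sync.WaitGroup
	for i, r := range replicas {
		wg.Add(1)
		go func(i int, r BlockLoader) {
			defer wg.Done()
			// Each goroutine writes only to its own slot, so no lock is needed.
			responses[i] = r.TriggerBlockLoad(blocks)
		}(i, r)
	}
	wg.Wait()

	chosen := make(map[ULID]int, len(blocks))
	for _, b := range blocks {
		chosen[b] = 0
		for i, resp := range responses {
			if resp[b] {
				chosen[b] = i
				break
			}
		}
	}
	return chosen
}

// fakeGateway is an in-memory stand-in used only for this example.
type fakeGateway struct{ loaded map[ULID]bool }

func (g fakeGateway) TriggerBlockLoad(blocks []ULID) map[ULID]bool {
	out := make(map[ULID]bool, len(blocks))
	for _, b := range blocks {
		out[b] = g.loaded[b]
	}
	return out
}

func main() {
	replicas := []BlockLoader{
		fakeGateway{loaded: map[ULID]bool{}},
		fakeGateway{loaded: map[ULID]bool{"block-A": true}},
		fakeGateway{loaded: map[ULID]bool{"block-A": true, "block-B": true}},
	}
	chosen := pickReplicas(replicas, []ULID{"block-A", "block-B", "block-C"})
	fmt.Println(chosen["block-A"], chosen["block-B"], chosen["block-C"]) // prints "1 2 0"
}
```

A real implementation would also need timeouts and error handling for the pre-request, which this sketch leaves out.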
- to be less aggressive the store-gateway can enqueue the blocks which we are lazy-loading. This can also be controlled via blocks-storage.bucket-store.meta-sync-concurrency
Did you mean -blocks-storage.bucket-store.index-header.lazy-loading-concurrency?
Yes. This has already been done since this issue was opened.
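For readers following along, the store-gateway side of the idea (answer immediately, then lazy-load in the background with bounded concurrency) could look roughly like the Go sketch below. It is not how Mimir implements it: the gateway struct, newGateway, loadOne, and the load callback are hypothetical names, and a buffered channel is used as a simple semaphore in place of whatever the real concurrency limit does.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ULID stands in for a block ID.
type ULID string

// gateway is a hypothetical, heavily simplified store-gateway.
type gateway struct {
	mu     sync.Mutex
	loaded map[ULID]bool  // index headers that are ready
	queued map[ULID]bool  // loads already enqueued or in progress
	sem    chan struct{}  // buffered channel used as a concurrency limit
	wg     sync.WaitGroup // lets callers wait for background loads
	load   func(ULID)     // stand-in for actually loading an index header
}

func newGateway(concurrency int, load func(ULID)) *gateway {
	return &gateway{
		loaded: map[ULID]bool{},
		queued: map[ULID]bool{},
		sem:    make(chan struct{}, concurrency),
		load:   load,
	}
}

// TriggerBlockLoad returns the current loaded state of each block and
// enqueues a background load for every block that is not loaded yet.
func (g *gateway) TriggerBlockLoad(blocks []ULID) map[ULID]bool {
	g.mu.Lock()
	defer g.mu.Unlock()

	out := make(map[ULID]bool, len(blocks))
	for _, b := range blocks {
		out[b] = g.loaded[b]
		if g.loaded[b] || g.queued[b] {
			continue // nothing to do, or a load is already pending
		}
		g.queued[b] = true
		g.wg.Add(1)
		go g.loadOne(b)
	}
	return out
}

func (g *gateway) loadOne(b ULID) {
	defer g.wg.Done()
	g.sem <- struct{}{}        // block until a load slot is free
	defer func() { <-g.sem }() // release the slot when done

	g.load(b)

	g.mu.Lock()
	g.loaded[b] = true
	delete(g.queued, b)
	g.mu.Unlock()
}

func main() {
	g := newGateway(2, func(ULID) { time.Sleep(10 * time.Millisecond) })
	blocks := []ULID{"block-A", "block-B", "block-C"}

	fmt.Println(g.TriggerBlockLoad(blocks)) // nothing is loaded yet
	g.wg.Wait()                             // wait for the background loads
	fmt.Println(g.TriggerBlockLoad(blocks)) // every block is now loaded
}
```

With a limit of 2, at most two of the three loads run at the same time; the third waits for a free slot.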
Context
The store-gateway can lazily load the index header of a block when that block is requested by the querier. The store-gateway loses these loaded index headers upon restart and also unloads them when they haven't been queried for an idle period (3h by default). Loading one index header can take between seconds and minutes.
Related to https://github.com/grafana/mimir/issues/4762
Problem
When the store-gateway crashes, is rescheduled on a new node, or is rolled out with a new version, it loses the index headers. This means that subsequent queries for the blocks of these index headers will suffer a latency increase. But it's also possible that store-gateways in other zones have this index header already loaded.
Proposal
Change the querier to do a pre-request check in store-gateways for which replicas have the requested blocks already loaded.
- the querier sends a TriggerBlockLoad([]ULID) map[ULID]bool (naming suggestions welcome) RPC to each of the 3 store-gateways concurrently
- the response of TriggerBlockLoad is a map from blockID to whether the block is already loaded; if a block isn't already loaded the store-gateway starts lazy-loading its index header
Alternatives