grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

store-gateway: reduce latency impact due to index header lazy-loading #4763

Open dimitarvdimitrov opened 1 year ago

dimitarvdimitrov commented 1 year ago

Context

The store-gateway can lazily load a block's index header the first time a querier requests that block. It loses all loaded index headers on restart, and it also unloads any index header that hasn't been queried for an idle period (3h by default). Loading a single index header can take anywhere from seconds to minutes.

Related to https://github.com/grafana/mimir/issues/4762

Problem

When the store-gateway crashes, is rescheduled on a new node, or is rolled out with a new version, it loses its loaded index headers. Subsequent queries for the blocks of these index headers then suffer a latency increase while the headers are reloaded. However, store-gateways in other zones may already have the same index headers loaded.

Proposal

Change the querier to run a pre-request check against the store-gateways to find out which replicas already have the requested blocks loaded.
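
A rough sketch of how the querier could use the result of such a check, assuming it already knows the candidate replicas per block. All names here (`BlockID`, `pickReplicas`, the `loaded` map shape) are made up for illustration and are not part of Mimir.

```go
package main

// BlockID stands in for a block ULID; a plain string keeps the sketch dependency-free.
type BlockID string

// pickReplicas chooses one store-gateway per block. replicas maps each block to its
// candidate store-gateway addresses; loaded[addr][block] is the (hypothetical)
// result of the pre-request check for that gateway and block.
func pickReplicas(replicas map[BlockID][]string, loaded map[string]map[BlockID]bool) map[BlockID]string {
	chosen := make(map[BlockID]string, len(replicas))
	for block, addrs := range replicas {
		if len(addrs) == 0 {
			continue
		}
		chosen[block] = addrs[0] // fallback: first candidate if nobody has it loaded
		for _, addr := range addrs {
			if loaded[addr][block] {
				chosen[block] = addr // prefer a replica that reports the block as loaded
				break
			}
		}
	}
	return chosen
}
```

If no replica reports the block as loaded, the sketch falls back to the first candidate, so the query still proceeds and only pays the lazy-loading cost in that case.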

Alternatives

56quarters commented 6 months ago
  • the querier sends a TriggerBlockLoad([]ULID) map[ULID]bool (naming suggestions welcome) RPC to each of the 3 store-gateways concurrently

Maybe we could move this up a level to make a call to all store-gateways that will be involved in the query (instead of making a request to 3 of them for each block)?

Naming: I've heard "pre-flight checks" used for something similar in the past.

  • to be less aggressive the store-gateway can enqueue the blocks which we are lazy-loading. This can also be controlled via blocks-storage.bucket-store.meta-sync-concurrency

Did you mean -blocks-storage.bucket-store.index-header.lazy-loading-concurrency ?

dimitarvdimitrov commented 6 months ago
  • the querier sends a TriggerBlockLoad([]ULID) map[ULID]bool (naming suggestions welcome) RPC to each of the 3 store-gateways concurrently

Maybe we could move this up a level to make a call to all store-gateways that will be involved in the query (instead of making a request to 3 of them for each block)?

Today only one store-gateway per block is involved in the query. I think we'd need to involve all 3 replicas for every block in the query. That way we can choose the most ready store-gateway.

  • to be less aggressive the store-gateway can enqueue the blocks which we are lazy-loading. This can also be controlled via blocks-storage.bucket-store.meta-sync-concurrency

Did you mean -blocks-storage.bucket-store.index-header.lazy-loading-concurrency ?

Yes. This has already been done since this issue was opened.