grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

store-gateway: store sparse index headers in object store #8166

Open dimitarvdimitrov opened 1 month ago

dimitarvdimitrov commented 1 month ago

Background

https://github.com/grafana/mimir/issues/5046

Problem

This manifests when scaling out store-gateways: new store-gateways download index headers to disk when they start, before becoming ready. However, the sparse index headers are only constructed when a block is lazily loaded and are local to a store-gateway replica. This means that a new store-gateway replica doesn't have any sparse index headers right after starting up and has to load the index header and build the sparse header from the data on disk before it can serve requests for a given block. This leads to severe latency increases after scaling out store-gateways.

Proposal

  1. In store-gateways, download sparse index headers when adding a new block (either during periodic syncs or at startup). This solves the scale-out problem.
  2. If the sparse header doesn't exist in the object store, construct it and upload it.

Compactor or store-gateway

Part 2 can be done by either the compactor or the store-gateway. Doing it in the store-gateway would require less code restructuring and can be a good first iteration.

The possible complication is that the store-gateway now has to load the index header when it downloads a new block that no other store-gateway has synced before. While the store-gateway should unload it immediately after constructing the sparse index header, this can increase the time periodic syncs take. The situation shouldn't be worse than today, where this time penalty is paid at query time. If the compactor did this instead, having to load the index header would similarly add to its compaction cycle latency.
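To make the first iteration concrete, here's a minimal sketch of what the sync-time path in the store-gateway could look like. The interfaces and helper names (`DownloadSparseHeader`, `LoadIndexHeader`, `UploadSparseHeader`, `BuildSparseHeader`) are invented for illustration rather than Mimir's actual API; only the order of operations follows the proposal above.

```go
// Hypothetical sketch only: these interfaces and names are not Mimir's real API.
package storegateway

import (
	"context"
	"errors"
	"fmt"
)

// ErrNotFound stands in for "the object doesn't exist in the bucket".
var ErrNotFound = errors.New("sparse index header not found")

type SparseHeader struct{} // opaque, sampled subset of the index header

type IndexHeader interface {
	BuildSparseHeader() SparseHeader
	Close() error
}

// blockStore abstracts the bucket client and the local index-header machinery.
type blockStore interface {
	DownloadSparseHeader(ctx context.Context, blockID string) error
	LoadIndexHeader(ctx context.Context, blockID string) (IndexHeader, error)
	UploadSparseHeader(ctx context.Context, blockID string, h SparseHeader) error
}

// syncSparseHeader runs when the store-gateway adds a block (at startup or
// during a periodic sync): prefer downloading a precomputed sparse header,
// and only fall back to building it from the full index header if it's missing.
func syncSparseHeader(ctx context.Context, s blockStore, blockID string) error {
	err := s.DownloadSparseHeader(ctx, blockID)
	if err == nil {
		return nil // fast path: the compactor or another replica already uploaded it
	}
	if !errors.Is(err, ErrNotFound) {
		return fmt.Errorf("download sparse header for block %s: %w", blockID, err)
	}

	// Slow path: load the full index header, sample it into a sparse header,
	// upload the result, and unload the header right away so the cost stays
	// bounded to the duration of the sync.
	hdr, err := s.LoadIndexHeader(ctx, blockID)
	if err != nil {
		return fmt.Errorf("load index header for block %s: %w", blockID, err)
	}
	defer hdr.Close()

	return s.UploadSparseHeader(ctx, blockID, hdr.BuildSparseHeader())
}
```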

GroovyCarrot commented 2 weeks ago

This really bit me recently as well. We'd had 27 TB of data rack up in the block store and then tried to start store-gateway nodes; they basically never started or reported ready because the index data was so huge.

Also, I found the helm chart uses the default pod management policy for the store-gateway StatefulSet, which blocks spinning up any additional nodes until the previous one has reported ready. I think podManagementPolicy: Parallel would fix this, though I didn't try it since we decided just to destroy the bucket and start fresh. I'd expect that change to let multiple nodes spin up and then decide which tenants/tokens they are responsible for, rather than one node starting up and thinking it needs to index everything before anything else is allowed to start.

I think it makes sense for the compactor to do this as it is changing the index anyway when it runs?

Is it possible to optimise this by compiling an index per day, or something? And then store those indices for lazy loading by the store-gateways if a query is run for that period? It seems like you could then start a store-gateway node and it could start taking queries practically straight away?

dimitarvdimitrov commented 2 weeks ago

I think you're bringing up another problem. When the store-gateway starts, it downloads from the bucket the index headers for the blocks that shard to it. Figuring out which blocks shard to it is fast, but downloading the index headers from the bucket is slow. It's better to do this before becoming ready; otherwise, this latency would hit queries.
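As a rough sketch of that startup sequence (the type and field names are made up for illustration, not Mimir's actual code):

```go
package storegateway

import "context"

// gateway is a stand-in for Mimir's store-gateway; the fields are hypothetical.
type gateway struct {
	ownedBlocks         func() []string                                 // fast: ring/shard lookup
	downloadIndexHeader func(ctx context.Context, blockID string) error // slow: bucket download
}

// starting mirrors the behaviour described above: the replica only reports
// ready after the index headers for all of its blocks have been downloaded,
// so the download latency is paid once at startup rather than on first queries.
func (g *gateway) starting(ctx context.Context) error {
	for _, blockID := range g.ownedBlocks() {
		if err := g.downloadIndexHeader(ctx, blockID); err != nil {
			return err
		}
	}
	return nil
}
```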

Also I found the helm chart uses the default pod management policy for the storegateway, and will block spinning up any additional nodes until the previous one has reported ready. I think podManagementPolicy: Parallel will fix this

This is the other problem. It's already configurable, but making it the default is a breaking change, so we've been saving this for helm chart 6.0 (#4560)

This issue (#8166) is about what happens next: sampling the index header when a query comes in. The sampled version is called the "sparse index header" and is also persisted on disk today. Sampling requires reading (effectively) the full index header from disk with a lot of random reads, which is why it's slow, and the sparse header is only computed lazily. This issue suggests computing it in the compactor and quickly downloading it in the store-gateway, instead of having to sample the index header whenever the sparse index header is not already on disk.
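In other words, once this lands there would be three places a sparse header can come from, roughly in this order of preference. The names below are invented for the sketch, not Mimir's actual functions:

```go
package storegateway

import "context"

type SparseHeader struct{} // sampled subset of the index header

// sources is a stand-in for the store-gateway's lookup paths; all hypothetical.
type sources struct {
	fromDisk        func(blockID string) (SparseHeader, bool)                        // persisted by a previous load or sync
	fromBucket      func(ctx context.Context, blockID string) (SparseHeader, error)  // precomputed and uploaded (this issue)
	fromIndexHeader func(ctx context.Context, blockID string) (SparseHeader, error)  // today's only fallback: slow, lots of random reads
}

func (s sources) sparseHeaderFor(ctx context.Context, blockID string) (SparseHeader, error) {
	if h, ok := s.fromDisk(blockID); ok {
		return h, nil
	}
	if h, err := s.fromBucket(ctx, blockID); err == nil {
		return h, nil
	}
	return s.fromIndexHeader(ctx, blockID)
}
```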

Is it possible to optimise this by compiling an index per day, or something?

Blocks are already split into 24h ranges; if you're using the split-and-merge compactor, there can even be multiple blocks per 24h range.

dimitarvdimitrov commented 6 days ago

Some notes from the comments in the PR: it won't actually be that hard to let the compactor create sparse headers and upload them.

I chatted with @pstibrany and he suggested doing this at the end of BucketCompactor.runCompactionJob so that we don't fail compactions if sparse headers can't be uploaded. It makes sense to still keep the ability to create sparse headers in the store-gateways so they are more autonomous and don't depend on the compactor for performance.
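Roughly, the compactor-side change could look like the sketch below. The helper name and interface are invented; only the placement (at the end of the compaction job) and the error handling (log and continue rather than fail the job) follow the suggestion above.

```go
package compactor

import (
	"context"
	"log"
)

// sparseHeaderUploader is a hypothetical interface; Mimir's real code would
// reuse its existing index-header and bucket types instead.
type sparseHeaderUploader interface {
	BuildAndUploadSparseHeader(ctx context.Context, blockID string) error
}

// uploadSparseHeaders would be called at the end of a compaction job with the
// IDs of the blocks that job just produced.
func uploadSparseHeaders(ctx context.Context, u sparseHeaderUploader, newBlockIDs []string) {
	for _, blockID := range newBlockIDs {
		if err := u.BuildAndUploadSparseHeader(ctx, blockID); err != nil {
			// Best effort: a failed upload must not fail the compaction; the
			// store-gateway can still build the sparse header itself.
			log.Printf("failed to upload sparse index header for block %s: %v", blockID, err)
		}
	}
}
```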

Worth noting that the compactors should upload these sparse headers for new blocks only, so as not to create a huge backlog when a new Mimir version is deployed. But store-gateways should still be able to construct sparse headers themselves if those aren't available in the bucket.