grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

panic: chunk ID exceeds 3 bytes During OOO Merge #4541

Open pearlriver opened 1 year ago

pearlriver commented 1 year ago

Describe the bug

Mimir crashed with the error "panic: chunk ID exceeds 3 bytes" during OOO merge.

To Reproduce

Steps to reproduce the behavior:

  1. Start Mimir Standalone v2.6 in ECS, have out_of_order_time_window set to 30m
  2. Perform operations (read/write/others). The metric ingestion rate is about 1.1 million samples per hour.

Expected behavior

Mimir should not crash during OOO merge.

Environment

Log:

> /__w/mimir/mimir/vendor/github.com/grafana/dskit/concurrency/runner.go:45 +0x14d
> github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).populateBlock(0xc000f26d80, {0xc304061830, 0x1, 0x45?}, 0x186ecddbc00, 0x186ed4b9900, {0xc304061860, 0x1, 0x1})
> /__w/mimir/mimir/vendor/github.com/grafana/dskit/concurrency/runner.go:36 +0x125
> /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/db.go:1214 +0x34a
> /__w/mimir/mimir/pkg/ingester/ingester.go:2121 +0x3e5
> /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/head_read.go:405 +0xba5
> /__w/mimir/mimir/pkg/ingester/user_tsdb.go:149
>  
> github.com/prometheus/prometheus/tsdb.(*populateWithDelChunkSeriesIterator).Next(0xc2307b4360)
> /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/db.go:1170 +0x7a
> panic: chunk ID exceeds 3 bytes
> github.com/prometheus/prometheus/tsdb.(*DB).compactOOOHead(0xc00191e1e0)
> github.com/prometheus/prometheus/tsdb/chunks.NewHeadChunkRef(...)
> /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/compact.go:694 +0x259
> /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/querier.go:544 +0x16d
> github.com/grafana/dskit/concurrency.ForEachUser.func1()
> /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/compact.go:817 +0x973
> github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).write(0xc000f26d80, {0xc0014bad38, 0x13}, {0xc304061860?, 0x1, 0x1}, {0xc304061830, 0x1, 0x1})
> /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/chunks/chunks.go:72
> /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/compact.go:1089 +0x19bf
> github.com/prometheus/prometheus/tsdb.(*DB).Compact(0xc00191e1e0)
> github.com/grafana/mimir/pkg/ingester.(*Ingester).compactBlocks.func1({0xc406823b71?, 0xc600534f18?}, {0xc000d6f340, 0x8})
> github.com/prometheus/prometheus/tsdb.OOOHeadChunkReader.Chunk({0xc00072d680?, 0x80?, 0xc0012a0800?}, {0x4ca725c1, {0x0, 0x0}, 0x186ecdd8ac9, 0x186ed4bd407, 0x4cd7353c, 0x186ed4b952a, ...})
> /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/querier.go:694 +0x35
> github.com/prometheus/prometheus/tsdb.(*populateWithDelGenericSeriesIterator).next(0xc2307b4360)
> github.com/prometheus/prometheus/tsdb.(*DB).compactOOO(0xc00191e1e0, {0xc0014bad38, 0x13}, 0xc1a4d4de50)
> github.com/prometheus/prometheus/tsdb.(*memSeries).oooMergedChunk(0xc000e3a410, {0x4ca725c1, {0x0, 0x0}, 0x186ecdd8ac9, 0x186ed4bd407, 0x4cd7353c, 0x186ed4b952a, 0x186ed4bd407}, {0x2759300, ...}, ...)
> created by github.com/grafana/dskit/concurrency.ForEachUser
> github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).Write(0xc000f26d80, {0xc0014bad38, 0x13}, {0x2747cc0, 0xc081b38640}, 0x186ecddbc00, 0x186ed4b9900, 0x0)
> /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/db.go:1130 +0x627
> /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/ooo_head_read.go:250 +0x105
> github.com/grafana/mimir/pkg/ingester.(*userTSDB).Compact(...)
> goroutine 40756721 [running]:

YAML config:

target: "all"

multitenancy_enabled: false

no_auth_tenant: "xxxx"

blocks_storage:
  backend: s3
  s3:
    region: us-east-1
    endpoint: s3.us-east-1.amazonaws.com
    bucket_name: xxxx
  bucket_store:
    sync_dir: /tsdb/tsdb-sync
  tsdb:
    dir: /tsdb/tsdb

compactor:
  data_dir: /tsdb/compactor
  sharding_ring:
    kvstore:
      store: memberlist

distributor:
  ring:
    kvstore:
      store: memberlist

ingester:
  ring:
    kvstore:
      store: memberlist
    replication_factor: 1

querier:
  timeout: 2m

server:
  http_listen_port: 9009
  log_level: error

store_gateway:
  sharding_ring:
    replication_factor: 1

limits:
  out_of_order_time_window: 30m
  ingestion_rate: 10000000
  ingestion_burst_size: 20000000
  max_global_series_per_user: 0

pracucci commented 1 year ago

Thanks for your report. We fixed some bugs related to OOO ingestion in Mimir 2.7. Could you upgrade to it and let us know if the issue is fixed, please?

pearlriver commented 1 year ago

We did try the latest Mimir version from yesterday:

# TYPE mimir_build_info gauge
mimir_build_info{branch="HEAD",goversion="go1.20.1",revision="dbe4ccd",version="2.7.1"} 1

and added the block_ranges_period setting under tsdb, hoping it would help:

tsdb:
  dir: /tsdb/tsdb
  block_ranges_period: [2h]

but it still crashes with the same error:

panic: chunk ID exceeds 3 bytes

aknuds1 commented 1 year ago

Thanks for clarifying that @pearlriver, we will try to get to the bottom of this.

aknuds1 commented 1 year ago

@pearlriver does the stack trace look any different with Mimir 2.7?

aknuds1 commented 1 year ago

@jesusvazquez is looking into the problem here of chunk IDs exceeding 3 bytes.

jesusvazquez commented 1 year ago

:wave: I'm looking into this but it's a non-trivial bug.

I have a couple of questions.

  • When you say OOO merge, do you mean OOO compaction?

  • What's the rate of ingest of out-of-order samples in your instance? Is it high? You can get the rate of ingestion, as a percentage of total samples received, with a query like:

sum(rate(cortex_ingester_tsdb_out_of_order_samples_appended_total[$__rate_interval])) * 100
/
sum(rate(cortex_ingester_ingested_samples_total[$__rate_interval]))

This bug is most likely in the Prometheus TSDB code, not in Mimir; I need some time to find out what's happening.

jesusvazquez commented 1 year ago

Also, could you please provide information about this metric: sum(rate(cortex_ingester_tsdb_head_chunks_removed_total[$__rate_interval]))

Try to include a previous compaction that may have worked and one that failed in the chart.

If you are using multiple tenants, please group by user.

jesusvazquez commented 1 year ago

Our working theory right now is that you have a high intake of out-of-order samples and your head has so many chunks to compact that it overflows the 3 bytes used to reference them: https://github.com/prometheus/prometheus/blob/38fa151a7cf962376e91dbf7e1aded605ea98f9f/tsdb/ooo_head_read.go#L113

While designing out-of-order support we decided to flush chunks at 32 samples, so it makes sense that a high rate of OOO ingestion produces a lot of chunks. We chose 32 samples because out-of-order chunks hold uncompressed samples, so a smaller capacity reduces memory pressure. However, this is a configurable parameter through the config blocks-storage.tsdb.out-of-order-capacity-max, and you can increase it based on your needs.
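
For reference, a sketch of where that setting would go in a config like the one above (assuming the YAML key corresponding to the -blocks-storage.tsdb.out-of-order-capacity-max flag; please double-check the exact key against the Mimir configuration reference for your version):

blocks_storage:
  tsdb:
    dir: /tsdb/tsdb
    out_of_order_capacity_max: 64  # example value; default is 32, maximum is 255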

Before we get ahead of ourselves, let's first confirm what your ingest rate is and how many chunks you're producing/removing on truncation, to make sure this is the case.

pearlriver commented 1 year ago

@pearlriver does the stack trace look any different with Mimir 2.7?

Pretty much the same:

created by github.com/grafana/dskit/concurrency.ForEachUser
/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/querier.go:694 +0x35
/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/db.go:1130 +0x627
github.com/prometheus/prometheus/tsdb.(*DB).compactOOO(0xc001928000, {0xc0006ad6c8, 0x13}, 0xc07cf12f50)
github.com/prometheus/prometheus/tsdb.(*DB).Compact(0xc001928000)
github.com/prometheus/prometheus/tsdb.(*memSeries).oooMergedChunk(0xc0046a7860, {0x1eae72384, {0x0, 0x0}, 0x1870a0c4811, 0x1870a2a0f14, 0x1eaf606d1, 0x1870a2a0252, 0x1870a2a0f14}, {0x2759300, ...}, ...)
panic: chunk ID exceeds 3 bytes
github.com/prometheus/prometheus/tsdb.OOOHeadChunkReader.Chunk({0xc000f3a000?, 0x80?, 0x46db4e?}, {0x1eae72384, {0x0, 0x0}, 0x1870a0c4811, 0x1870a2a0f14, 0x1eaf606d1, 0x1870a2a0252, ...})
/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/compact.go:694 +0x259
/__w/mimir/mimir/vendor/github.com/grafana/dskit/concurrency/runner.go:36 +0x125
/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/db.go:1214 +0x34a
github.com/grafana/dskit/concurrency.ForEachUser.func1()
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).populateBlock(0xc000e24700, {0xc12ea77830, 0x1, 0x45?}, 0x1870a0c7000, 0x1870a435e80, {0xc12ea77860, 0x1, 0x1})

/__w/mimir/mimir/vendor/github.com/grafana/dskit/concurrency/runner.go:45 +0x14d
goroutine 97228896 [running]:
/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/querier.go:544 +0x16d
/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/compact.go:817 +0x973
/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/db.go:1170 +0x7a
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).write(0xc000e24700, {0xc0006ad6c8, 0x13}, {0xc12ea77860?, 0x1, 0x1}, {0xc12ea77830, 0x1, 0x1})
/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/ooo_head_read.go:250 +0x105
github.com/grafana/mimir/pkg/ingester.(*userTSDB).Compact(...)
/__w/mimir/mimir/pkg/ingester/user_tsdb.go:149
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).Write(0xc000e24700, {0xc0006ad6c8, 0x13}, {0x2747cc0, 0xc0bfc0f400}, 0x1870a0c7000, 0x1870a435e80, 0x0)
github.com/prometheus/prometheus/tsdb.(*DB).compactOOOHead(0xc001928000)
/__w/mimir/mimir/pkg/ingester/ingester.go:2121 +0x3e5
/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/head_read.go:405 +0xba5
github.com/grafana/mimir/pkg/ingester.(*Ingester).compactBlocks.func1({0x2744060?, 0xc00005a0e0?}, {0xc000d0ce90, 0x8})
/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/compact.go:1089 +0x19bf
github.com/prometheus/prometheus/tsdb/chunks.NewHeadChunkRef(...)
github.com/prometheus/prometheus/tsdb.(*populateWithDelGenericSeriesIterator).next(0xc00e02a5a0)
/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/chunks/chunks.go:72
github.com/prometheus/prometheus/tsdb.(*populateWithDelChunkSeriesIterator).Next(0xc00e02a5a0)

pearlriver commented 1 year ago

  • When you say OOO merge, do you mean OOO compaction?

As you have a better understanding of the workflow, that's likely correct. I just took the oooMergedChunk function name from the log for the title when I said OOO merge.

  • What's the rate of ingest of out-of-order samples in your instance? Is it high? You can get the rate of ingestion, as a percentage of total samples received, with a query like:

sum(rate(cortex_ingester_tsdb_out_of_order_samples_appended_total[$__rate_interval])) * 100
/
sum(rate(cortex_ingester_ingested_samples_total[$__rate_interval]))

60 - 70, it looks like (chart screenshot attached).

Also, could you please provide information about this metric: sum(rate(cortex_ingester_tsdb_head_chunks_removed_total[$__rate_interval]))

(chart screenshot attached)

Try to include a previous compaction that may have worked and one that failed in the chart.

Unfortunately we can't include a failed compaction, as we didn't bind a volume to the container, so metrics are gone if they haven't been pushed to S3.

While designing out-of-order support we decided to flush chunks at 32 samples, so it makes sense that a high rate of OOO ingestion produces a lot of chunks. We chose 32 samples because out-of-order chunks hold uncompressed samples, so a smaller capacity reduces memory pressure. However, this is a configurable parameter through the config blocks-storage.tsdb.out-of-order-capacity-max, and you can increase it based on your needs.

Would a higher out-of-order capacity max, let's say the maximum of 255, reduce chunk ID consumption compared to the default of 32?

jesusvazquez commented 1 year ago

60 - 70

That's a lot of out-of-order!! Nice!

So Prometheus uses 3 bytes to reference all chunks; that's a design limitation. That means we have 2^24 = 16,777,216 possible chunk IDs for each series during the lifetime of a TSDB instance. After a restart you start from 0 again.
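
For context on where that limit comes from: a head chunk reference packs the series reference and the per-series chunk ID into a single 64-bit value, and the panic fires when the chunk ID no longer fits in its 3 bytes. Below is a simplified, self-contained sketch of that packing (names and types simplified by me; the real logic lives in tsdb/chunks/chunks.go, the frame shown in the stack trace):

package main

import "fmt"

// newHeadChunkRef sketches how a TSDB head chunk reference packs a series
// reference (upper bits) and a per-series chunk ID (lower 3 bytes) into a
// single uint64. With only 3 bytes available, a series can reference at
// most 1<<24 = 16,777,216 head chunks during the lifetime of the instance.
func newHeadChunkRef(seriesRef, chunkID uint64) uint64 {
	if chunkID > (1<<24)-1 {
		panic("chunk ID exceeds 3 bytes") // the panic reported in this issue
	}
	return seriesRef<<24 | chunkID
}

func main() {
	fmt.Println("max chunk IDs per series:", 1<<24) // 16777216
	fmt.Printf("ref = %#x\n", newHeadChunkRef(42, (1<<24)-1))
}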

We have not yet encountered a use case where all those chunks are used; to us 2^24 chunks sounds like enough. So I have some more questions: is your out-of-order scattered across multiple series, or would you say that you have most of the out-of-order in the same series, i.e. 1 or 2 series receiving most of the out-of-order samples?

If that were the case we have some options:

  • Is there a way to reduce out-of-order traffic? We don't know much about your use case; why do you have so much out-of-order? Sometimes the solution could come from where the traffic originates.

  • You can increment the capacity max setting up to 255 so you use fewer chunks. It would reduce chunk ID consumption. This has a limit though: if you were to ingest even more traffic you'll hit this limitation again.

At this point we can only provide guidance on the above options; rather than a bug, we think it's a limitation. It would be helpful to understand more about your use case in order to do so.

About the 3-byte limitation: it can only be solved in prometheus/prometheus, so let's see if we can find a different solution to your problem first; if there isn't a proper one, we/you can consider filing an issue in the Prometheus tracker about removing it. Hope this sounds reasonable.

pearlriver commented 1 year ago

So Prometheus uses 3 bytes to reference all chunks; that's a design limitation. That means we have 2^24 = 16,777,216 possible chunk IDs for each series during the lifetime of a TSDB instance. After a restart you start from 0 again.

Would the chunk ID be reset at any point without Mimir having to crash, like after OOO compaction finishes? If it never resets, then wouldn't any percentage of out-of-order samples use up all the chunk IDs in the long run?

Is your out-of-order scattered across multiple series, or would you say that you have most of the out-of-order in the same series, i.e. 1 or 2 series receiving most of the out-of-order samples? Is there a way to reduce out-of-order traffic? We don't know much about your use case; why do you have so much out-of-order? Sometimes the solution could come from where the traffic originates.

It's across multiple series. A service can have multiple containers, and each of them sends metrics to Mimir individually (to the same series, i.e. no distinction at the container level; all of them use the same service label), so out-of-order samples can happen very easily in a distributed world.

You can increment the capacity max setting up to 255 so you use fewer chunks. It would reduce chunk ID consumption. This has a limit though: if you were to ingest even more traffic you'll hit this limitation again.

We have set the OOO capacity to 255 and it has been stable for the last week. We would like to understand more about the chunk ID limitation.

  1. What's the max ingestion rate of out-of-order samples that would overflow the 2^24 chunk IDs?
  2. Is the chunk ID limit per series, per tenant, or per Mimir instance? Would having more tenants help? Or would a microservices deployment with more ingesters help reduce chunk ID consumption?
  3. Is there any config that would make OOO compaction more frequent? (Assuming the chunk ID is reset after compaction.)

At this point we can only provide guidance on the above options; rather than a bug, we think it's a limitation.

If it's a limitation, then it should be mentioned in the Mimir documentation, e.g. in the production tips or somewhere similar.

jesusvazquez commented 1 year ago

Would the chunk ID be reset at any point without Mimir having to crash, like after OOO compaction finishes? If it never resets, then wouldn't any percentage of out-of-order samples use up all the chunk IDs in the long run?

It never resets during the execution of an instance. The in-order path has the same limitation: 2^24, or 16,777,216, is a limit per series, per in-order or out-of-order list of chunks. At a scrape interval of 15 seconds an instance can run for a very long time before hitting it (in-order head chunks are cut at 120 samples by default, so 2^24 chunks corresponds to centuries of data for a single series at that interval), so it has never been an issue for us before.

I should add that the reason it never resets is that the head chunk in the TSDB gets an incremental ID, since older chunks keep being referenced.

If it never resets, then wouldn't any percentage of out-of-order samples use up all the chunk IDs in the long run?

This is true for both out-of-order chunks and in-order chunks: if the instance is never restarted, it will eventually run out of IDs. I don't have the context from when this was designed, but since Prometheus is pull-based it can only go as fast as the scrape interval you configure, and I guess it was assumed the instance would eventually restart due to updates or just pods moving around.

What's the max ingestion rate of out-of-order samples that would overflow the 2^24 chunk IDs?

Every 255 new samples a new chunk ID is used, because the head chunk is flushed and memory-mapped to disk. You can read how this works in the out-of-order design doc, under "Part 1: Ingestion of out-of-order samples and memory mapping".

Then we can do some quick math: 2^24 * 255 = 4,278,190,080 out-of-order samples can be ingested on a single series on a single instance before you run out of IDs. The time you have before you run out then varies depending on your ingestion pattern.
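
As a rough illustration of that arithmetic (the 100 out-of-order samples per second on a single series below is a purely hypothetical rate, not a number from this issue):

package main

import "fmt"

func main() {
	const (
		maxChunkIDs      = 1 << 24 // 3-byte chunk ID space per series
		samplesPerChunk  = 255     // out-of-order chunk capacity at its maximum setting
		oooSamplesPerSec = 100.0   // hypothetical out-of-order rate for ONE series
	)

	// Total out-of-order samples a single series can take before running out of IDs.
	totalSamples := float64(maxChunkIDs) * samplesPerChunk // 4,278,190,080

	days := totalSamples / oooSamplesPerSec / 86400
	fmt.Printf("samples before exhausting chunk IDs: %.0f\n", totalSamples)
	fmt.Printf("time to exhaustion at %.0f OOO samples/s on one series: ~%.0f days\n",
		oooSamplesPerSec, days)
}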

Is the chunk ID limit per series, per tenant, or per Mimir instance? Would having more tenants help? Or would a microservices deployment with more ingesters help reduce chunk ID consumption?

The chunk ID limit is per series. Let me make a clarification here that I hope will help: Mimir has a multi-tenancy feature that opens a TSDB per tenant, but the limitation we're talking about is in the Prometheus TSDB, not in Mimir. So, to be explicit, having more tenants or more ingesters won't help here.

Is there any config that would make OOO compaction more frequent? (Assuming the chunk ID is reset after compaction.)

Compaction does not reset the chunk ID, as I said earlier, so this wouldn't help.

If it's a limitation, then it should be mentioned in the Mimir documentation, e.g. in the production tips or somewhere similar.

That makes sense, thank you. I'll see what I can do to update the documentation.