Open kubicgruenfeld opened 1 year ago
:wave: We're looking into this.
We think it could be a bug in Prometheus where the TSDB does not properly unset the OOO write-behind log when OOO is disabled.
Could you please clarify how you disabled the out-of-order window? Did you submit an override disabling OOO for a tenant, or did you modify Mimir's ingester config and roll it out?
Any chance you could also share your Mimir config? I'm particularly interested in the value of -blocks-storage.tsdb.wal-segment-size-bytes, in case you set an override.
> Could you please clarify how you disabled the out-of-order window? Did you submit an override disabling OOO for a tenant, or did you modify Mimir's ingester config and roll it out?
We enabled and later disabled it globally, and rolled out each change.
> Any chance you could also share your Mimir config? I'm particularly interested in the value of -blocks-storage.tsdb.wal-segment-size-bytes, in case you set an override.
This is our whole tsdb config; we didn't change anything there from the Helm defaults:
tsdb:
  dir: /data/tsdb
  block_ranges_period:
    - 2h0m0s
  retention_period: 24h0m0s
  ship_interval: 1m0s
  ship_concurrency: 10
  head_compaction_interval: 1m0s
  head_compaction_concurrency: 1
  head_compaction_idle_timeout: 1h0m0s
  head_chunks_write_buffer_size_bytes: 4194304
  head_chunks_end_time_variance: 0
  stripe_size: 16384
  wal_compression_enabled: false
  wal_segment_size_bytes: 134217728
  flush_blocks_on_shutdown: false
  close_idle_tsdb_timeout: 13h0m0s
  memory_snapshot_on_shutdown: false
  head_chunks_write_queue_size: 1000000
  series_hash_cache_max_size_bytes: 1073741824
  max_tsdb_opening_concurrency_on_startup: 10
  out_of_order_capacity_max: 32
Thanks for looking into it!
cc @kubicgruenfeld This PR is working towards fixing the bug in Prometheus: https://github.com/prometheus/prometheus/pull/11962. Once it is merged, we will be able to fix this in Mimir, but probably only in a future version.
From the context you provided in the ticket, I see that you disabled out-of-order ingestion because you started getting MimirIngesterTSDBHeadCompactionFailed errors. That is a separate issue from this one, and for it I recommend upgrading to Mimir 2.6.0, because another Prometheus bug was fixed between Mimir 2.5.0 and Mimir 2.6.0: https://github.com/prometheus/prometheus/pull/11623. So I think the best course of action for you would be to upgrade to Mimir 2.6.0 and re-enable out-of-order ingestion.
On our side, we will work on fixing this new bug. Thanks for reporting it; we will let you know once it is fixed.
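For reference, the two ways of setting the out-of-order window discussed in this thread (a global default versus a per-tenant override) might look roughly like the sketch below. It assumes Mimir's `out_of_order_time_window` limit; the `10m` value and the tenant ID `tenant-a` are illustrative placeholders that do not come from this thread, and a value of `0s` is assumed to disable out-of-order ingestion.

```yaml
# Main Mimir configuration file: global default applied to all tenants.
# The 10m value is an illustrative assumption; 0s disables out-of-order ingestion.
limits:
  out_of_order_time_window: 10m
---
# Runtime configuration file (a separate file): per-tenant override.
# "tenant-a" is a hypothetical tenant ID used only for illustration.
overrides:
  tenant-a:
    out_of_order_time_window: 10m
```

The global form corresponds to what the reporter describes (changing the setting and rolling out the ingesters); the override form is the alternative the maintainers asked about.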
Describe the bug
We enabled the out-of-order window a few days ago. After getting the MimirIngesterTSDBHeadCompactionFailed alert today, we saw that the metric cortex_ingester_tsdb_compactions_failed_total has been showing failures since the out-of-order window was enabled. This behaviour may be related to #4160.
We then tried disabling the out-of-order window again, but this resulted in segfaults in our ingesters:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
No segfaults; out-of-order TSDB blocks should be ignored or removed.
Environment
Additional Context