Segfault when disabling out of order window

kubicgruenfeld commented 1 year ago

Describe the bug

We enabled the out-of-order window a few days ago. Today the MimirIngesterTSDBHeadCompactionFailed alert fired, and we saw that the metric cortex_ingester_tsdb_compactions_failed_total has been showing failures since we enabled the out-of-order window. This behaviour may be related to #4160.

We then tried disabling the out-of-order window again, but this resulted in segfaults in our ingesters:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xfe2c7d]

goroutine 420 [running]:
github.com/prometheus/prometheus/tsdb/wal.(*WAL).NextSegmentSync(0x0)
    /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/wal/wal.go:478 +0x3d
github.com/prometheus/prometheus/tsdb.NewOOOCompactionHead(0xc000698fc0)
    /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/ooo_head_read.go:276 +0x3a
github.com/prometheus/prometheus/tsdb.(*DB).compactOOOHead(0xc000280a50)
    /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/db.go:1136 +0x4f
github.com/prometheus/prometheus/tsdb.(*DB).Compact(0xc000280a50)
    /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/db.go:1101 +0x627
github.com/grafana/mimir/pkg/ingester.(*Ingester).createTSDB(0xc0009f5500, {0xc0005c87d3, 0x9})
    /__w/mimir/mimir/pkg/ingester/ingester.go:1613 +0xab6
github.com/grafana/mimir/pkg/ingester.(*Ingester).openExistingTSDB.func1()
    /__w/mimir/mimir/pkg/ingester/ingester.go:1703 +0x165
golang.org/x/sync/errgroup.(*Group).Go.func1()
    /__w/mimir/mimir/vendor/golang.org/x/sync/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
    /__w/mimir/mimir/vendor/golang.org/x/sync/errgroup/errgroup.go:72 +0xa5

To Reproduce

Steps to reproduce the behavior:

1. Start Mimir 2.5
2. Enable the out-of-order window
3. Push out-of-order series
4. Disable the out-of-order window

Expected behavior

No segfaults; out-of-order TSDB blocks should be ignored or removed.

Environment

Infrastructure: Kubernetes
Deployment tool: helm

Additional Context

jesusvazquez commented 1 year ago

:wave: We're looking into this.

We think it could be a bug in Prometheus where the TSDB does not properly unset the OOO write-behind log if OOO is disabled.

Could you please clarify how you disabled the out-of-order window? Did you submit an override disabling OOO for a tenant, or did you modify Mimir's ingester config and roll it out?

jesusvazquez commented 1 year ago

Any chance you could also share your mimir config? I'm particularly interested in the value of -blocks-storage.tsdb.wal-segment-size-bytes in case you set an override.

kubicgruenfeld commented 1 year ago

> Could you please clarify how did you disable the out-of-order window? Did you do it by submitting an override and disabling OOO for a tenant or did you modify the mimir's ingester config and rolled them out?

We enabled and disabled it globally and rolled the change out.

> Any chance you could also share your mimir config? I'm particularly interested in the value of -blocks-storage.tsdb.wal-segment-size-bytes in case you set an override.

This is our whole tsdb config; we didn't change anything there from the helm defaults:

    tsdb:
        dir: /data/tsdb
        block_ranges_period:
            - 2h0m0s
        retention_period: 24h0m0s
        ship_interval: 1m0s
        ship_concurrency: 10
        head_compaction_interval: 1m0s
        head_compaction_concurrency: 1
        head_compaction_idle_timeout: 1h0m0s
        head_chunks_write_buffer_size_bytes: 4194304
        head_chunks_end_time_variance: 0
        stripe_size: 16384
        wal_compression_enabled: false
        wal_segment_size_bytes: 134217728
        flush_blocks_on_shutdown: false
        close_idle_tsdb_timeout: 13h0m0s
        memory_snapshot_on_shutdown: false
        head_chunks_write_queue_size: 1000000
        series_hash_cache_max_size_bytes: 1073741824
        max_tsdb_opening_concurrency_on_startup: 10
        out_of_order_capacity_max: 32

Thanks for looking into it!

jesusvazquez commented 1 year ago

cc @kubicgruenfeld This PR is working towards removing the bug from Prometheus: https://github.com/prometheus/prometheus/pull/11962. Once it is merged, we will be able to solve this in Mimir, but probably only in future versions.

From the context you provided in the ticket, I see that you disabled out-of-order because you started getting MimirIngesterTSDBHeadCompactionFailed errors. That is a separate issue from this one, and for it I recommend upgrading to Mimir 2.6.0, because another Prometheus bug was fixed between Mimir 2.5.0 and Mimir 2.6.0: https://github.com/prometheus/prometheus/pull/11623. So I think the best course of action for you would be to upgrade to Mimir 2.6.0 and re-enable out-of-order ingestion.

On our side, we will work towards solving this new bug. Thanks for reporting it. We will let you know once it is solved.
