quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

Random Crash in Bitpacking/Columnar when Merging Segments #2384

Closed: fe9lix closed this issue 2 months ago

fe9lix commented 5 months ago

Describe the bug
The following crash happens "randomly" in merge_thread_0. (I tried to symbolicate the relevant stack-trace addresses from the crash log via atos):

tantivy_columnar::columnar::writer::column_operation::ColumnOperation$LT$V$GT$::deserialize::ha05e99b13a0c8619 (in server) + 48
_$LT$serde..__private..de..content..ContentRefDeserializer$LT$E$GT$$u20$as$u20$serde..de..Deserializer$GT$::deserialize_identifier::h757ada0bd20fa7b5 (in server) + 616
bitpacking::bitpacker4x::neon::pack_unpack_with_bits_20::unpack::h3f79e33fb6d87813 (in server) + 1928
HUFv05_readDTableX4 (in server) + 996
_$LT$alloc..vec..drain..Drain$LT$T$C$A$GT$$u20$as$u20$core..ops..drop..Drop$GT$::drop::h9aae7105e6a8c162 (in server) + 24
bitpacking::bitpacker4x::scalar::pack_unpack_with_bits_13::unpack::h453c2adfdc5e1fdc (in server) + 2256

Some more context:

[profile.release]
strip = "debuginfo"
opt-level = 3
lto = true
codegen-units = 1
panic = "unwind"

The panic produces the following backtrace:

thread 'thrd-tantivy-index2' panicked at /Users/runner/.cargo/.../bitpacker/src/bitpacker.rs:79:9:
assertion failed: num_bits <= 7 * 8 || num_bits == 64
stack backtrace:
thread 'thrd-tantivy-index0' panicked at /Users/runner/.cargo/git/.../bitpacker/src/bitpacker.rs:79:9:
assertion failed: num_bits <= 7 * 8 || num_bits == 64
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic
   3: <tantivy_columnar::column_values::u64_based::blockwise_linear::BlockwiseLinearEstimator as tantivy_columnar::column_values::u64_based::ColumnCodecEstimator>::serialize
   4: tantivy_columnar::column_values::u64_based::serialize_u64_based_column_values
   5: tantivy_columnar::columnar::writer::ColumnarWriter::serialize
   6: tantivy::indexer::segment_writer::SegmentWriter::finalize

All of this makes me think it must have something to do with the conditional compilation in the bitpacking crate. For instance, the CI machine (but not the local machine) additionally supports the SSE3 instruction set. However, this shouldn't matter, since the CPU feature detection at runtime should select the right implementation, so it doesn't really explain why the CI build would behave differently; yet the stack trace points to this area. I wonder what I could be missing.
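
For reference, this is the general shape of the runtime-dispatch pattern I mean (a generic sketch, not the bitpacking crate's actual code; unpack, unpack_sse3 and unpack_scalar are made-up names): the SIMD path is only taken when the CPU reports the feature at runtime, regardless of what the build machine supported.

// Generic runtime-dispatch sketch; the function names are placeholders.
fn unpack(block: &[u8], out: &mut [u32], num_bits: u8) {
    #[cfg(target_arch = "x86_64")]
    {
        // Only take the SIMD path if the CPU actually reports SSE3 at runtime.
        if is_x86_feature_detected!("sse3") {
            return unpack_sse3(block, out, num_bits);
        }
    }
    unpack_scalar(block, out, num_bits);
}

#[cfg(target_arch = "x86_64")]
fn unpack_sse3(block: &[u8], out: &mut [u32], num_bits: u8) {
    // Stand-in for a SIMD implementation.
    unpack_scalar(block, out, num_bits);
}

fn unpack_scalar(_block: &[u8], _out: &mut [u32], _num_bits: u8) {
    // Stand-in for the portable implementation.
}

fn main() {
    let block = vec![0u8; 16];
    let mut out = vec![0u32; 4];
    unpack(&block, &mut out, 20);
}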

Some questions:

Which version of tantivy are you using? 0.22.0 stable (but also happened on master between 0.21 and 0.22)

To Reproduce
Unfortunately, there's no clear reproduction; the bug seems to happen intermittently at some point while adding a large number of documents to the index, when segments are being merged (judging by the stack trace of the crashed merge_thread_0).
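
For what it's worth, a minimal stress-test sketch of the kind of workload described, using the public tantivy API as I understand it around 0.22; this is my own attempt, not a confirmed reproduction, and the field name, document count, and memory budget are arbitrary:

use tantivy::schema::{Schema, FAST};
use tantivy::{doc, Index, IndexWriter};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    // A single fast (columnar) u64 field, since the panic is in the columnar writer.
    let val = schema_builder.add_u64_field("val", FAST);
    let index = Index::create_in_ram(schema_builder.build());
    let mut writer: IndexWriter = index.writer(100_000_000)?;
    for i in 0..2_000_000u64 {
        writer.add_document(doc!(val => i))?;
        // Commit periodically so the merge policy has segments to merge.
        if i % 100_000 == 0 {
            writer.commit()?;
        }
    }
    writer.commit()?;
    // Block until the background merge threads finish.
    writer.wait_merging_threads()?;
    Ok(())
}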

PSeitz commented 5 months ago

Thanks for the bug report. The first stacktrace from atos is garbled; it doesn't make sense.

The second stacktrace does make sense, but looking at the code, I don't see how that could happen. It uses tantivy_bitpacker, which does not use any SIMD code in that path.

Can you run it with a modified version of tantivy? I could push some changes to a branch to get more context information when the panic occurs. Is the GH repo public?

fulmicoton commented 5 months ago

The code you point at is NOT using SIMD. SIMD bitpacking uses a different layout that prevents efficient random access, and we need random access for columns.
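
To illustrate (a toy sketch of mine, not tantivy's actual BitUnpacker): with a fixed bit width, value i starts at bit i * bit_width, so it can be fetched with a single unaligned 8-byte read plus a shift and a mask. That is presumably also where the failing assertion comes from: the shift within the first byte can be up to 7 bits, so a single u64 read only covers widths up to 7 * 8 = 56 bits, with 64 handled as a special case.

// Toy random-access read from a bitpacked buffer (illustration only).
// Assumes bit_width <= 56 and that the buffer is padded so 8 bytes can
// always be read starting at byte_pos.
fn get_bitpacked(data: &[u8], idx: usize, bit_width: u8) -> u64 {
    assert!(bit_width as u32 <= 7 * 8);
    let bit_pos = idx * bit_width as usize;
    let byte_pos = bit_pos / 8;
    let bit_shift = bit_pos % 8; // at most 7
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&data[byte_pos..byte_pos + 8]);
    let word = u64::from_le_bytes(bytes);
    let mask = (1u64 << bit_width) - 1;
    (word >> bit_shift) & mask
}

fn main() {
    // 4 values of 20 bits each = 80 bits = 10 bytes, padded for the 8-byte reads.
    let data = [0u8; 18];
    assert_eq!(get_bitpacked(&data, 3, 20), 0);
}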

Like @PSeitz, I have no clue where it could come from.

There are two calls to BitUnpacker::new in this file. The first one happens during deserialization of the existing column. It is probably called as we consume the dyn Iterator, but I think the stacktrace would have looked different if that were the one being hit.

The second one happens on code that is rather straightforward.

bit_width is obtained via:

let bit_width = buffer.iter().copied().map(compute_num_bits).max().unwrap();

pub fn compute_num_bits(n: u64) -> u8 {
    let amplitude = (64u32 - n.leading_zeros()) as u8;
    if amplitude <= 64 - 8 {
        amplitude
    } else {
        64
    }
}

As far as I can tell, every number exiting this function, and therefore their max, should satisfy the assert. If it is easy to reproduce, we can add a couple of asserts here and there to get more info on the value triggering the assert.
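
To double-check that claim, here is a quick standalone sketch (mine, not part of tantivy's test suite) that exercises compute_num_bits over every bit position plus the edge cases, against the failing assertion num_bits <= 7 * 8 || num_bits == 64:

pub fn compute_num_bits(n: u64) -> u8 {
    let amplitude = (64u32 - n.leading_zeros()) as u8;
    if amplitude <= 64 - 8 {
        amplitude
    } else {
        64
    }
}

fn main() {
    for shift in 0..64u32 {
        let n = 1u64 << shift;
        let bits = compute_num_bits(n);
        assert!(bits <= 7 * 8 || bits == 64, "n={n} bits={bits}");
    }
    // Edge cases: 0 needs 0 bits, u64::MAX rounds up to a full 64-bit word.
    assert_eq!(compute_num_bits(0), 0);
    assert_eq!(compute_num_bits(u64::MAX), 64);
    println!("all outputs satisfy the bitpacker assertion");
}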

If you can share the segments and the schema triggering the assert, that would be even more helpful. Rust 1.78 shipped an LLVM upgrade, so it could even be an actual compiler bug.

Also, do you have some kind of compiler cache in your CI? Can you try cleaning it or disabling it and see if the problem goes away?

fe9lix commented 5 months ago

Thanks for getting back to me. Sorry that the first stack trace is not more helpful; it's just the atos output for a couple of memory addresses from the crash log, to see whether bitpacking showed up somewhere.

None of it makes sense to me either, so I tried comparing the build environments, and the difference in CPU features was the only thing I could spot so far. (But, as far as I understand, SIMD is not involved, and even if there were compile-time differences, that wouldn't explain the different runtime behavior, because of the feature detection.)

I have now tried running a CI build again with the old config and got an interesting new permutation of the crash (right after running the binary on the command line): "53287 illegal hardware instruction", so the binary tries to run a machine instruction that is not supported by the CPU.
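
To narrow this down, a small diagnostic sketch (nothing tantivy-specific, just my own check): compare which features the binary was compiled to assume against what the CPU reports at runtime; a mismatch on the compile-time side is the kind of thing that ends in an illegal-instruction trap.

fn main() {
    #[cfg(target_arch = "aarch64")]
    {
        println!("compiled assuming neon: {}", cfg!(target_feature = "neon"));
        println!(
            "cpu reports neon:       {}",
            std::arch::is_aarch64_feature_detected!("neon")
        );
    }
    #[cfg(target_arch = "x86_64")]
    {
        println!("compiled assuming sse3: {}", cfg!(target_feature = "sse3"));
        println!("cpu reports sse3:       {}", is_x86_feature_detected!("sse3"));
    }
}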

Here's the build config:

rust-toolchain.toml

[toolchain]
channel = "1.76.0"
targets = ["aarch64-apple-darwin", "x86_64-apple-darwin"]

Cargo.toml

[profile.release]
strip = "debuginfo"
opt-level = 3
lto = true
codegen-units = 1
panic = "unwind"

A build with toolchain 1.78.0 and panic = "abort" as the only release-profile option has been running stably so far.

Re CI and caching: there should be a fresh GitHub runner for each run, and we don't do any custom caching in the action. For the toolchain setup we've been using this action, then set up cargo make via this action, run cargo clean, and then build via:

command = "cargo"
args = [
    "build",
    "--release",
    "--target",
    "aarch64-apple-darwin",
    "--locked",
    "-p",
    "server",

I'll take another look at the GitHub Actions to see if they could have any effect.

fe9lix commented 2 months ago

I haven't been able to reproduce it any more since we removed third-party actions from our GitHub CI workflow. (Btw, we've found that the macOS runner comes with the Rust toolchain by default, so there's no need for additional setup actions.)