quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

Random Crash in Bitpacking/Columnar when Merging Segments #2384

Closed: fe9lix closed this issue 2 months ago

fe9lix commented 5 months ago

Describe the bug
The following crash happens "randomly" in merge_thread_0. (I tried to symbolicate the relevant stack-trace addresses from the crash log via atos):

tantivy_columnar::columnar::writer::column_operation::ColumnOperation$LT$V$GT$::deserialize::ha05e99b13a0c8619 (in server) + 48
_$LT$serde..__private..de..content..ContentRefDeserializer$LT$E$GT$$u20$as$u20$serde..de..Deserializer$GT$::deserialize_identifier::h757ada0bd20fa7b5 (in server) + 616
bitpacking::bitpacker4x::neon::pack_unpack_with_bits_20::unpack::h3f79e33fb6d87813 (in server) + 1928
HUFv05_readDTableX4 (in server) + 996
_$LT$alloc..vec..drain..Drain$LT$T$C$A$GT$$u20$as$u20$core..ops..drop..Drop$GT$::drop::h9aae7105e6a8c162 (in server) + 24
bitpacking::bitpacker4x::scalar::pack_unpack_with_bits_13::unpack::h453c2adfdc5e1fdc (in server) + 2256

Some more context:

[profile.release]
strip = "debuginfo"
opt-level = 3
lto = true
codegen-units = 1
panic = "unwind"

The panic produces the following backtrace:

thread 'thrd-tantivy-index2' panicked at /Users/runner/.cargo/.../bitpacker/src/bitpacker.rs:79:9:
assertion failed: num_bits <= 7 * 8 || num_bits == 64
stack backtrace:
thread 'thrd-tantivy-index0' panicked at /Users/runner/.cargo/git/.../bitpacker/src/bitpacker.rs:79:9:
assertion failed: num_bits <= 7 * 8 || num_bits == 64
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic
   3: <tantivy_columnar::column_values::u64_based::blockwise_linear::BlockwiseLinearEstimator as tantivy_columnar::column_values::u64_based::ColumnCodecEstimator>::serialize
   4: tantivy_columnar::column_values::u64_based::serialize_u64_based_column_values
   5: tantivy_columnar::columnar::writer::ColumnarWriter::serialize
   6: tantivy::indexer::segment_writer::SegmentWriter::finalize

All of this makes me think it must have something to do with the conditional compilation in the bitpacking crate. For instance, the CI machine (but not the local machine) additionally supports the SSE3 instruction set. However, this shouldn't matter, since the CPU feature detection at runtime should select the right implementation, so it doesn't really explain why the CI build would behave differently; yet the stack trace points to this area. I wonder what I could be missing.
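
For reference, this is the general shape of the runtime-dispatch pattern I mean (a generic sketch, not the bitpacking crate's actual code; unpack, unpack_sse3 and unpack_scalar are made-up names): the SIMD path is only taken when the CPU reports the feature at runtime, regardless of what the build machine supported.

// Generic runtime-dispatch sketch; the function names are placeholders.
fn unpack(block: &[u8], out: &mut [u32], num_bits: u8) {
    #[cfg(target_arch = "x86_64")]
    {
        // Only take the SIMD path if the CPU actually reports SSE3 at runtime.
        if is_x86_feature_detected!("sse3") {
            return unpack_sse3(block, out, num_bits);
        }
    }
    unpack_scalar(block, out, num_bits);
}

#[cfg(target_arch = "x86_64")]
fn unpack_sse3(block: &[u8], out: &mut [u32], num_bits: u8) {
    // Stand-in for a SIMD implementation.
    unpack_scalar(block, out, num_bits);
}

fn unpack_scalar(_block: &[u8], _out: &mut [u32], _num_bits: u8) {
    // Stand-in for the portable implementation.
}

fn main() {
    let block = vec![0u8; 16];
    let mut out = vec![0u32; 4];
    unpack(&block, &mut out, 20);
}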

Some questions:

Which version of tantivy are you using? 0.22.0 stable (but also happened on master between 0.21 and 0.22)

To Reproduce
Unfortunately, there's no clear reproduction; the bug seems to happen intermittently at some point while adding a large number of documents to the index, when segments are being merged (judging by the stack trace of the crashed merge_thread_0).
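
For what it's worth, a minimal stress-test sketch of the kind of workload described, using the public tantivy API as I understand it around 0.22; this is my own attempt, not a confirmed reproduction, and the field name, document count, and memory budget are arbitrary:

use tantivy::schema::{Schema, FAST};
use tantivy::{doc, Index, IndexWriter};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    // A single fast (columnar) u64 field, since the panic is in the columnar writer.
    let val = schema_builder.add_u64_field("val", FAST);
    let index = Index::create_in_ram(schema_builder.build());
    let mut writer: IndexWriter = index.writer(100_000_000)?;
    for i in 0..2_000_000u64 {
        writer.add_document(doc!(val => i))?;
        // Commit periodically so the merge policy has segments to merge.
        if i % 100_000 == 0 {
            writer.commit()?;
        }
    }
    writer.commit()?;
    // Block until the background merge threads finish.
    writer.wait_merging_threads()?;
    Ok(())
}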

PSeitz commented 5 months ago

Thanks for the bug report. The first stacktrace from atos is garbled; it doesn't make sense.

The second stacktrace does make sense, but looking at the code, I don't see how that could happen. It uses tantivy_bitpacker, which does not use any SIMD code in that path.

Can you run it with a modified version of tantivy? I could push some changes to a branch to get more context information when the panic occurs. Is the GH repo public?

fulmicoton commented 5 months ago

The code you point at is NOT using SIMD. SIMD bitpacking uses a different layout that prevents efficient random access, and we need random access for columns.
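
To illustrate (a toy sketch of mine, not tantivy's actual BitUnpacker): with a fixed bit width, value i starts at bit i * bit_width, so it can be fetched with a single unaligned 8-byte read plus a shift and a mask. That is presumably also where the failing assertion comes from: the shift within the first byte can be up to 7 bits, so a single u64 read only covers widths up to 7 * 8 = 56 bits, with 64 handled as a special case.

// Toy random-access read from a bitpacked buffer (illustration only).
// Assumes bit_width <= 56 and that the buffer is padded so 8 bytes can
// always be read starting at byte_pos.
fn get_bitpacked(data: &[u8], idx: usize, bit_width: u8) -> u64 {
    assert!(bit_width as u32 <= 7 * 8);
    let bit_pos = idx * bit_width as usize;
    let byte_pos = bit_pos / 8;
    let bit_shift = bit_pos % 8; // at most 7
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&data[byte_pos..byte_pos + 8]);
    let word = u64::from_le_bytes(bytes);
    let mask = (1u64 << bit_width) - 1;
    (word >> bit_shift) & mask
}

fn main() {
    // 4 values of 20 bits each = 80 bits = 10 bytes, padded for the 8-byte reads.
    let data = [0u8; 18];
    assert_eq!(get_bitpacked(&data, 3, 20), 0);
}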

Like @PSeitz, I have no clue where it could come from.

There are two calls to BitUnpacker::new in this file. The first one happens during deserialization of the existing column. It is probably called as we consume the dyn Iterator, but I think the stacktrace would have looked different if that were the one being hit.

The second one happens on code that is rather straightforward.

bit_width is obtained via:

let bit_width = buffer.iter().copied().map(compute_num_bits).max().unwrap();

pub fn compute_num_bits(n: u64) -> u8 {
    let amplitude = (64u32 - n.leading_zeros()) as u8;
    if amplitude <= 64 - 8 {
        amplitude
    } else {
        64
    }
}

As far as I can tell, every number exiting this function, and therefore their max, should satisfy the assert. If it is easy to reproduce, we can add a couple of asserts here and there to get more info on the value triggering the assert.
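
To double-check that claim, here is a quick standalone sketch (mine, not part of tantivy's test suite) that exercises compute_num_bits over every bit position plus the edge cases, against the failing assertion num_bits <= 7 * 8 || num_bits == 64:

pub fn compute_num_bits(n: u64) -> u8 {
    let amplitude = (64u32 - n.leading_zeros()) as u8;
    if amplitude <= 64 - 8 {
        amplitude
    } else {
        64
    }
}

fn main() {
    for shift in 0..64u32 {
        let n = 1u64 << shift;
        let bits = compute_num_bits(n);
        assert!(bits <= 7 * 8 || bits == 64, "n={n} bits={bits}");
    }
    // Edge cases: 0 needs 0 bits, u64::MAX rounds up to a full 64-bit word.
    assert_eq!(compute_num_bits(0), 0);
    assert_eq!(compute_num_bits(u64::MAX), 64);
    println!("all outputs satisfy the bitpacker assertion");
}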

If you can share the segments and the schema triggering the assert, that would be even more helpful. Rust 1.78 shipped an LLVM upgrade, so it could even be an actual compiler bug.

Also, do you have some kind of compiler cache in your CI? Can you try cleaning it or disabling it and see if the problem goes away?

fe9lix commented 5 months ago

Thanks for getting back to me. Sorry that the first stack trace is not more helpful; it's just the atos output for a couple of memory addresses from the crash log, to see whether bitpacking showed up somewhere.

None of it makes sense to me either, so I tried comparing the build environments, and the difference in CPU features was the only thing I could spot so far. (But, as far as I understand, SIMD is not involved, and even if there were compile-time differences, that wouldn't explain the different runtime behavior, because of the feature detection.)

I have now tried running a CI build again with the old config and got an interesting new permutation of the crash (right after running the binary on the command line): "53287 illegal hardware instruction", so the binary tries to run a machine instruction that is not supported by the CPU.
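
To narrow this down, a small diagnostic sketch (nothing tantivy-specific, just my own check): compare which features the binary was compiled to assume against what the CPU reports at runtime; a mismatch on the compile-time side is the kind of thing that ends in an illegal-instruction trap.

fn main() {
    #[cfg(target_arch = "aarch64")]
    {
        println!("compiled assuming neon: {}", cfg!(target_feature = "neon"));
        println!(
            "cpu reports neon:       {}",
            std::arch::is_aarch64_feature_detected!("neon")
        );
    }
    #[cfg(target_arch = "x86_64")]
    {
        println!("compiled assuming sse3: {}", cfg!(target_feature = "sse3"));
        println!("cpu reports sse3:       {}", is_x86_feature_detected!("sse3"));
    }
}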

Here's the build config:

rust-toolchain.toml

[toolchain]
channel = "1.76.0"
targets = ["aarch64-apple-darwin", "x86_64-apple-darwin"]

Cargo.toml

[profile.release]
strip = "debuginfo"
opt-level = 3
lto = true
codegen-units = 1
panic = "unwind"

A build with toolchain 1.78.0 and panic = "abort" as the only release-profile option has been running stably so far.

Re CI and caching: there should be a fresh GitHub runner for each run, and we don't do any custom caching in the action. For the toolchain setup we've been using this action, then set up cargo make via this action, run cargo clean, and then build via:

command = "cargo"
args = [
    "build",
    "--release",
    "--target",
    "aarch64-apple-darwin",
    "--locked",
    "-p",
    "server",

I'll take another look at the GitHub Actions to see if they could have any effect.

fe9lix commented 2 months ago

I haven't been able to reproduce it any more since we removed third-party actions from our GitHub CI workflow. (Btw, we've found that the macOS runner comes with the Rust toolchain by default, so there's no need for additional setup actions.)