lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0
3.85k stars 212 forks

Offset overflow errors can be confusing for users #2775

Open westonpace opened 1 month ago

westonpace commented 1 month ago

When using binary or string columns, a single batch cannot contain more than 2GiB of data in that column, because these types use 32-bit offsets. Users will need to either use large_binary and large_string or set a smaller custom batch size when reading this data.
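For readers unfamiliar with where the 2GiB figure comes from, here is a minimal sketch of the arithmetic. It assumes the standard Arrow layout, where string and binary arrays store signed 32-bit offsets into one values buffer (the large_ variants use 64-bit offsets). The helper names `fits_in_i32_offsets` and `max_rows_per_batch` are hypothetical and are not part of Lance or arrow-rs.

```rust
/// Largest total byte length addressable with signed 32-bit offsets:
/// 2^31 - 1 bytes, i.e. just under 2 GiB.
pub const MAX_SMALL_OFFSET: u64 = i32::MAX as u64;

/// Hypothetical helper: true when `total_bytes` of string/binary values
/// can be stored in a single array that uses 32-bit offsets.
pub fn fits_in_i32_offsets(total_bytes: u64) -> bool {
    total_bytes <= MAX_SMALL_OFFSET
}

/// Hypothetical helper: a rough upper bound on rows per batch for a
/// string/binary column, given an estimated average value size in bytes.
/// Real data varies per row, so treat the result as a ceiling, not a target.
pub fn max_rows_per_batch(avg_value_bytes: u64) -> u64 {
    if avg_value_bytes == 0 {
        // Empty values never advance the offset, so no byte-based limit applies.
        u64::MAX
    } else {
        MAX_SMALL_OFFSET / avg_value_bytes
    }
}
```

For example, with values averaging 1 MiB each, `max_rows_per_batch(1024 * 1024)` gives a ceiling of about two thousand rows, so a batch size much larger than that is likely to overflow.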

However, the error they run into, an "offset overflow" error, is a panic (not great) and very confusing. It is not obvious that the solution is to reduce the batch size:

thread 'lance_background_thread' panicked at .../arrow-data-52.2.0/src/transform/utils.rs:42:56:
offset overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at .../rust/lance-encoding/src/decoder.rs:1267:65:
called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(12814), ...)

Ideally we should return an Err here instead of panicking, and the message should say something like "Could not create array with more than 2GiB of string/binary data. Please try reducing the batch_size."
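As a rough illustration of the behavior proposed above, the sketch below builds 32-bit offsets with an explicit bounds check and returns an error carrying the suggested message instead of panicking. This is not the actual Lance or arrow-rs code path; `OffsetOverflow` and `build_offsets` are hypothetical names used only for this example.

```rust
use std::fmt;

/// Hypothetical error type for this sketch (not a Lance or arrow-rs type).
#[derive(Debug, PartialEq)]
pub struct OffsetOverflow {
    /// Total number of value bytes that the array would have needed.
    pub attempted_bytes: u64,
}

impl fmt::Display for OffsetOverflow {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Could not create array with more than 2GiB of string/binary data \
             (needed {} bytes). Please try reducing the batch_size.",
            self.attempted_bytes
        )
    }
}

impl std::error::Error for OffsetOverflow {}

/// Builds the offsets buffer for a string/binary array from per-value
/// byte lengths, returning `Err` rather than panicking on overflow.
pub fn build_offsets(value_lengths: &[usize]) -> Result<Vec<i32>, OffsetOverflow> {
    let mut offsets = Vec::with_capacity(value_lengths.len() + 1);
    offsets.push(0i32);
    let mut total: u64 = 0;
    for &len in value_lengths {
        total = total.saturating_add(len as u64);
        // The running total must stay representable as a signed 32-bit offset.
        if total > i32::MAX as u64 {
            return Err(OffsetOverflow { attempted_bytes: total });
        }
        offsets.push(total as i32);
    }
    Ok(offsets)
}
```

A caller can then propagate the error with `?` or surface its `Display` text to the user, which points directly at the batch size as the thing to change.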

broccoliSpicy commented 1 month ago

@klibiadam looks like you have a very good start with this issue!

Would you like me to assign this issue to you?