When using `binary` or `string` columns, a single batch of data cannot contain more than 2GiB of data. Users either need to use `large_binary` and `large_string` or set a custom batch size when reading this data.
However, the error they run into, an "offset overflow" error, is a panic (not great) and very confusing. It is not obvious that the solution is to reduce the batch size:
```
thread 'lance_background_thread' panicked at .../arrow-data-52.2.0/src/transform/utils.rs:42:56:
offset overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at .../rust/lance-encoding/src/decoder.rs:1267:65:
called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(12814), ...)
```
Ideally, we should return an `Err` here (not panic) and the message should say something like "Could not create array with more than 2GiB of string/binary data. Please try reducing the `batch_size`."
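The proposed behavior could look roughly like the sketch below: check the cumulative byte length against the `i32` offset limit before building the array, and return a descriptive `Err` instead of panicking. The `check_string_offsets` function and the error type are hypothetical, not the actual Lance internals.

```rust
// Hypothetical sketch of the proposed guard; not the real Lance API.
// Arrow `string`/`binary` arrays use i32 offsets, so one array (one batch)
// can hold at most i32::MAX (~2GiB) bytes of value data.
const MAX_I32_OFFSET: usize = i32::MAX as usize;

fn check_string_offsets(total_value_bytes: usize) -> Result<(), String> {
    if total_value_bytes > MAX_I32_OFFSET {
        // Descriptive error instead of the current "offset overflow" panic.
        Err(
            "Could not create array with more than 2GiB of string/binary data. \
             Please try reducing the batch_size."
                .to_string(),
        )
    } else {
        Ok(())
    }
}

fn main() {
    // A batch comfortably under the limit is fine.
    assert!(check_string_offsets(1024 * 1024).is_ok());
    // A batch just past i32::MAX bytes gets an actionable error.
    let err = check_string_offsets(MAX_I32_OFFSET + 1).unwrap_err();
    println!("{err}");
}
```

A check like this would let the decoder surface the error through its normal `Result` path instead of aborting the background thread with a `JoinError::Panic`.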