lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

No. Row group? #2540

Open mengjie09 opened 2 months ago

mengjie09 commented 2 months ago

According to the Lance file layout, Lance V2 removes the concept of row groups. How does DataFragment relate to row groups in the code? As I understand it, the DataFragment concept describes how to represent different numbers of rows for different columns. Is this implemented?

wjones127 commented 2 months ago

DataFragment is a table-level concept. It has a fixed number of rows. When you first write data, it typically corresponds to a single data file. This is different from a row group: row groups live inside files, so a single file contains multiple row groups. But Lance V2 doesn't have row groups.

The layout of data fragments is described here: https://lancedb.github.io/lance/format.html#fragments
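
To make the distinction concrete, here is a minimal sketch of the table-level model described above. It is a toy model, not Lance's actual code: the class names mirror the terms in this thread, but the fields, the `add_file` helper, and the file paths are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFile:
    """Hypothetical stand-in for one data file: some columns, one row count."""
    path: str
    columns: List[str]
    num_rows: int


@dataclass
class DataFragment:
    """Hypothetical stand-in for a fragment: a fixed number of rows,
    stored in one or more data files that each hold some of the columns."""
    fragment_id: int
    num_rows: int
    files: List[DataFile] = field(default_factory=list)

    def add_file(self, data_file: DataFile) -> None:
        # Every file attached to a fragment must cover all of its rows.
        if data_file.num_rows != self.num_rows:
            raise ValueError(
                f"{data_file.path} has {data_file.num_rows} rows, "
                f"fragment {self.fragment_id} has {self.num_rows}"
            )
        self.files.append(data_file)


def table_row_count(fragments: List[DataFragment]) -> int:
    """The table is the concatenation of its fragments."""
    return sum(fragment.num_rows for fragment in fragments)


# Fragment 0: written as one file, then a second file adds a column later.
frag0 = DataFragment(fragment_id=0, num_rows=1000)
frag0.add_file(DataFile("data/0.lance", ["id", "text"], 1000))
frag0.add_file(DataFile("data/0-embedding.lance", ["embedding"], 1000))

# Fragment 1: a later append with a different (but still fixed) row count.
frag1 = DataFragment(fragment_id=1, num_rows=250)
frag1.add_file(DataFile("data/1.lance", ["id", "text", "embedding"], 250))

total_rows = table_row_count([frag0, frag1])
```

Note that nothing in this model sits below the file level: the fragment groups files at the table level, which is why it is not a replacement for a row group.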

mengjie09 commented 2 months ago

Thank you. Here's another question. If lance supports different number of rows for different columns, and a DataFragment needs to have the same number of rows, how is that data represented? Is it stored in a single DataFragment, or split across different DataFragments?

wjones127 commented 2 months ago

> If lance supports different number of rows for different columns

Each file must have the same number of rows per column. No row groups means there isn't a smaller unit that is required to have the same number of rows per column.
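
The rule in this reply can be sketched as a small check. The sketch assumes, purely for illustration, that each column in a file is stored as a sequence of independently sized chunks (called "pages" here); the function name and the layout dictionary are hypothetical and are not part of any Lance API.

```python
from typing import Dict, List


def file_row_count(pages_per_column: Dict[str, List[int]]) -> int:
    """Return the file's row count, or raise if the columns disagree.

    `pages_per_column` maps a column name to the row counts of its chunks.
    Only the per-column totals are compared; chunk boundaries are free to
    differ from one column to the next.
    """
    totals = {name: sum(pages) for name, pages in pages_per_column.items()}
    distinct = set(totals.values())
    if len(distinct) > 1:
        raise ValueError(f"columns disagree on row count: {totals}")
    return distinct.pop() if distinct else 0


# A narrow column fits in one chunk; a wide column is split into four.
# The chunk boundaries do not line up, but both columns total 1000 rows.
layout = {
    "id": [1000],
    "embedding": [256, 256, 256, 232],
}
rows = file_row_count(layout)
```

With row groups, every column would have to be cut at the same row boundaries inside the file. Without them, the only alignment required is at the file level, which is what the check above enforces.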