lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Replace the page table #1809

Open wjones127 opened 9 months ago

wjones127 commented 9 months ago

The page table has several flaws, which need to be addressed at some point.

Issues with the current design

Problem 1: The length of the page table is poorly defined

The page table is a table of (offset, length) pairs pointing to pages in the file. It is written as a flat array, where each field has its pages written contiguously.

<field 0 page 0>...<field 0 page N><field 1 page 0>...<field 1 page N>...

The page table is pointed to by a single offset in the manifest: Metadata.page_table_position. This presents the first problem: the length of the page table is poorly defined. The Lance reader needs to infer the correct length of the page table for a given file. Right now, it takes the minimum and maximum of the field ids in the schema and subtracts those to get the number of fields, then multiplies by the number of batches to get the total size of the array. If anything in this calculation is wrong, it's easy to accidentally read past the end of the file, or just read invalid data.
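A minimal sketch of this inference, assuming 16-byte (offset, length) entries (illustrative, not the actual Lance reader code):

fn inferred_page_table_bytes(field_ids: &[i32], num_batches: usize) -> usize {
    const ENTRY_SIZE: usize = 16; // (offset: i64, length: i64)
    let min_id = *field_ids.iter().min().expect("schema has fields");
    let max_id = *field_ids.iter().max().expect("schema has fields");
    // Any mistake in this arithmetic silently reads past the end of
    // the table or returns garbage entries.
    let num_fields = (max_id - min_id + 1) as usize; // inclusive range
    num_fields * num_batches * ENTRY_SIZE
}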

Problem 2: The page table assumes field ids are contiguous and always contain buffers

In Lance, schemas may start out with a contiguous set of field ids that start at 0. However, as schemas evolve, columns may be dropped and holes will develop in the field ids.

Some field types in Lance don’t contain buffers. For example, struct fields themselves don’t contain any buffers. Their children do, but those are separate fields. But because every field needs to have entries in the page table, their entries are filled with (0, 0) (zero offset, zero length).

Because the page table assumes field ids are contiguous and that every field has buffers, in practice it must be padded with these zero entries, wasting significant space.

Problem 3: The page table assumes all fields have same number of pages

The page table assumes that each column has the same number of pages and that pages in each position are equal in length. This same assumption is made in other parts of the format, such as the layout of the page statistics.

However, this presents a problem for optimal performance in Lance: for small columns, such as primitive ones, the optimal page size is large. There is a 5x performance difference between pages of size 10240 and 1024 when scanning. Meanwhile, for large columns, it’s better to have smaller pages, both for scan performance and to reduce the amount of data that must be buffered.

De-coupling the page sizes of different columns would involve more than just changes to the page table, but the current design of the page table notably does not allow this.

Proposed solution

First, make the Metadata also store the length of the page table, so the size can be read directly instead of inferred.

Second, reformat the page table as a protobuf message, so that it can be adapted in the future. Each entry should also store the specific field_id it corresponds to, removing the guesswork of matching entries to fields.

message PageTable {
    repeated FieldPages fields = 1;
}

message FieldPages {
    int32 field_id = 1;
    repeated PageInfo pages = 2;
}

message PageInfo {
    int64 offset = 1;
    int64 length = 2;
}

The structure of the messages means this can be larger than the flat encoding; however, this is offset by the fact that:

  1. We will no longer be filling in zeros for fields that have no pages (struct fields and missing fields)
  2. We can compress this message when serializing, which may bring additional IO savings.

For now, this could keep the constraint that each field must have the same number of pages, but the format allows that constraint to be lifted in the future.
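As a sketch of how a reader might consume the new table, assuming prost-generated structs for the messages above (the types here are illustrative stand-ins, not actual Lance definitions), the entries can be keyed by field_id so that fields without pages simply have no entry:

use std::collections::HashMap;

// Illustrative stand-ins for the prost-generated message types.
struct PageInfo { offset: i64, length: i64 }
struct FieldPages { field_id: i32, pages: Vec<PageInfo> }
struct PageTable { fields: Vec<FieldPages> }

// Index the decoded table by field id. Struct fields and dropped
// field ids are simply absent, instead of rows of (0, 0) entries.
fn index_page_table(table: PageTable) -> HashMap<i32, Vec<(i64, i64)>> {
    table
        .fields
        .into_iter()
        .map(|f| {
            let pages: Vec<(i64, i64)> =
                f.pages.iter().map(|p| (p.offset, p.length)).collect();
            (f.field_id, pages)
        })
        .collect()
}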

eddyxu commented 9 months ago

Originally, the page table was stored directly on disk so that we could take one offset with an O(1), 16-byte disk read.

Can we make sure the new design is also friendly to very sparse takes (i.e., open a file with 1000+ groups and take 1 row out)?

Another minor thing to consider is whether we store the offsets and lengths contiguously, like:

Field {
  int32 field_id = 1;
  repeated int64 offset = 2;
  repeated int64 length = 3;
}

When we have a fixed max_row_per_group, the offset/length arrays can use delta-of-delta encoding, which takes almost no space for many of the field types we care about (int, float, boolean, vector, fixed_shaped_tensor).
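A quick sketch of why that holds: with a fixed-width column and a fixed max_row_per_group, consecutive page offsets grow by a constant stride, so all second-order deltas are zero and compress to almost nothing:

// Delta-of-delta: first differences of the first differences.
fn delta_of_delta(values: &[i64]) -> Vec<i64> {
    let deltas: Vec<i64> = values.windows(2).map(|w| w[1] - w[0]).collect();
    deltas.windows(2).map(|w| w[1] - w[0]).collect()
}

fn main() {
    // e.g. an int64 column with 1024 rows per group: 8 KiB stride per page
    let offsets: Vec<i64> = (0..8).map(|i| i * 8 * 1024).collect();
    assert!(delta_of_delta(&offsets).iter().all(|&d| d == 0));
}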

wjones127 commented 9 months ago

Originally, the page table was stored directly on disk so that we could take one offset with an O(1), 16-byte disk read.

That's good to know. I hadn't considered that part, since I didn't see that in the codebase yet.

eddyxu commented 9 months ago

Btw, you might want to consider compression and nullability support as well. Depending on the final design, there might be another level below the "page" level to get decent random IOs. cc @westonpace

wjones127 commented 9 months ago

Yes, I don't think we can replace the page table unless we know what our plan is for nullability and encodings.