lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Lance file format for graphs #1025

Open patelprateek opened 1 year ago

patelprateek commented 1 year ago

Hi, I want to use the Lance file format to serialize some graphs I have built. I am curious to learn more about how the columnar format with fast random access helps here. For example, I was thinking of using an SSD-backed KV store (e.g. RocksDB); would Lance provide better performance? Does Lance support SSDs and async I/O?

eddyxu commented 1 year ago

We used Lance itself to store the DiskANN graph.

https://github.com/lancedb/lance/blob/main/rust/src/index/vector/graph/persisted.rs.

Lance is optimized for SSD access, but compared to KV stores like RocksDB, Lance is not sorted by key. Whether it is a good fit will depend on how you need to find the entry point of the graph.

Lance has an internal meta field called "_rowid", which is what Lance uses to identify and access a single row quickly.
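
A minimal sketch of that kind of point access from Python, assuming the `lance` Python bindings' `write_dataset` / `dataset` / `take` calls and a `with_row_id` scan option to surface `_rowid`; the path, columns, and data are made up:

```python
# Minimal sketch: point reads by row id with the Lance Python bindings.
# Column names, data, and the path are made up for illustration.
import lance
import pyarrow as pa

table = pa.table({
    "vertex_id": pa.array(range(1000), type=pa.uint64()),
    "label": [f"node-{i}" for i in range(1000)],
})
lance.write_dataset(table, "/tmp/rowid_demo.lance")
ds = lance.dataset("/tmp/rowid_demo.lance")

# Surface the internal _rowid column alongside the data (assumed scan option).
head = ds.to_table(limit=5, with_row_id=True)

# Fast random access: fetch specific rows without scanning the whole dataset.
rows = ds.take([3, 17, 512], columns=["vertex_id", "label"])
```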

patelprateek commented 1 year ago

Thanks for the quick response. Good to know it is optimized for SSD access; I am curious to learn more about how you do that. I can think of a couple of ways and would like to hear about your findings: 1) storing data on the persistent store in an order that requires less I/O (perhaps neighbors in the same SSD block)? 2) using libraries like io_uring?

I am also trying to understand how columnar storage fits a few use cases: graph-based indexes, feature stores (which are typically KV stores, where sorted keys give advantages like prefix compression), and inverted indexes (which are again mostly KV-based segments in libraries like Lucene).

eddyxu commented 1 year ago

So the Lance format itself offers fast random access when you hold "row ids"; with those, it can fetch each row from the full dataset in sub-linear time instead of scanning. Our secondary indexes (an inverted index could be one of them in the future) are "key -> row id" structures stored alongside the main table. For example, the vector index is stored as a Lance file with the schema:

<
  pq_code: fixed_size_list<uint8>,
  row_ids: uint64
>

The query plan thus goes into the index first to obtain a few row ids, and Lance then retrieves the relevant cells for each of those row ids.
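
Roughly, the flow looks like the sketch below, again assuming the Python bindings; the path, column contents, and the index "search" itself are made up, since the real PQ scoring happens inside Lance's query plan:

```python
# Sketch of the "index -> row ids -> take" flow described above.
# Names, paths, and the index "search" are made up; the real PQ scoring
# happens inside Lance.
import lance
import pyarrow as pa

# Main table: the full dataset that the index points back into.
main = pa.table({
    "id": pa.array(range(100), type=pa.uint64()),
    "text": [f"doc-{i}" for i in range(100)],
})
lance.write_dataset(main, "/tmp/index_demo.lance")
ds = lance.dataset("/tmp/index_demo.lance")

# An index file shaped like the schema above: one PQ code per row id.
index_table = pa.table({
    "pq_code": pa.FixedSizeListArray.from_arrays(
        pa.array([1, 2, 3, 4] * 100, type=pa.uint8()), 4),
    "row_ids": pa.array(range(100), type=pa.uint64()),
})

# Step 1 (faked here): the index search scores pq_code and yields candidates.
candidate_row_ids = [5, 42, 87]

# Step 2: point reads of just those rows from the main table.
hits = ds.take(candidate_row_ids, columns=["id", "text"])
```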

storing data on the persistent store in an order that requires less I/O (perhaps neighbors in the same SSD block)?

We store the graph as adjacency lists, where the neighbour row ids are stored close to the "vertex" itself, in the columns.
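
As a rough illustration (the column names here are hypothetical, not the actual fields in persisted.rs), such an adjacency-list row could look like this in PyArrow schema terms:

```python
# Hypothetical adjacency-list layout (column names are illustrative, not the
# actual fields in persisted.rs): each row is a vertex, and the row ids of its
# neighbors sit right next to it as a list column.
import pyarrow as pa

graph_schema = pa.schema([
    ("vector", pa.list_(pa.float32(), 128)),  # fixed_size_list<float32>[128]
    ("neighbors", pa.list_(pa.uint64())),     # row ids of adjacent vertices
])
print(graph_schema)
```

Reading a vertex by row id then brings back its neighbor list in the same point read, which keeps graph traversal friendly to block-sized SSD I/O.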

We also do I/O consolidation to align reads with the 4K block size. We have not used io_uring yet; mostly we just use Rust's tokio to issue async I/O at the coroutine level.

Hopefully this answers your questions.

changhiskhan commented 1 year ago

@patelprateek let us know if you have any additional questions / concerns here. Otherwise is it ok if we close this as completed this coming week?