Open patelprateek opened 1 year ago
We used lance itself to store the diskann graph.
https://github.com/lancedb/lance/blob/main/rust/src/index/vector/graph/persisted.rs.
Lance is optimized for SSD access, but compared to KV stores like rocksdb, Lance does not sorted by key
. It will depend on where you need to find the entry point of the graph.
Lance has an internal meta field called "_rowid", which is how Lance used to identify and access one single row quickly.
thanks for the quick response . good to know its optimized for SSD access , curios to learn more how we do that , i can think of couple ways , want to learn about your findings 1) storing data on persistent stores in an order that requires lesser i/o ? (perhaps neighbors in the same ssd block ) 2) using libraries like io_uring ?
Also I am trying to understand how the columnar storage fits into few use cases ( graph based index , feature store which are typically KV stores (sorted gives some advantage like prefix compression) , inverted index (which is again mostly KV based segments in libraries like lucene)
So lance format itself offers a fast random access when you hold "row ids", with that, it can do sub-linear scan to get each row in the full dataset. Our secondary index (inverted index can be one of them in the future), is a "key -> row id" structure, stored aside the main tables. For example, the vector index stored as a lance file with the schema:
<
pq_code: fixed_size_list<uint8>,
row_ids: uint64
>
Thus the query plan goes into the index, and obtain a few row_ids, Lance then retrives the relevant cells for each of these row ids.
storing data on persistent stores in an order that requires lesser i/o ? (perhaps neighbors in the same ssd block )
We store the graph as adjacent lists, where the neighbour row ids are stored close to the "vertex" itself, in the columns.
Also we do I/O consolidation to make it aligned with 4K block size. We have not used io_uring
yet, mostly just using rust's tokio to issue async I/O at the co-routine level.
Hopefully it answers your quesetions.
@patelprateek let us know if you have any additional questions / concerns here. Otherwise is it ok if we close this as completed this coming week?
Hi I want to use lance file format for serializing some graphs is build. I am curios to learn more about how the columnar format with fast random access helps here ? For example currently i was thinking to use some SSd storage based KV store (ex : rocksdb), would lance provide better performance ? does lance support ssd and async ios ?