lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

No easy way to obtain row index from queries? #3140

Open oceanusxiv opened 2 days ago

oceanusxiv commented 2 days ago

Here's the use case: I wish to perform a random-access take on a dataset, given some indices that were obtained by an earlier query. Effectively, I wish to implement my own sampler. (I can't use the native sampler because I also want to apply index offsets for lookahead and lookbehind, so it does need to be the contiguous index, not the potentially discontinuous row id.)

However, there seems to be no easy way currently to do this. Primarily this seems to be due to an asymmetry in the Python API between the take function, which expects row indices, and all other query functions, which only return row ids, or row addresses.

I realize, of course, that there is a 1-1 mapping between row addresses and row indices, but that mapping is hardly straightforward for the end user to calculate, and it would be super convenient if we could have a with_row_indices option in all our query functions so we can obtain this information without so much hassle.

If I just missed something and such a method exists, do let me know :)

wjones127 commented 1 day ago

You are right, there isn't an easy way. Definitely something we could add. In the meantime, here is a snippet one could use to derive this:

# `ds` is an already-opened Lance dataset, e.g. ds = lance.dataset(uri)
fragment_sizes = [(f.fragment_id, f.count_rows()) for f in ds.get_fragments()]
offsets = {}
offset = 0
for fragment_id, size in fragment_sizes:
    offsets[fragment_id] = offset
    offset += size

def row_addr_to_index(row_addr):
    fragment_id = row_addr >> 32
    row_offset = row_addr & 0xffffffff
    row_index = offsets[fragment_id] + row_offset
    return row_index

row_addrs = ds.to_table(with_row_address=True)['_rowaddr']
row_indices = [row_addr_to_index(row_addr.as_py()) for row_addr in row_addrs]

westonpace commented 1 day ago

If we do end up adding such a feature at some point I would recommend calling it the "dataset offset" and not "row indices" as I think that is a little less ambiguous.
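
For readers who want to see the arithmetic without opening a dataset, here is a minimal standalone sketch of the same mapping, using the "dataset offset" name suggested above. The fragment ids and row counts are made up for illustration, and `build_fragment_offsets`, `make_row_addr`, and `row_addr_to_dataset_offset` are hypothetical helper names, not part of the Lance API. The sketch assumes, as the snippet above does, that a row address packs the fragment id into the upper 32 bits and the in-fragment row offset into the lower 32 bits. It also assumes no rows have been deleted, so that the in-fragment offset lines up with the row's position; whether that holds after deletions is not covered in this thread.

```python
from itertools import accumulate

# Hypothetical fragment layout: (fragment_id, row_count) pairs, made up for
# illustration. In practice these would come from ds.get_fragments().
FRAGMENTS = [(0, 100), (1, 50), (3, 25)]


def build_fragment_offsets(fragments):
    """Map each fragment id to the dataset offset of its first row."""
    # Running total of row counts: [0, 100, 150, 175] for FRAGMENTS above.
    starts = [0, *accumulate(size for _, size in fragments)]
    return {frag_id: start for (frag_id, _), start in zip(fragments, starts)}


def make_row_addr(fragment_id, row_offset):
    """Pack a fragment id and in-fragment offset into a 64-bit row address."""
    return (fragment_id << 32) | row_offset


def row_addr_to_dataset_offset(row_addr, fragment_offsets):
    """Convert a row address into a contiguous dataset offset."""
    fragment_id = row_addr >> 32          # upper 32 bits
    row_offset = row_addr & 0xFFFFFFFF    # lower 32 bits
    return fragment_offsets[fragment_id] + row_offset


offsets = build_fragment_offsets(FRAGMENTS)
addrs = [make_row_addr(0, 7), make_row_addr(1, 0), make_row_addr(3, 24)]
dataset_offsets = [row_addr_to_dataset_offset(a, offsets) for a in addrs]
print(dataset_offsets)  # [7, 100, 174]
```

Note that fragment ids need not be contiguous (there is no fragment 2 in the made-up layout), which is why the lookup is keyed by fragment id rather than by position in the list.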