Open oceanusxiv opened 2 days ago
You are right, there isn't an easy way. This is definitely something we could add. In the meantime, here is a snippet one could use to derive this:
# Row count of each fragment, in the order returned by get_fragments()
fragment_sizes = [(f.fragment_id, f.count_rows()) for f in ds.get_fragments()]

# Dataset offset at which each fragment starts
offsets = {}
offset = 0
for fragment_id, size in fragment_sizes:
    offsets[fragment_id] = offset
    offset += size

def row_addr_to_index(row_addr):
    # Upper 32 bits: fragment id; lower 32 bits: row offset within the fragment
    fragment_id = row_addr >> 32
    row_offset = row_addr & 0xffffffff
    row_index = offsets[fragment_id] + row_offset
    return row_index

row_addrs = ds.to_table(with_row_address=True)['_rowaddr']
row_indices = [row_addr_to_index(row_addr.as_py()) for row_addr in row_addrs]
If we do end up adding such a feature at some point, I would recommend calling it the "dataset offset" rather than "row indices", as I think that is a little less ambiguous.
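For anyone who wants to sanity-check the address arithmetic without opening a dataset, here is a minimal standalone sketch of the same mapping. The fragment ids and sizes are made up, and `make_row_addr` is a hypothetical helper added only for illustration. It also assumes the in-fragment offset lines up with the fragment's row count (for example, that no rows were deleted), which I have not verified for every case.

```python
from itertools import accumulate

# Hypothetical (fragment_id, row_count) pairs, standing in for
# [(f.fragment_id, f.count_rows()) for f in ds.get_fragments()].
fragment_sizes = [(0, 100), (2, 50), (5, 25)]

# Starting dataset offset of each fragment: an exclusive prefix sum of the sizes.
starts = [0, *accumulate(size for _, size in fragment_sizes)][:-1]
offsets = {frag_id: start for (frag_id, _), start in zip(fragment_sizes, starts)}


def make_row_addr(fragment_id, row_offset):
    """Hypothetical helper: pack a fragment id (upper 32 bits) and an
    in-fragment row offset (lower 32 bits) into one integer."""
    return (fragment_id << 32) | row_offset


def row_addr_to_offset(row_addr):
    """Map a row address to a contiguous dataset offset."""
    fragment_id = row_addr >> 32
    row_offset = row_addr & 0xFFFFFFFF
    return offsets[fragment_id] + row_offset


# Row 7 of fragment 2 sits after the 100 rows of fragment 0.
print(row_addr_to_offset(make_row_addr(2, 7)))  # 107
```

Note that the fragment ids need not be consecutive, which is why the lookup goes through a dict keyed by fragment id rather than a list index.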
Here's the use case: I wish to perform a random-access `take` on a dataset, given some indices that were obtained by an earlier query. Effectively, I wish to implement my own sampler. (I can't use the native sampler because I also want to apply index offsets for lookahead and lookbehind, so it needs to be the contiguous index, not the potentially discontinuous row id.)
However, there currently seems to be no easy way to do this. This appears to be primarily due to an asymmetry in the Python API between the `take` function, which expects row indices, and all other query functions, which only return row ids or row addresses.
I realize, of course, that there is a one-to-one mapping between row addresses and row indices, but that mapping is hardly straightforward for the end user to calculate, and it would be super convenient to have a `with_row_indices` option in all our query functions so we can obtain this information without so much hassle.
If I just missed something and such a method exists, do let me know :)
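To make the lookahead and lookbehind part concrete, this is roughly what I have in mind once contiguous indices are available. It is only a sketch: `window_indices` and `windows_for` are hypothetical names, and the dataset size and query hits are made-up numbers.

```python
def window_indices(center, behind, ahead, n_rows):
    """Contiguous indices from center - behind to center + ahead (inclusive),
    clamped to the valid range [0, n_rows)."""
    start = max(center - behind, 0)
    stop = min(center + ahead + 1, n_rows)
    return list(range(start, stop))


def windows_for(centers, behind, ahead, n_rows):
    """Sorted, de-duplicated union of the windows around each center index."""
    merged = set()
    for center in centers:
        merged.update(window_indices(center, behind, ahead, n_rows))
    return sorted(merged)


# Two hypothetical query hits in a 175-row dataset, with 2 rows of
# lookbehind and 1 row of lookahead around each hit.
indices = windows_for([1, 107], behind=2, ahead=1, n_rows=175)
print(indices)  # [0, 1, 2, 105, 106, 107, 108]

# The result would then be handed to the dataset, e.g. ds.take(indices).
```

The offsets only make sense on contiguous indices: adding 1 to a row id or row address does not necessarily land on the next row of the dataset.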