scipp / scippnexus

h5py-like utility for NeXus files with seamless scipp integration
https://scipp.github.io/scippnexus/
BSD 3-Clause "New" or "Revised" License
3 stars 3 forks source link

Support label-based indexing #191

Closed SimonHeybrock closed 7 months ago

SimonHeybrock commented 8 months ago

We have always had in mind to implement label-based indexing like Scipp. This is required in particular when loading multiple groups at once, where positional indexing is often not meaningful.

In fact, we may consider the current behavior a bug, or at least a discrepancy to how scipp.DataGroup behaves: When loading a tree of groups, the dimension length may be ambiguous (indicated by None). ScippNexus nevertheless accepts positional indices. Therefore, we get inconsistent behavior:

with snx.File(name) as f:
    dg1 = f['time', 10:20]  # may return some zero-length subgroups, if indices out of range
    dg2 = f[()]['time', 10:20]  # raises if sizes is {'time': None, ...}

See scipp/scipp#3371.

jokasimr commented 8 months ago

Just so I understand the issue: To do label based indexing we need to go through the entire array, as long as the data is not sorted. Then the benefit of doing label based indexing directly in scippnexus instead of reading the entire array and then doing the label based indexing in scipp is that it lets us do label based indexing on arrays that are larger than RAM. Is that is the motivation for adding the feature?

If that is the case then I suggest implementing it by reading chunks of the data, doing label based indexing on each chunk using scipp, and concatenating the results. Does that make sense? The benefit I see of doing it that way is that we don't need to re-implement the label based indexing logic in scippnexus.

SimonHeybrock commented 8 months ago

To do label based indexing we need to go through the entire array, as long as the data is not sorted.

1.) we only need to go through the coord, and 2.) Scipp currently only supports ordered coords.

We do not have to read chunks of data. I think it is much simpler to assume that the coord dataset is "small" (it might have a lower dimensionality), read that, then read the relevant section of the remaining data.

SimonHeybrock commented 8 months ago

There used to be Python bindings in Scipp for the helper function that takes a coord and returns the indices resulting from label-based indexing with the coord. It might be that it was removed, but it should be simple to add back, if required.

jokasimr commented 7 months ago

There used to be Python bindings in Scipp for the helper function that takes a coord and returns the indices resulting from label-based indexing with the coord. It might be that it was removed, but it should be simple to add back, if required.

Do you remember what that function was / is called? It would be very useful here.

SimonHeybrock commented 7 months ago

The bindings were removed here: https://github.com/scipp/scipp/pull/3200/files#r1290898024. I am relatively sure the underlying C++ has not changed since then, i.e., you can add it back in exactly as removed there.