scipp / scippnexus

h5py-like utility for NeXus files with seamless scipp integration
https://scipp.github.io/scippnexus/
BSD 3-Clause "New" or "Revised" License
3 stars 3 forks source link

feat: label based indexing for nxdata #194

Closed jokasimr closed 4 months ago

jokasimr commented 5 months ago

Fixes #191

SimonHeybrock commented 4 months ago

Can you add the equivalent of https://github.com/scipp/scippnexus/blob/58ffbbe67e90bf3341a7589e89e00e36859ba712/tests/nxdetector_test.py#L279-L310, but with label-based indexing instead of positional indexing? Note that this is a special case because the event_time_zero coord is typically in a subgroup that is handled as a special "event field".

jokasimr commented 4 months ago

I'm unsure about the NXlog case. Perhaps we should talk about that in person.

jokasimr commented 4 months ago

Okay there are a couple of issues with this implementation.

Something seems wrong with label based slicing for DataGroup

Example from the docs:

da = sc.DataArray(
    data=sc.array(dims=['time', 'x'], values=np.random.random((3, 7))),
    coords={
        'x': sc.array(dims=['x'], values=np.linspace(0.1, 0.9, num=7), unit='m'),
        'year': sc.array(dims=['time'], values=[2020, 2023, 2027]),
    },
)
da['year', sc.scalar(2023)].dims # (x,)

everything works as expected for DataArray.

But for DataGroup

dg = sc.DataGroup(
    a=sc.DataArray(
        data=sc.array(dims=['time', 'x'], values=np.random.random((3, 7))),
        coords={
            'x': sc.array(dims=['x'], values=np.linspace(0.1, 0.9, num=7), unit='m'),
            'year': sc.array(dims=['time'], values=[2020, 2023, 2027]),
        },
    ),
)
dg['year', sc.scalar(2023)].dims   # ('time', 'x')

the data is not sliced. I expected the above to produce a data group with a mapped to the sliced DataArray.

SimonHeybrock commented 4 months ago

Initially scipp.DataArray was only support label-based indexing with dimension-coords (i.e., coord name matching dim names). It seems scipp.DataGroup still does this... and it might actually be hard/unrealistic to change since we would otherwise need to recursively get a listing of all coords of all children. scipp.DataGroup does not "know" about coords, this is not a change I would like to make.

Another approach might be to simply forward all slicing to children, and let them decide if it applies. But that would require not raising on label-based indexing if the coord does not exist, which seems even worse.

jokasimr commented 4 months ago

Initially my feeling was that your last suggestion was the most sensible approach, letting each leaf in the data group determine if the slicing is applicable and behave accordingly. This seems to best match most realistic use cases, for example, slicing multiple different detectors / measurements to a particular time interval.

But this has some drawbacks

SimonHeybrock commented 4 months ago

In conclusion, I think we should not change the DataGroup behavior right now (but maybe add a note in the docstring about the difference to DataArray).

jokasimr commented 4 months ago

Then I guess we should also make the scippnexus label based slicing behave the same, that is, when slicing a group only accept slicing dim-coords, and when slicing a data array accept any coord

SimonHeybrock commented 4 months ago

I think that sounds ok for now!

jokasimr commented 4 months ago

The tests are failing because I included the test that compares to label slicing a data group in scipp, just to illustrate the problem.

I think this is ready for another review.

jokasimr commented 4 months ago

The error messages should probably be changed a bit, they are not very clear.

jokasimr commented 4 months ago

Admittedly this is not pretty. But the tests pass!

Maybe we can refactor it to make it less of a hack.

SimonHeybrock commented 4 months ago

@jokasimr Sorry, I wanted to push to a new branch, but it ended up here! See my suggestion for simplification. Note small difference to exceptions we get from DataGroup, but to me the message makes sense, so I think we could take this option?

jokasimr commented 4 months ago

No worries!

I don't think it just raises a different exception type, it also raises an exception in cases when data group returns a value. Or am I misreading the errors in the CI test results?

SimonHeybrock commented 4 months ago

No worries!

I don't think it just raises a different exception type, it also raises an exception in cases when data group returns a value. Or am I misreading the errors in the CI test results?

Which one?

jokasimr commented 4 months ago

All of the five test cases hat failed. The failed cases raise on the line

assert_identical(nx[_slice], dg)

If the test reaches that line the scipp data group was sliced without any exception being raised. So if the nexus slicing raises on that line it is not because the error was of the wrong type, but because it raised when slicing the scipp data group didn't.

SimonHeybrock commented 4 months ago

All of the five test cases hat failed. The failed cases raise on the line

assert_identical(nx[_slice], dg)

If the test reaches that line the scipp data group was sliced without any exception being raised. So if the nexus slicing raises on that line it is not because the error was of the wrong type, but because it raised when slicing the scipp data group didn't.

I know, but to me the exception looks sensible, e.g., scipp._scipp.core.DimensionError: 'time' used for indexing not found in dataset dims ('x',).

jokasimr commented 4 months ago

Yeah okay then I misunderstood your original message, I thought you meant that the exception types were different.

Spontaneously I think it's better if slicing in scipp and slicing in scippnexus works the same. Having another set of rules in this case is confusing and should be avoided if possible.

SimonHeybrock commented 4 months ago

Yeah okay then I misunderstood your original message, I thought you meant that the exception types were different.

Spontaneously I think it's better if slicing in scipp and slicing in scippnexus works the same. Having another set of rules in this case is confusing and should be avoided if possible.

Last week we agreed that consistency within ScippNexus is more relevant than consistency with Scipp. We were considing to later improve Scipp's behavior, if it is inconsistent.

In the above example, what behavior would you expect instead, instead of this exception?