scipp / scippnexus

h5py-like utility for NeXus files with seamless scipp integration
https://scipp.github.io/scippnexus/
BSD 3-Clause "New" or "Revised" License
3 stars 3 forks source link

`NXdata` group `@axes` attribute #196

Closed g5t closed 4 months ago

g5t commented 4 months ago

How @axes should be encoded/interpreted has been discussed previously in #145

Comparing what the NeXus standard says regarding how the @axes attribute should be encoded when applied to the 'DATA' dataset and its containing NXdata group, there is some ambiguity.

Namely, notice that both use the term 'array'

Defines the names of the coordinates (independent axes) for this data set as a colon-delimited array.

NXdata/data @axes

When axes contains multiple strings, it must be saved as an actual array of strings and not a single comma separated string.

NXdata @axes

Where this becomes ambiguous, is in that HDF5 allows for array-valued attributes (which h5py reads as numpy arrays), and scippnexus has interpreted both 'colon-delimited array' and 'actual array of strings' to specify a single string with axes names separated by literal ':' characters.

Issue

If someone interprets 'actual array of strings' to mean 'array-valued attribute with string elements', then creates a NeXus HDF5 file, like the one attached; that file can not be opened by scippnexus due to the use of the split method here: https://github.com/scipp/scippnexus/blob/58ffbbe67e90bf3341a7589e89e00e36859ba712/src/scippnexus/nxdata.py#L160

Possible solution

Add a check that self._signal_axes is a str before attempting to use .split, or alternately a check if it is a numpy array, e.g., replacing the quoted line 160 by

            self._signal_axes = tuple(self._signal_axes if isinstance(self._signal_axes, np.ndarray) else self._signal_axes.split(':'))

Alternate solution

If the interpretation of 'actual array of strings' should be limited to 'colon-delimited array' like the 'DATA' dataset attribute, petition the NeXus committee to clarify their boxed-note for the group attribute.

g5t commented 4 months ago

I failed to attach this file previously: monitor.h5.zip

:warning: Other unresolved issues prevent scippnexus from loading the group at /entry/instrument/007_frame_0

SimonHeybrock commented 4 months ago

Where this becomes ambiguous, is in that HDF5 allows for array-valued attributes (which h5py reads as numpy arrays), and scippnexus has interpreted both 'colon-delimited array' and 'actual array of strings' to specify a single string with axes names separated by literal ':' characters.

That is not correct. Scipp implements both, see the comments in the relevant code: https://github.com/scipp/scippnexus/blob/58ffbbe67e90bf3341a7589e89e00e36859ba712/src/scippnexus/nxdata.py#L153-L164

NXdata@axes is an "actual array of strings" (as the standard says), and the older NXdata/DATA@axes is a "colon-delimited array".

Issue

If someone interprets 'actual array of strings' to mean 'array-valued attribute with string elements', then creates a NeXus HDF5 file, like the one attached; that file can not be opened by scippnexus due to the use of the split method here:

The self._signal_name corresponds to the NXdata/DATA@axes attr, not the NXdata@axes attr. The former is a legacy way of defining axes, and I am not sure it is worth adding workarounds for broken legacy files?

Add a check that self._signal_axes is a str before attempting to use .split, or alternately a check if it is a numpy array, e.g., replacing the quoted line 160 by

            self._signal_axes = tuple(self._signal_axes if isinstance(self._signal_axes, np.ndarray) else self._signal_axes.split(':'))

If we fail to load (instead of using a fallback) then I agree we should improve that. However I am hesitant to interpret such a non-standard attribute. Where do we draw the line?

As a workaround for loading such files, note that it is possible to pass customized application definitions to snx.File. One could, e.g., remove the NXdata from the defs, so it will be loaded as a plain group:

import scippnexus as snx

my_definitions = snx.base_definitions()
del my_definitions['NXdata']
with snx.File(filename, definitions=my_definitions) as f:
    dg = f[()]
g5t commented 4 months ago

I see now that only NXdata@axes is recommended -- but why this information is not also listed on the NXdata page is a mystery.

My issue seems to stem from having a NXdata group nested inside of a NXmonitor, and the internal group getting picked-up as the monitor's self._signal_axes. Is it not valid to replace NXmonitor/data (an optional NX_NUMBERby a more-completeNXdata` group?