Closed woutdenolf closed 10 months ago
So it seems this is possible (I had no idea)
data_2d:NXdata
@signal="data"
@axes=["time", "pressure", "pressure"]
@time_indices=0
@pressure_indices=[1, 2]
data: float[1000,20,30]
time: float[1000]
pressure: float[20,30]
I'll work on this once https://github.com/nexusformat/definitions/pull/1210 is resolved
So it seems this is possible (I had no idea)
data_2d:NXdata @signal="data" @axes=["time", "pressure", "pressure"] @time_indices=0 @pressure_indices=[1, 2] data: float[1000,20,30] time: float[1000] pressure: float[20,30]
axes=["time", "pressure", "."]
would be a more general example - that is, the last dimension does not have an explicit scale but pressure varies across it.
Ok I think I understand now. I'll make a PR to improve the docs accordingly.
Since you have raised the issue, I argued against the _indices
attribute, and certainly against it being required, at the NIAC when it was first approved. For the reasons you gave at the top, it is in 99.99% of cases completely redundant, and in NeXpy, I have never encountered a file, for which the axes
attribute was not sufficient. It just pollutes the NXdata group for no reason. Since then, I asked members of the committee to provide an algorithm for parsing the _indices
attributes, but nobody ever has. In my opinion, it should be ditched, but perhaps this issue is a chance to finally educate me why it is needed.
In my opinion, it should be ditched, but perhaps this issue is a chance to finally educate me why it is needed.
Then we're in the same boat. If it is not used in NeXpy, silx, h5web, ... then I don't see the point either.
Even in the example above where axes=["time", "pressure", "."]
you don't really need it.
Nevertheless I will improve the docs. Then we can perhaps discuss deprecation at some NIAC meeting.
The two of you are in agreement with each other. This has been brought up before and discussed. I'm on travel now and will do the research y'all are missing when I get back. There are cases for this when it is important to have a distinction.
Ok I guess this is an example where the _indices
are needed
data_2d:NXdata
@signal="data"
@axes=["pressure", "time", "."]
@time_indices=1
@pressure_indices=[2, 0]
data: float[20,1000,30]
time: float[1000]
pressure: float[30,20]
Right, it's pathological for sure. Without _indices
, how could you describe this example without ambiguity?
I think "pathological" is the right word. At the moment, everyone reading the NeXus manual has to get their head around what the purpose of the _indices
attributes are, even though it affects 0.001% of (hypothetical) use cases. If @woutdenolf was confused, how do we expect newcomers to understand it? I argued this quite vociferously at the time (as @prjemian will recall). I didn't understand why these should be compulsory then, and, since nobody has given me any real-world examples where it actually mattered (actual files, not hypothetical examples), I haven't seen any reason to change my mind. If anyone has real-world examples, please post links to them. I genuinely would like something to test possible parsing algorithms on.
What does this mean?
.. note:: Attributes potentially containing multiple values
(axes and _indices) are to be written as string or integer arrays,
to avoid string parsing in reading applications.
The global docs of NXdata contain this text
For each data dimension, there should be a one-dimensional array of the same length.
which eliminates the use case above. I believe the global NXdata doc section contains too many things which as a duplicate of what is provided in the docs of the field and attributes. I will propose a significantly reduces version.
I also see the words dimension scales
. We should just use coordinates
. For example the coordinates of all elements in the signal field
or the coordinates in one dimension of the signal field
etc.
Also: https://manual.nexusformat.org/classes/base_classes/NXdata.html#nxdata-axisname-field
This forces the axes to have rank 1 which also eliminates the example.
So the documentation contradicts itself. The fundamental question is: can the rank of an axis be larger than 1?
Thanks for looking into this so carefully. The dimension scales
reference presumably mirrors the use of the term by HDF5. I think I heard that HDF5 dimension scales were introduced partly because of NeXus, although they came up with a different solution to ours. I agree that there is no reason we should use the term in our own documentation.
For now I assume AXISNAME
fields can have a rank larger than 1. In this proposal I synchronize all parts of the NXdata documentation to reflect that: https://github.com/nexusformat/definitions/pull/1213
The rendered version: https://hdf5.gitlab-pages.esrf.fr/nexus/nxdata_axes/classes/base_classes/NXdata.html
In case of an axis that spans more than one dimensions, it would make more sense to do this
data_3d:NXdata
@signal="data"
@axes=["pressure", "time", "pressure"]
@time_indices=1
@pressure_indices=[2, 0]
data: float[20,1000,30]
time: float[1000]
pressure: float[30,20]
than this
data_3d:NXdata
@signal="data"
@axes=["pressure", "time", "."]
@time_indices=1
@pressure_indices=[2, 0]
data: float[20,1000,30]
time: float[1000]
pressure: float[30,20]
The "." should only be used when the data does not have coordinates along that dimension.
Would that change the current definition?
An obvious example is an X-Y scan where the (x,y)
coordinates are measured to high precision so x
values would be different row-to-row down a column and y
would differ along a row. Thus both x
and y
are 2D axis fields.
FYI At the ESRF we never store data like that. Axes are always stored as a flat 1D arrays.
So suppose you do a 2D scan of 10 x 20 points than x
and y
are 1D arrays with length 200
. Not all 2D scans measure a full 2D grid so you cannot save 2D array [N, M]. There could be skipped sections within the 2D area or the 2D area could be sampled in a different way than a regular grid. Even if you do a regular grid sampling, it is not completely regular anyway as the motor encoder values in x and y will tell you.
The problem we have with this is that NXdata does not support this type of data.
You also need AXISNAME_indices
if you want to store scatter data (previous comment) in a NeXus compliant way. We don't do this however: https://github.com/nexusformat/definitions/issues/1214
I looked into plotting irregular 2D arrays and even implemented a Voronoi plot in NeXpy as a potential solution, but it never made much sense. The irregular grid plot conveyed no extra information than the regular grid. I can see the need to use accurate coordinates in data analysis, but less so in plotting unless the motor positions are strongly distorted.
A while ago, I proposed that we consider an alternative NXdata-like class for storing counts and data coordinates. This would be something like NXevent_data, but not specific to time-of-flight neutron data. I think at the time, it seemed too hypothetical.
@PeterC-DLS, do you have examples in Diamond of NeXus files storing data in the way you describe?
We did implement something like that in silx but we need to use non-compliant NXdata structures as described in #1214
.. note:: Attributes potentially containing multiple values
(axes and _indices) are to be written as string or integer arrays,
to avoid string parsing in reading applications.
Ok I think I finally understand what this means. I will propose the following rewording
.. note:: When ``axes`` contains multiple strings, it must be saved as an actual array
of strings and not a single comma separated string.
.. note:: When ``AXISNAME_indices`` contains multiple integers, it must be saved as an actual array
of integers and not a comma separated string.
Looking at https://www.nexusformat.org/2014_axes_and_uncertainties.html I found this unexpected example
NXroot
NXentry
NXdata
@signal=“det”
@axes=“x”,“y”,“tof”
@tof_indices=2
@x_indices=0,1
@y_indices=0,1
det: float[100,512,100000]
tof: float[100000]
x: float[100,512]
y: float[100,512]
So essentially this means there is no logic in the order of the names in axes
? Since both x and y span the first two dimensions, you could write @axes=“x”,“y”,“tof”
or @axes=“y”,“x”,“tof”
. There is no reason to choose one over the other.
It would make more sense to me to have @axes=“x”,“x”,“tof”
or @axes=“y”,“y”,“tof”
because we say in the docs that One entry is provided for every dimension in the signal field
from which I immediately assume that the name in each dimension denotes the axis that covers that dimension. In other words, it is the 2D equivalent if this 1D example (x/y becomes pressure/temperature)
NXroot
NXentry
NXdata
@signal=“data”
@axes=“time”,“pressure”
@pressure_indices=1
@temperature_indices=1
@time_indices=0
data: float[1000,20]
pressure: float[20]
temperature: float[20]
time: float[1000]
To me it seems that the introduction of AXIXNAME_indices
changes the meaning of axes
but this was never actually dealt with.
Isn't there a later version? This was the proposal.
I took these examples from here: https://www.nexusformat.org/2014_axes_and_uncertainties.html which we also link in NXdata so I assume these are the official examples.
@woutdenolf, your example above shows that promoting the _indices
approach, which as we have both agreed is nearly never required, threatens to completely undermine the previous use of the axes
attribute. I have long argued that we should never accept new proposals until someone has (a) shown a real-world (not hypothetical) need for them and (b) provided an algorithm (or even better, Python code) that shows how it would be parsed in real-world plotting applications. Too many of our discussions are abstract and cause unnecessary complications that have not been carefully thought-through. Your work on plot transforms, even providing a Jupyter notebook with worked-through examples, shows how it should be done.
I repeat my earlier request for actual existing examples of files where _indices
attributes are essential. I'm not suggesting getting rid of _indices
altogether, but I think there is a case for relegating them to special pages describing advanced features of NeXus for arcane use cases.
My proposal would be
_indices
or none have_indices
the order and the number of elements in @axes
becomes irrelevant (and also the "."
makes no sense)Example without indices
data:NXdata
@signal="data"
@axes=["x", "y", "z"]
data: float[10,20,30]
x: float[10]
y: float[20]
z: float[30]
Example with indices
data:NXdata
@signal="data"
@axes=["x", "y", "z", "temperature"] # order can be whatever
data: float[10,20,30]
x: float[10,30]
y: float[20]
z: float[10,30]
temperature: float[10,20,30]
x_indices: [0, 2]
y_indices: 1
z_indices: [0, 2]
temperature_indices: [0, 1, 2]
A simpler proposal: the dimensions spanned by AXISNAME
are provided by AXISNAME_indices
. When this field is missing, AXISNAME_indices
defaults to the index or indices of AXISNAME
in the attribute @axes
.
This means in python, you would do this to get all axes and the dimensions they span:
def get_axes_dims(nxdata: h5py.Group) -> Dict[str,List[int]]:
axes_data_dims = dict()
axes = nxdata.attrs.get("axes", list())
for axisname in set(axes):
if axisname == ".":
continue
data_dims = nxdata.attrs.get(f"{axisname}_indices", None)
if data_dims is None:
data_dims = [i for i,s in enumerate(axes) if s == axisname]
elif not isinstance(data_dims, Sequence):
data_dims = [indices]
else:
data_dims = list(data_dims)
axes_data_dims[axisname] = data_dims
return axes_data_dims
One thing is still unclear to me: what is the rank of AXISNAME
?
By looking at AXISNAME
docs and the examples I see two possibilities.
The rank of AXISNAME
must be equal to the number of data dimensions it spans.
This seems to be the intention as shown in these examples. However the official AXISNAME
description explicitly states that the rank of AXISNAME
must be 1.
A 2D scatter data example:
data:NXdata
@signal = "data"
@axes = ["x", "y"]
@x_indices = [0, 1]
@y_indices = [0, 1]
data: float[10,20]
x: float[10,20] # rank equal to number of spanned dimensions
y: float[10,20] # rank equal to number of spanned dimensions
The rank of AXISNAME
must be equal to 1 and the number of values in AXISNAME
must be equal to the number of values in the data dimensions it spans
This matches the official AXISNAME
description but contradicts these examples.
If this would be the definition you would also need to describe the array order. For example we can say that the order must be C-order (i.e. the last dimension is the major dimension). As there is no mentioning of this anywhere in the documentation, I suspect this was never the intention of the definition.
A 2D scatter data example:
data:NXdata
@signal = "data"
@axes = ["x", "y"]
@x_indices = [0, 1]
@y_indices = [0, 1]
data: float[10,20]
x: float[200] # rank equal to 1 (flattened in C-order)
y: float[200] # rank equal to 1 (flattened in C-order)
The rank of AXISNAME
must be equal to the number of data dimensions it spans.
Note that this is valid (@x_indices
is missing so it is inferred from @axes
as [0, 1]
)
data:NXdata
@signal = "data"
@axes = ["x", "x"]
data: float[10,10]
x: float[10,10] # rank equal to number of spanned dimensions
But this is NOT valid (@x_indices
is missing so it is inferred from @axes
as [0, 1]
)
data:NXdata
@signal = "data"
@axes = ["x", "x"]
data: float[10,10]
x: float[100] # rank NOT equal to number of spanned dimensions
A user would have to do this instead:
data:NXdata
@signal = "data"
@axes = ["x", "y"]
data: float[10,10]
x: float[10] # rank equal to number of spanned dimensions
y -> x # HDF5 softlink
Another thing that is not clear: the definition of AXISNAME _indices
states that it must be an array: "Integer array that defines the indices of ...". However in the example and also these examples we have scalars. In the refactored definition I will state that AXISNAME _indices
can be "a single integer or an array of integers".
No idea what any of this means.
As far as I understand, the number of strings in
@axes
must be equal to the number of dimensions of the signal as well as the auxiliary signals. This means that there is exactly one axis field for each signal dimension, but an axis field can be re-used for multiple dimensions. Is this correct?If this is correct,
AXISNAME_indices
is completely redundant. However we haveWhich seems to indicate I'm missing something.