nexusformat / definitions

Definitions of the NeXus Standard File Structure and Contents
https://manual.nexusformat.org/
Other
26 stars 56 forks source link

NXdata - @AXISNAME_indices confusion #1212

Closed woutdenolf closed 10 months ago

woutdenolf commented 1 year ago
@axes

An axis slice is specified using a field named AXISNAME_indices
as described below (where the text shown here as AXISNAME is to
be replaced by the actual field name).

@AXISNAME_indices

Integer array that defines the indices of the signal field (that field will be
a multidimensional array) which need to be used in the AXISNAME field
in order to reference the corresponding axis value.

No idea what any of this means.

As far as I understand, the number of strings in @axes must be equal to the number of dimensions of the signal as well as the auxiliary signals. This means that there is exactly one axis field for each signal dimension, but an axis field can be re-used for multiple dimensions. Is this correct?

If this is correct, AXISNAME_indices is completely redundant. However we have

@AXISNAME_indices

This attribute is to be provided in all situations.

Which seems to indicate I'm missing something.

woutdenolf commented 1 year ago

https://github.com/nexusformat/definitions/pull/1210#discussion_r991033814

woutdenolf commented 1 year ago

So it seems this is possible (I had no idea)

data_2d:NXdata
  @signal="data"
  @axes=["time", "pressure", "pressure"]
  @time_indices=0
  @pressure_indices=[1, 2]
  data: float[1000,20,30]
  time: float[1000]
  pressure: float[20,30]
woutdenolf commented 1 year ago

I'll work on this once https://github.com/nexusformat/definitions/pull/1210 is resolved

PeterC-DLS commented 1 year ago

So it seems this is possible (I had no idea)

data_2d:NXdata
  @signal="data"
  @axes=["time", "pressure", "pressure"]
  @time_indices=0
  @pressure_indices=[1, 2]
  data: float[1000,20,30]
  time: float[1000]
  pressure: float[20,30]

axes=["time", "pressure", "."] would be a more general example - that is, the last dimension does not have an explicit scale but pressure varies across it.

woutdenolf commented 1 year ago

Ok I think I understand now. I'll make a PR to improve the docs accordingly.

rayosborn commented 1 year ago

Since you have raised the issue, I argued against the _indices attribute, and certainly against it being required, at the NIAC when it was first approved. For the reasons you gave at the top, it is in 99.99% of cases completely redundant, and in NeXpy, I have never encountered a file, for which the axes attribute was not sufficient. It just pollutes the NXdata group for no reason. Since then, I asked members of the committee to provide an algorithm for parsing the _indices attributes, but nobody ever has. In my opinion, it should be ditched, but perhaps this issue is a chance to finally educate me why it is needed.

woutdenolf commented 1 year ago

In my opinion, it should be ditched, but perhaps this issue is a chance to finally educate me why it is needed.

Then we're in the same boat. If it is not used in NeXpy, silx, h5web, ... then I don't see the point either.

Even in the example above where axes=["time", "pressure", "."] you don't really need it.

woutdenolf commented 1 year ago

Nevertheless I will improve the docs. Then we can perhaps discuss deprecation at some NIAC meeting.

prjemian commented 1 year ago

The two of you are in agreement with each other. This has been brought up before and discussed. I'm on travel now and will do the research y'all are missing when I get back. There are cases for this when it is important to have a distinction.

woutdenolf commented 1 year ago

Ok I guess this is an example where the _indices are needed

data_2d:NXdata
  @signal="data"
  @axes=["pressure", "time", "."]
  @time_indices=1
  @pressure_indices=[2, 0]
  data: float[20,1000,30]
  time: float[1000]
  pressure: float[30,20]
prjemian commented 1 year ago

Right, it's pathological for sure. Without _indices, how could you describe this example without ambiguity?

rayosborn commented 1 year ago

I think "pathological" is the right word. At the moment, everyone reading the NeXus manual has to get their head around what the purpose of the _indices attributes are, even though it affects 0.001% of (hypothetical) use cases. If @woutdenolf was confused, how do we expect newcomers to understand it? I argued this quite vociferously at the time (as @prjemian will recall). I didn't understand why these should be compulsory then, and, since nobody has given me any real-world examples where it actually mattered (actual files, not hypothetical examples), I haven't seen any reason to change my mind. If anyone has real-world examples, please post links to them. I genuinely would like something to test possible parsing algorithms on.

woutdenolf commented 1 year ago

What does this mean?

.. note::  Attributes potentially containing multiple values 
(axes and _indices) are to be written as string or integer arrays, 
to avoid string parsing in reading applications.
woutdenolf commented 1 year ago

The global docs of NXdata contain this text

For each data dimension, there should be a one-dimensional array of the same length.

which eliminates the use case above. I believe the global NXdata doc section contains too many things which as a duplicate of what is provided in the docs of the field and attributes. I will propose a significantly reduces version.

woutdenolf commented 1 year ago

I also see the words dimension scales. We should just use coordinates. For example the coordinates of all elements in the signal field or the coordinates in one dimension of the signal field etc.

woutdenolf commented 1 year ago

Also: https://manual.nexusformat.org/classes/base_classes/NXdata.html#nxdata-axisname-field

This forces the axes to have rank 1 which also eliminates the example.

woutdenolf commented 1 year ago

So the documentation contradicts itself. The fundamental question is: can the rank of an axis be larger than 1?

rayosborn commented 1 year ago

Thanks for looking into this so carefully. The dimension scales reference presumably mirrors the use of the term by HDF5. I think I heard that HDF5 dimension scales were introduced partly because of NeXus, although they came up with a different solution to ours. I agree that there is no reason we should use the term in our own documentation.

woutdenolf commented 1 year ago

For now I assume AXISNAME fields can have a rank larger than 1. In this proposal I synchronize all parts of the NXdata documentation to reflect that: https://github.com/nexusformat/definitions/pull/1213

The rendered version: https://hdf5.gitlab-pages.esrf.fr/nexus/nxdata_axes/classes/base_classes/NXdata.html

woutdenolf commented 1 year ago

In case of an axis that spans more than one dimensions, it would make more sense to do this

data_3d:NXdata
  @signal="data"
  @axes=["pressure", "time", "pressure"]
  @time_indices=1
  @pressure_indices=[2, 0]
  data: float[20,1000,30]
  time: float[1000]
  pressure: float[30,20]

than this

data_3d:NXdata
  @signal="data"
  @axes=["pressure", "time", "."]
  @time_indices=1
  @pressure_indices=[2, 0]
  data: float[20,1000,30]
  time: float[1000]
  pressure: float[30,20]

The "." should only be used when the data does not have coordinates along that dimension.

Would that change the current definition?

PeterC-DLS commented 1 year ago

An obvious example is an X-Y scan where the (x,y) coordinates are measured to high precision so x values would be different row-to-row down a column and y would differ along a row. Thus both x and y are 2D axis fields.

woutdenolf commented 1 year ago

FYI At the ESRF we never store data like that. Axes are always stored as a flat 1D arrays.

So suppose you do a 2D scan of 10 x 20 points than x and y are 1D arrays with length 200. Not all 2D scans measure a full 2D grid so you cannot save 2D array [N, M]. There could be skipped sections within the 2D area or the 2D area could be sampled in a different way than a regular grid. Even if you do a regular grid sampling, it is not completely regular anyway as the motor encoder values in x and y will tell you.

The problem we have with this is that NXdata does not support this type of data.

woutdenolf commented 1 year ago

You also need AXISNAME_indices if you want to store scatter data (previous comment) in a NeXus compliant way. We don't do this however: https://github.com/nexusformat/definitions/issues/1214

rayosborn commented 1 year ago

I looked into plotting irregular 2D arrays and even implemented a Voronoi plot in NeXpy as a potential solution, but it never made much sense. The irregular grid plot conveyed no extra information than the regular grid. I can see the need to use accurate coordinates in data analysis, but less so in plotting unless the motor positions are strongly distorted.

A while ago, I proposed that we consider an alternative NXdata-like class for storing counts and data coordinates. This would be something like NXevent_data, but not specific to time-of-flight neutron data. I think at the time, it seemed too hypothetical.

@PeterC-DLS, do you have examples in Diamond of NeXus files storing data in the way you describe?

woutdenolf commented 1 year ago

We did implement something like that in silx but we need to use non-compliant NXdata structures as described in #1214

woutdenolf commented 1 year ago
.. note::  Attributes potentially containing multiple values 
    (axes and _indices) are to be written as string or integer arrays, 
    to avoid string parsing in reading applications.

Ok I think I finally understand what this means. I will propose the following rewording

.. note::  When ``axes`` contains multiple strings, it must be saved as an actual array
    of strings and not a single comma separated string.
.. note::  When ``AXISNAME_indices`` contains multiple integers, it must be saved as an actual array
    of integers and not a comma separated string.
woutdenolf commented 1 year ago

Looking at https://www.nexusformat.org/2014_axes_and_uncertainties.html I found this unexpected example

NXroot
   NXentry
     NXdata
       @signal=“det”
       @axes=“x”,“y”,“tof”
       @tof_indices=2
       @x_indices=0,1
       @y_indices=0,1
       det: float[100,512,100000]
       tof: float[100000]
       x: float[100,512]
       y: float[100,512]

So essentially this means there is no logic in the order of the names in axes? Since both x and y span the first two dimensions, you could write @axes=“x”,“y”,“tof” or @axes=“y”,“x”,“tof”. There is no reason to choose one over the other.

It would make more sense to me to have @axes=“x”,“x”,“tof” or @axes=“y”,“y”,“tof” because we say in the docs that One entry is provided for every dimension in the signal field from which I immediately assume that the name in each dimension denotes the axis that covers that dimension. In other words, it is the 2D equivalent if this 1D example (x/y becomes pressure/temperature)

 NXroot
   NXentry
     NXdata
       @signal=“data”
       @axes=“time”,“pressure”
       @pressure_indices=1
       @temperature_indices=1
       @time_indices=0
       data: float[1000,20]
       pressure: float[20]
       temperature: float[20]
       time: float[1000]

To me it seems that the introduction of AXIXNAME_indices changes the meaning of axes but this was never actually dealt with.

prjemian commented 1 year ago

Isn't there a later version? This was the proposal.

woutdenolf commented 1 year ago

I took these examples from here: https://www.nexusformat.org/2014_axes_and_uncertainties.html which we also link in NXdata so I assume these are the official examples.

rayosborn commented 1 year ago

@woutdenolf, your example above shows that promoting the _indices approach, which as we have both agreed is nearly never required, threatens to completely undermine the previous use of the axes attribute. I have long argued that we should never accept new proposals until someone has (a) shown a real-world (not hypothetical) need for them and (b) provided an algorithm (or even better, Python code) that shows how it would be parsed in real-world plotting applications. Too many of our discussions are abstract and cause unnecessary complications that have not been carefully thought-through. Your work on plot transforms, even providing a Jupyter notebook with worked-through examples, shows how it should be done.

I repeat my earlier request for actual existing examples of files where _indices attributes are essential. I'm not suggesting getting rid of _indices altogether, but I think there is a case for relegating them to special pages describing advanced features of NeXus for arcane use cases.

woutdenolf commented 1 year ago

My proposal would be

  1. Either all axes have _indices or none have
  2. With _indices the order and the number of elements in @axes becomes irrelevant (and also the "." makes no sense)

Example without indices

data:NXdata
  @signal="data"
  @axes=["x", "y", "z"]
  data: float[10,20,30]
  x: float[10]
  y: float[20]
  z: float[30]

Example with indices

data:NXdata
  @signal="data"
  @axes=["x", "y", "z", "temperature"]   # order can be whatever
  data: float[10,20,30]
  x: float[10,30]
  y: float[20]
  z: float[10,30]
  temperature: float[10,20,30]
  x_indices: [0, 2]
  y_indices: 1
  z_indices: [0, 2]
  temperature_indices: [0, 1, 2]
woutdenolf commented 1 year ago

A simpler proposal: the dimensions spanned by AXISNAME are provided by AXISNAME_indices. When this field is missing, AXISNAME_indices defaults to the index or indices of AXISNAME in the attribute @axes.

This means in python, you would do this to get all axes and the dimensions they span:

def get_axes_dims(nxdata: h5py.Group) -> Dict[str,List[int]]:
    axes_data_dims = dict()
    axes = nxdata.attrs.get("axes", list())
    for axisname in set(axes):
        if axisname == ".":
            continue
        data_dims = nxdata.attrs.get(f"{axisname}_indices", None)
        if data_dims is None:
            data_dims = [i for i,s in enumerate(axes) if s == axisname]
        elif not isinstance(data_dims, Sequence):
            data_dims = [indices]
        else:
            data_dims = list(data_dims)
        axes_data_dims[axisname] = data_dims
    return axes_data_dims
woutdenolf commented 1 year ago

One thing is still unclear to me: what is the rank of AXISNAME?

By looking at AXISNAME docs and the examples I see two possibilities.

Possibility 1 (definition in spirit but not in practice)

The rank of AXISNAME must be equal to the number of data dimensions it spans.

This seems to be the intention as shown in these examples. However the official AXISNAME description explicitly states that the rank of AXISNAME must be 1.

A 2D scatter data example:

data:NXdata
    @signal = "data"
    @axes = ["x", "y"]
    @x_indices = [0, 1]
    @y_indices = [0, 1]
    data: float[10,20]
    x: float[10,20]  # rank equal to number of spanned dimensions
    y: float[10,20]  # rank equal to number of spanned dimensions

Possibility 2 (definition in practice but not in spirit)

The rank of AXISNAME must be equal to 1 and the number of values in AXISNAME must be equal to the number of values in the data dimensions it spans

This matches the official AXISNAME description but contradicts these examples.

If this would be the definition you would also need to describe the array order. For example we can say that the order must be C-order (i.e. the last dimension is the major dimension). As there is no mentioning of this anywhere in the documentation, I suspect this was never the intention of the definition.

A 2D scatter data example:

data:NXdata
    @signal = "data"
    @axes = ["x", "y"]
    @x_indices = [0, 1]
    @y_indices = [0, 1]
    data: float[10,20]
    x: float[200]  # rank equal to 1 (flattened in C-order)
    y: float[200]  # rank equal to 1 (flattened in C-order)

Proposal

The rank of AXISNAME must be equal to the number of data dimensions it spans.

Note that this is valid (@x_indices is missing so it is inferred from @axes as [0, 1])

data:NXdata
    @signal = "data"
    @axes = ["x", "x"]
    data: float[10,10]
    x: float[10,10]  # rank equal to number of spanned dimensions

But this is NOT valid (@x_indices is missing so it is inferred from @axes as [0, 1])

data:NXdata
    @signal = "data"
    @axes = ["x", "x"]
    data: float[10,10]
    x: float[100]  # rank NOT equal to number of spanned dimensions

A user would have to do this instead:

data:NXdata
    @signal = "data"
    @axes = ["x", "y"]
    data: float[10,10]
    x: float[10]  # rank equal to number of spanned dimensions
    y -> x         # HDF5 softlink
woutdenolf commented 1 year ago

Another thing that is not clear: the definition of AXISNAME _indices states that it must be an array: "Integer array that defines the indices of ...". However in the example and also these examples we have scalars. In the refactored definition I will state that AXISNAME _indices can be "a single integer or an array of integers".