nexusformat / definitions

Definitions of the NeXus Standard File Structure and Contents
https://manual.nexusformat.org/
Other
26 stars 55 forks source link

Change to the axes attribute meaning #1381

Open jacobfilik opened 2 months ago

jacobfilik commented 2 months ago

Hello,

It has recently been raised at DLS that the documentation for NXdata now explicitly says that the axes attribute does not have to be the same length as the rank of the signal dataset.

I do not understand the reasoning for this change, the check that the axes attribute length matches the rank of the signal dataset has been a validation check we have used since the 2014 changes to the NXdata.

I have seen there is lots of discussion around this change in various issue and PRs, but I do not believe it is a sensible choice, since the length of the axes attribute matching the rank of the signal is a simple check (and if it wasn't necessary why have the "." marker at all).

In the docs, there are three examples where the axes attribute has been commented with ""the order does NOT matter" - the latter two showing the issue raised.

The first example looks reasonable, an x and y value for all values in the grid - likely to be the readback values from the stages in a grid scan (although if this was the case it would probably have been preferable to include the _set demand axes as well and tag them as axes for each dimension - if it is not a "meshed" grid scan I would wonder why all the datasets were not vectors).

The second example I would argue you should still have [x, . , y] as the axes attribute, so the _indices are consistent with that attribute - then at least software that doesn't understand the _indices would still correctly parse the NXdata.

The third example I don't know why the axes attribute is not ["x", "y", "energy"] or ["x", "y", "wavelength"] or ["x", "y", "energy", "."] - with the writer of the file deciding the priority (or which default axis to show - either wavelength or energy or just index). Again, any software that expects axes attribute length to match the signal rank would parse this, and anything that understands _indices should give the option to let the user change the axis for the final dimension.

With this change the axes attribute just becomes a list of axes names (which I can get by parsing the NXdata for anything ending in _indices). The first thing I have to do is now check axes length == signal rank and if it doesnt I completely ignore the axes attribute because it tells me nothing.

Can you give an explicit use case for this change of behaviour, or explain why the above interpretation is wrong?

jacobfilik commented 2 months ago

Also, the statement:

"The names of all [AXISNAME] fields are listed in the [axes] attribute."

Is I believe a big change, and has made all the files where we write read back and target/set/demand axes like (for a line scan in x):

data:NXdata
  @signal = "data"
  @axes = ["x_set"]
  @x_indices = [0]
  @x_set_indices = [0]
  data: float[10]
  x_set: float[10]
  x: float[10]

Are now not valid NeXus.

jacobfilik commented 2 months ago

Also, from the 2014 spec here:

axes:

String array that defines the independent data fields used in default plot for all of the dimensions of the signal field. One entry is provided for every dimension in the signal field.

The third example provides two entries for dim 3.

https://www.nexusformat.org/2014_axes_and_uncertainties.html

PeterC-DLS commented 1 week ago

An older version of the definition said:

String array that defines the independent data fields used in the default plot for all of the dimensions of the signal field (the signal field is the field in this group that is named by the signal attribute of this group). One entry is provided for every dimension in the signal field.

woutdenolf commented 1 week ago

So if I understand the issue: the number of elements in the axes used to be required to be equal to the rank of the signal and this limitation has been lifted.

The introduction of axes spanning more than one dimension and a single dimension that can be covered by multiple axes (multi-dimensional axes or not) prompted to lift this limitation. I don't see how this makes your files invalid. Could you provide an example of an NXdata structure that used to be valid and is no longer valid?

PeterC-DLS commented 1 week ago

If your application parsed the older NXdata correctly, then it would be broken by

  1. @axes mismatch check of array length with rank
  2. the list is not a default selection for the axes used in the plot
  3. each listed axis may no longer have corresponding shape at its dimension (though this is alleviated by its AXISNAME_indices attribute)
woutdenolf commented 1 week ago

Ok so the problem is not that old files are no longer valid. The problem is that existing software my not be able to read new files which take advantage of the new relaxed constraints on axes.

jacobfilik commented 1 week ago

OK, lots to go through here:

The introduction of axes spanning more than one dimension and a single dimension that can be covered by multiple axes (multi-dimensional axes or not) prompted to lift this limitation

I don't understand this argument - and hence the reason for this format change. I don't understand how _indices alone doesn't unambiguously describe these cases (axes could literally be [.,.,.]). The axes attribute just being an under-ordered list of axes names give no additional information to just parsing the NXdata for all the _indices tags.

Ok so the problem is not that old files are no longer valid.

Maybe I was interpreting the new specification incorrectly. We use _indices to assign multiple and multidimensional axes to dimensions of the signal, and at the same time have axes length matching the signal rank. I was assuming that the new specification meant that in this case the axes tag should be a list containing of all the _indices datasets. Ours is not, so if I was interpreting the spec correctly they are not valid NeXus files.

If this is not the correct interpretation, can you explain when axes should be an ordered list matching the rank of the signal dataset and when it should not?

woutdenolf commented 1 week ago

Code is less ambiguous than words. In the MR that refactored NXdata https://github.com/nexusformat/definitions/pull/1213 I added this snippet that takes an NXdata group and returns a dictionary of axis names and the dimensions they span:

from typing import Dict, List, Sequence
import h5py

def get_axes_dims(nxdata: h5py.Group) -> Dict[str,List[int]]:
    axes_data_dims = dict()
    axes = nxdata.attrs.get("axes", list())
    if isinstance(axes, str):
        axes = [axes]
    for axisname in set(axes):
        if axisname == ".":
            continue
        data_dims = nxdata.attrs.get(f"{axisname}_indices", None)
        if data_dims is None:
            data_dims = [i for i,s in enumerate(axes) if s == axisname]
        elif not isinstance(data_dims, Sequence):
            data_dims = [data_dims]
        else:
            data_dims = list(data_dims)
        axes_data_dims[axisname] = data_dims
    return axes_data_dims

In other words, the position of AXISNAME in @axes is a fallback in case AXISNAME_indices does not exist.

woutdenolf commented 1 week ago

In my experience long gitlab discussions like this don't help very much but since you insist, here we go:

since the length of the axes attribute matching the rank of the signal is a simple check

Sure it is simple, but what does it validate? You want to check that the field mentioned in attributes axes, signal and auxiliary_signals fields match right?

So I would assume you want to check whether the shape of an AXISNAME field is equal to the shape of the dimensions of signal it spans.

How is the check length(axes) == rank(signal) helping?

In the past AXISNAME was always a 1D dataset. Indeed in that case checking length(axes) == rank(signal) gives you part of the validation (although not enough since you still need to check the lengths of the AXISNAME).

Since multi-dimensional axes were introduced, what does the check len(axes) == signal.ndim tell you?

The first example looks reasonable, an x and y value for all values in the grid - likely to be the readback values from the stages in a grid scan (although if this was the case it would probably have been preferable to include the _set demand axes as well and tag them as axes for each dimension - if it is not a "meshed" grid scan I would wonder why all the datasets were not vectors).

I believe you are talking about this

data:NXdata
  @signal = "data"
  @axes = ["x", "y"]  --> the order does NOT matter
  @x_indices = [0, 1]
  @y_indices = [0, 1]
  data: float[10,20]
  x: float[10,20]     --> coordinates along both dimensions
  y: float[10,20]     --> coordinates along both dimensions

It is a 2D meshed grid. Although the requested scan would be a regular 2D grid, the actual motor positions you get from the encoder values are not regular. So each point in the 2D map has unique [X,Y] coordinates. At the ESRF all scan data is like that, mapping, tomography, whatever.

Suppose you say order does matter. Which order is correct, ["x", "y"] or ["y", "x"]? Unless we add meaning to the order of axes, I don't see why one would be correct and the other not. Hence "the order does not matter".

data:NXdata
  @signal = "data"
  @axes = ["x", "z"]       --> the order does NOT matter
  data: float[10,20,30]
  @x_indices = 0
  @z_indices = 2
  x: float[10]             --> coordinates along the first dimension
  z: float[30]             --> coordinates along the third dimension

The second example I would argue you should still have [x, . , y] as the axes attribute, so the _indices are consistent with that attribute - then at least software that doesn't understand the _indices would still correctly parse the NXdata.

I believe you are talking about this

data:NXdata
  @signal = "data"
  @axes = ["x", ".", "z"]  --> the order matters
  data: float[10,20,30]
  x: float[10]             --> coordinates along the first dimension
  z: float[30]             --> coordinates along the third dimension

and the equivalent

data:NXdata
  @signal = "data"
  @axes = ["x", "z"]       --> the order does NOT matter
  data: float[10,20,30]
  @x_indices = 0
  @z_indices = 2
  x: float[10]             --> coordinates along the first dimension
  z: float[30]             --> coordinates along the third dimension

And as you mentioned there is the third option to support software that does not handle AXISNAME_indices

data:NXdata
  @signal = "data"
  @axes = ["x", ".", "z"]   --> the order does NOT matter
  data: float[10,20,30]
  @x_indices = 0
  @z_indices = 2
  x: float[10]             --> coordinates along the first dimension
  z: float[30]             --> coordinates along the third dimension

Sure fair enough. All three options are valid and the last option supports both old and new software.

The third example I don't know why the axes attribute is not ["x", "y", "energy"] or ["x", "y", "wavelength"] or ["x", "y", "energy", "."] - with the writer of the file deciding the priority (or which default axis to show - either wavelength or energy or just index). Again, any software that expects axes attribute length to match the signal rank would parse this, and anything that understands _indices should give the option to let the user change the axis for the final dimension.

I believe you are talking about this

data:NXdata
  @signal = "data"
  @axes = ["x", "y", "energy", "wavelength"]  --> the order does NOT matter
  @x_indices = 0
  @y_indices = 1
  @energy_indices = 2
  @wavelength_indices = 2
  data: float[10,20,30]
  x: float[10]              --> coordinates along the first dimension
  y: float[20]              --> coordinates along the second dimension
  energy: float[30]         --> coordinates along the third dimension
  wavelength: float[30]     --> coordinates along the third dimension

Since you want length(axes) == rank(signal) then ["x", "y", "energy", "."] is not valid right?

This NXdata instance says that energy and wavelength span one dimension, namely dimension 2, the last dimension.

One argument that I always bump into when suggesting improvements to NXdata is:

NXdata provides data and coordinates to be plotted but does not describe how the data is to be plotted or even the dimensionality of the plot. https://www.nexusformat.org/NIAC2018Minutes.html#nxdata-plottype–attribute

With this change the axes attribute just becomes a list of axes names (which I can get by parsing the NXdata for anything ending in _indices).

Not really. The position of AXISNAME in @axes is a fallback in case AXISNAME_indices does not exist.

So to know which axes an NXdata plot has, you need to take all values from @axes, the loop through all attribute, check whether they end with the string "_indices" and then take the prefix to be another axis. Apart from this being a very convoluted and unintuitive way of defining which fields are axes, someone could add any attribute that happens to end with "_indices" and which has nothing to do with axes. I know it's a corner case but don't your agree it smells?

The first thing I have to do is now check axes length == signal rank and if it doesn't I completely ignore the axes attribute because it tells me nothing.

Why do you have to do this? As I discussed in the beginning, length(axes) == rank(signal) doesn't tell you anything in the presence of multi-dimensional axes and dimensions that can be spanned by more than one axis.

Can you give an explicit use case for this change of behavior, or explain why the above interpretation is wrong?

I hope the examples listed above make this clear

"The names of all [AXISNAME] fields are listed in the [axes] attribute." Are now not valid NeXus.

data:NXdata
  @signal = "data"
  @axes = ["x_set"]
  @x_indices = [0]
  @x_set_indices = [0]
  data: float[10]
  x_set: float[10]
  x: float[10]

This is still valid NeXus. You have one axis called x_set which has 10 values and one signal (i.e. no auxiliary signals) called data which has 10 values. Perfectly valid NeXus.

The fact that we also have x and x_set_indices is not interpreted as being part of the NXdata definition yes. I don't know any NeXus plotting software that goes looking for attributes ending with _indices to figure out other potential axes. As I mentioned before, that's a very convoluted and unintuitive way of defining which fields are axes.

I don't understand this argument

From all the above I think it is clear that length(axes) == rank(signal) does not mesh with axes spanning more than one dimension and a single dimension that can be covered by multiple axes

I don't understand how _indices alone doesn't unambiguously describe these cases

As I mentioned before, that's a very convoluted and unintuitive way of defining which fields are axes.

For signals we have @signal and @auxiliary_signals. And for axes we only have @axes.

We use _indices to assign multiple and multidimensional axes to dimensions of the signal, and at the same time have axes length matching the signal rank. I was assuming that the new specification meant that in this case the axes tag should be a list containing of all the _indices datasets. Ours is not, so if I was interpreting the spec correctly they are not valid NeXus files.

See the example

data:NXdata
  @signal = "data"
  @axes = ["x_set"]
  @x_indices = [0]
  @x_set_indices = [0]
  data: float[10]
  x_set: float[10]
  x: float[10]

Still valid NeXus as I mentioned.

can you explain when axes should be an ordered list matching the rank of the signal dataset and when it should not

In the absence of AXISNAME_indices you fall back to the position(s) of AXISNAME in @axes.

woutdenolf commented 1 week ago

Regardless of all this, since https://github.com/nexusformat/definitions/pull/1213 was not voted on because we thought it didn't change the standard we should explicitly mention that length(axes) == rank(signal) and also that to find out all the axes it is not enough to look at @axes. After changing this we can make a MR to revert the changes and vote on it.

woutdenolf commented 1 week ago

@jacobfilik Perhaps a shorter way to point to the issues or at the least the undefined behavior when requiring length(axes) == rank(signal): what would the @axes attribute be in this case and which part of the 2014 spec defines this?

data:NXdata
  @signal = "data"
  @axes = [?, ?, ?]  # length(axes) == rank(signal)`
  data: float[10, 20, 1024]
  x: float[10, 20]
  y: float[10, 20]
  data: float[10, 20]
  @x_indices = [0, 1]
  @y_indices = [0, 1]

In other words, if the length of @axes in this example needs to be 3 instead of a set of axes names with positional fallback in case _indices is missing, then what is the meaning of the first, second and third value of @axes in this example?

It seems clear now that length(axes) == rank(signal) used to be a requirement, which somehow survived the introduction of multi-dimensional axes and dimensions with more than one axis. Since the refactoring in #1213 changed that without vote we need to put it back.

But when we put it back we need to clearly define what it means. Since you are clearly convinced about this @axes requirement, could you provide an internally consistent definition with no room for interpretation on what it means for an axis name to be coupled to a signal dimension as a result of its place in @axes? Or coupled to more than one signal dimension if it appears in @axes more than once. Is that even allowed? Note that in the example, y is not an alternative axis to x. Think of this as a 10 x 20 XRF map (MCA with 1024 channels) and x and y are the effective motor positions derived from the encoders. FYI this is the most common scan data produced at the ESRF so definitely not a corner case.

jacobfilik commented 1 week ago

@woutdenolf In your example the axes attribute can literally be [.,.,.] [x,x,.] [x,y,.] it does not matter because you have explicitly set the _indices attributes. If _indices is missing you cant do anything in this case, and I would say that is not a valid file unless axes is [.,.,.] (_indices for all axes was a soft recommendation if i remember correctly, writers should add it, but readers should cope if its missing if possible, which in this case it is not).

Assuming in your example x is the fast axes, I would argue that the axes attribute should be [y,x,.] because at least then when you see x and y are 2D you can take the average/first line of x and y and get sensible axes for your signal.

Even better than that, since its a grid scan, i would recommend including the set axes as well, like:

data:NXdata @signal = "data" @axes = [y_set, x_set, .] data: float[10, 20, 1024] x: float[10, 20] x_set: float[20] y: float[10, 20] y_set:float[10] data: float[10, 20] @x_set_indices = [1] @y_set_indices = [0] @x_indices = [0, 1] @y_indices = [0, 1]

since signal is a grid, so will be best visualised against the requested linear axes, rather than the readback values which will only be as linear as your control system is configured/forced to make them. This is how we have been aspiring to write xrf grid scans for many years.

In the nexusformat example data repo there is an example file from diamond showing this:

https://github.com/nexusformat/exampledata/blob/master/DLS/p45/hdf5/p45-1168.nxs

Since you are clearly convinced about this @axes requirement,

Do you disagree that this is what is rewritten in the 2014 spec? My third comment I thought showed it pretty explicitly. If you can demonstrate why your proposal to change this 10 year old meaning of the standard is an improvement I would be happy to use it, but I have not seen that case. I am quite concerned that no one else from the NIAC (except Pete) has opinions on what I thought was one of the core parts of the standard.

I agree that the documentation could use improvement in this area and am happy to contribute. I am not part of the NIAC so have not been involved in the discussions so far, but am happy to join a call to discuss this issue.

woutdenolf commented 1 week ago

Maybe something like this?

The @axes attribute provides the default AXISNAME fields corresponding to each dimension of the signal. Each AXISNAME can have a corresponding @AXISNAME_indices which describes the signal dimension(s) spanned by the axis, hence allowing the definition of alternative axes for the signal dimension(s).

I guess that would be internally consistent. I'll think about it some more and make a proposal.

With that definition the example can be expressed as

data:NXdata
  @signal = "data"
  @axes = ["x", "x", "."]
  data: float[10, 20, 1024]
  x: float[10, 20]
  y: float[10, 20]
  data: float[10, 20]
  @x_indices = [0, 1]
  @y_indices = [0, 1]

or as

data:NXdata
  @signal = "data"
  @axes = ["y", "y", "."]
  data: float[10, 20, 1024]
  x: float[10, 20]
  y: float[10, 20]
  data: float[10, 20]
  @x_indices = [0, 1]
  @y_indices = [0, 1]

This would mean none of the ESRF data can be expressed as NXdata. Too bad but as I said, if we want to change the standard it needs a vote so we'll need to live with that for now.

jacobfilik commented 1 week ago

Ha, we commented at exactly the same time.

This is getting a bit messy for an issue, but for your above example I dont see why ["y","x","."] or even [".",".","."], or the two cases you gave would not be completely valid NeXus. Why is this stopping ESRF data being written as NXdata?

PeterC-DLS commented 1 week ago

Code is less ambiguous than words. In the MR that refactored NXdata #1213 I added this snippet that takes an NXdata group and returns a dictionary of axis names and the dimensions they span:

from typing import Dict, List, Sequence
import h5py

def get_axes_dims(nxdata: h5py.Group) -> Dict[str,List[int]]:
    axes_data_dims = dict()
    axes = nxdata.attrs.get("axes", list())
    if isinstance(axes, str):
        axes = [axes]
    for axisname in set(axes):
        if axisname == ".":
            continue
        data_dims = nxdata.attrs.get(f"{axisname}_indices", None)
        if data_dims is None:
            data_dims = [i for i,s in enumerate(axes) if s == axisname]
        elif not isinstance(data_dims, Sequence):
            data_dims = [data_dims]
        else:
            data_dims = list(data_dims)
        axes_data_dims[axisname] = data_dims
    return axes_data_dims

In other words, the position of AXISNAME in @axes is a fallback in case AXISNAME_indices does not exist.

Isn't this code buggy for the new all-axes-choices case where len(@axes) > rank?

PeterC-DLS commented 1 week ago

So if I understand the issue: the number of elements in the axes used to be required to be equal to the rank of the signal and this limitation has been lifted.

The introduction of axes spanning more than one dimension and a single dimension that can be covered by multiple axes (multi-dimensional axes or not) prompted to lift this limitation. I don't see how this makes your files invalid. Could you provide an example of an NXdata structure that used to be valid and is no longer valid?

The case of multi-dimensional axis fields was expounded in the 2014 document so it is not a new case that can prompt the change of rank-@axes requirement.

woutdenolf commented 1 week ago

Why is this stopping ESRF data being written as NXdata?

Because we have no sensible way of knowing what axes an NXdata group has. Looping over the attributes and checking whether they end with the _indices string is just not sensible and nothing prevents adding @anyname_indices="whatever".

I dont see why ["y","x","."] or even [".",".","."], or the two cases you gave would not be completely valid NeXus

Ok so this is the crux of the matter. If anything goes as long length(axes) == rank(signal), then what is @axes for? And what does this check tells you about the validity of the data?

Could you provide a definition for @axes?

woutdenolf commented 1 week ago

Assuming in your example x is the fast axes, I would argue that the axes attribute should be [y,x,.] because at least then when you see x and y are 2D you can take the average/first line of x and y and get sensible axes for your signal.

Ok this in combination with "I dont see why ["y","x","."] or even [".",".","."], or the two cases you gave would not be completely valid NeXus" completely breaks any logic from my pov. So if you could provide a definition of @axes it would help me understand you understanding.

woutdenolf commented 1 week ago

Isn't this code buggy for the new all-axes-choices case where len(@axes) > rank?

I don't see how but maybe I don't understand what the "all-axes-choices" is

PeterC-DLS commented 1 week ago

Isn't this code buggy for the new all-axes-choices case where len(@axes) > rank?

I don't see how but maybe I don't understand what the "all-axes-choices" is

That's @axes being the list of all possible AXISNAMEs case. Then

data_dims = [i for i,s in enumerate(axes) if s == axisname]

can contain integers > rank and also be of length > rank.

It is my opinion that @axes was meant to a list of the AXISNAMEs used as coordinate values in the corresponding dimensions of the default plot. This is what the 2014 proposal states.

woutdenolf commented 1 week ago

If you can demonstrate why your proposal to change this 10 year old meaning of the standard is an improvement

Because at least it defines @axes and @AXISNAME_indices in an unambiguous way:

The @axes attribute provides all the AXISNAME fields in the NXdata group, the corresponding @AXISNAME_indices attributes provide the signal dimensions they span and when missing the position(s) of AXISNAME in @axes are taken as the signal dimensions spanned by AXISNAME.

So if this is not a correct definition we need to fix it, but then I would like the see the correct definition. And not one based on examples, an actual definition like the one I provided.

And honestly, neither myself nor any of the reviewers realized we were changing the definition. And the reason imo is that NXdata was obfuscated to keep appearances up but did not make any sense logically. I've spend years trying to understand NXdata and still got it wrong apparently.

It is my opinion that @axes was meant to a list of the AXISNAMEs used as coordinate values in the corresponding dimensions of the default plot.

Excellent, so can you provide a definition of @axes and @AXISNAME_indices. I've been struggling to make one with this logic. Please share. I'm not going to use the argument I always get: NXdata does not tell you how to plot things.

data_dims = [i for i,s in enumerate(axes) if s == axisname]

This is a fallback when AXISNAME_indices does not exist. So if you omit the AXISNAME_indices and the data_dims are not the dimensions this axis is supposed to cover (like spanning a dimension that does not exist) you cannot omit AXISNAME_indices. I don't see why this makes the code/logic buggy. It is true that software ignoring AXISNAME_indices will break, but not this code. Is that what you meant?

Let me rename data_dims by axisname_indices to make it more clear (note that from this pov, "." is only there for old readers that don't use *_indices)

from typing import Dict, List, Sequence
import h5py

def get_axes_dims(nxdata: h5py.Group) -> Dict[str,List[int]]:
    """Return all the NXdata axes and the signal dimensions they span.
    """
    axes_data_dims = dict()
    axes = nxdata.attrs.get("axes", list())
    if isinstance(axes, str):
        axes = [axes]
    for axisname in set(axes):
        if axisname == ".":
            continue
        axisname_indices = nxdata.attrs.get(f"{axisname}_indices", None)
        if axisname_indices is None:
            axisname_indices = [i for i,s in enumerate(axes) if s == axisname]
        elif not isinstance(axisname_indices, Sequence):
            axisname_indices = [axisname_indices]
        else:
            axisname_indices = list(axisname_indices)
        axes_data_dims[axisname] = axisname_indices
    return axes_data_dims
prjemian commented 1 week ago

Certainly you do not mean "possible"? That could be an enormous list with enumerations for each dimension? The intent of this group is to declare the default visualization is some data. The axes attribute declares the default field associated with each dimension. A data file user could choose these defaults or override with their own.

On Thu, Jun 27, 2024, 11:44 AM Peter Chang @.***> wrote:

Isn't this code buggy for the new all-axes-choices case where @.*** https://github.com/axes) > rank?

I don't see how but maybe I don't understand what the "all-axes-choices" is

That's @axes being the list of all possible AXISNAMEs case. Then

data_dims = [i for i,s in enumerate(axes) if s == axisname]

can contain integers > rank and also be of length > rank.

It is my opinion that @axes was meant to a list of the AXISNAMEs used as coordinate values in the corresponding dimensions of the default plot. This is what the 2014 proposal states.

— Reply to this email directly, view it on GitHub https://github.com/nexusformat/definitions/issues/1381#issuecomment-2195194639, or unsubscribe https://github.com/notifications/unsubscribe-auth/AARMUMAUUQJT7N6BDONKEI3ZJQ6NJAVCNFSM6AAAAABGWNGV4OVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJVGE4TINRTHE . You are receiving this because you are subscribed to this thread.Message ID: @.***>

woutdenolf commented 1 week ago

@prjemian @PeterC-DLS @jacobfilik I think it will be more constructive that each of us writes down the definition of @axes and @AXISNAME_indices without resorting to examples and questions back and forth.

I'll repeat the (wrong) definition I had in mind when refactoring the NXdata docs in https://github.com/nexusformat/definitions/pull/1213

The @axes attribute provides the names of all the AXISNAME fields in the NXdata group. The corresponding @AXISNAME_indices attributes provide the signal dimensions they span and when missing the position(s) of AXISNAME in @axes are taken as the signal dimensions spanned by AXISNAME. The shape of an AXISNAME field is required to be equal to the shape of the spanned signal dimensions.

This definition unambiguously defines what all the axes fields are, what signal dimensions they span and what their shape must be. It does not provide the default axis for a dimension that is spanned by more than one axis.

To be clear with the word "shape" I mean this. Maybe there is a better word for that.

jacobfilik commented 1 week ago

OK. We are getting somewhere.

Looping over the attributes and checking whether they end with the _indices string is just not sensible and nothing prevents adding @anyname_indices="whatever".

Absolutely completely agree with this statement. It does work, but it is not nice and can easily lead to invalid files if one is not careful.

Can I propose a different solution: axes attribute remains as is, but we make some clarification of what it should contain, and we propose a new attribute, maybe _auxiliaryaxes (like auxiliary_signal) which contains your unordered list of all the datasets that should have matching _indices attributes, which should be added when "richer" axis information than can be described with just the axes attribute is required.

woutdenolf commented 1 week ago

Thanks for your proposal @jacobfilik. Sure we can talk about something like auxiliary_axes to make axes findable. But for now I want to focus on fixing the definition to what it was, even if it breaks all my assumptions or makes NXdata completely unusable, I don't care.

I would like to see a definition of axes and AXISNAME_indices that describes what all the axes fields are, what signal dimensions they span and what their shape must be. And then change the docs accordingly and not in a homeopathic way that re-introduces the convoluted mess we had before.

jacobfilik commented 1 week ago

Ok, but I have never had a problem with what is described here:

https://www.nexusformat.org/2014_axes_and_uncertainties.html

Which has the description and examples of multidimensional data and multiple axes per dimension. If this description is not reproduced (or linked to) in the NXdata object description that is a different issue and if there are cases that are poorly defined by this definition I am happy to discuss them.

woutdenolf commented 1 week ago

I have never had a problem with what is described here

Perfect. Then you are in the ideal position to write down the definition of @axes and @AXISNAME_indices in one paragraph like I did but then correct.

jacobfilik commented 1 week ago

The @axes attribute provides the names of the AXISNAME fields to be used the in default plot for all of the dimensions of the signal field. One entry is provided for every dimension in the signal field. When no default axis is available for a particular dimension of the plottable data, a “.” is used in that position. All AXISNAME fields should have a corresponding @AXISNAME_indices attribute providing the signal dimensions they span. The shape of an AXISNAME field is required to be equal to the shape of the spanned signal dimensions.

woutdenolf commented 1 week ago

Ok thanks! The definition leaves a couple of things open

  1. It does not define what all the possible AXISNAME fields are. Could you add this?
  2. It seems like this makes @AXISNAME_indices mandatory which would change the definition. I assume this was not your intention. If @AXISNAME_indices is optional, what dimensions does AXISNAME span when @AXISNAME_indices is missing?
  3. Could you define what "default plot" is? It is unclear to me. Perhaps we should rather say "The @axes attribute provides the names of the AXISNAME fields to be used as the default axis for all the signal dimensions."?
jacobfilik commented 1 week ago
  1. This is what i would propose auxiliary_axis attribute for
  2. From the spec linked above:

This attribute is to be provided in all situations. However, if the indices attributes are missing, file readers are encouraged to make their best efforts to plot the data. ...Strict writer, liberal reader....

So sounds like it is mandatory for writers? I assume its only optional in the NXdata definition because AXISNAME is also optional i.e. you cant mandate axes when there are no axes.

  1. I assume the default plot is the plot of the signal dataset. Its the wording of the spec linked, not my own. Your words are also fine to me.
prjemian commented 1 week ago

propose auxiliary_axis attribute

Searching the repo for "auxiliary", these seem relevant:

jacobfilik commented 1 week ago

The concept of auxiliary_axes was mentioned on the mailing list back in 2018 during the discussion of auxiliary_signals to compensate for the loss of the signal=n feature that was in the pre-2014 NXdata definition. It was suggested this could be used to provide a separate axis for the auxiliary signal but I think it was agreed that a separate signal with a separate axes should be a separate NXdata. To quote Armando:

It is so flexible that it would even allow to be generalized as auxiliary_signals, auxiliary_axes, so that even the condition of sharing the same axes is not needed and one would be plot, for instance, the raw data and the fitted data with different number of points....

I think now the question to be answered is if the case of not having common axes is ever to be considered of generic interest or not because, unlike the implementation of auxiliary_signals, the implementation of auxiliary_axes would potentially allow signals with different lengths and that would break code supporting auxiliary_signals and not checking for the presence of auxiliary_axes.

So this was never adopted and I think the use of the term auxiliary_axes here better fits with how auxiliary_signals works (i.e. both would be a list of datasets that meet specific criteria).

Having said that I would also be fine with a different name.

rayosborn commented 6 days ago

Thanks to @jacobfilik and @woutdenolf for spending so much time finding ways of making the 2014 axes definitions work. I have never had time to implement this in the NeXpy, because I never knew what to do with multidimensional axes. I spent some time investigating Voronoi plots, such as the following, but I was never sure whether it was worth writing generic software to produce them. voronoi Is this the main use case for multidimensional axes? i.e., when there are errors in measuring on a regular grid? If so, I wonder if it would be better if we completely split the axes from the pixel coordinates. What I mean is that we could accommodate both average and irregular grids by adding a completely new attribute called 'coordinates'. Then an application can decide which to use when plotting.

data:NXdata
    @signal = 'data'
    @axes = ['y_set', 'x_set']
    @coordinates = ['y', 'x']
    data[ny, nx]
    x_set[nx]
    y_set[ny]
    x[ny, nx]
    y[ny, nx]

Of course, we would need to work out how to generalize this to higher dimensions, and decide whether, for example, the length of the coordinates attribute has to be the same as the signal rank. Perhaps there would still be a role for 'x_indices', etc., although even that might not be necessary.

data:NXdata
    @signal = 'data'
    @axes = ['z', 'y_set', 'x_set']
    @coordinates = ['.', 'y', 'x']
    data[nz, ny, nx]
    x_set[nx]
    y_set[ny]
    z[nz]
    x[ny, nx]
    y[ny, nx]

I think the reason why the issues discussed here are so difficult to resolve is that we are trying to make the 'axes' attribute perform too many incompatible functions at the same time.

By the way, I was uncomfortable about the claim that, with the '_indices' attributes, the order wouldn't matter. When plotting in NeXpy, I still want to know what is the x-axis and what is the y, even if they are listed as coordinates.

About 15 years ago, I proposed that we had a separate NXdata subclass, in which the centers of each pixel were defined by coordinates, rather than axes, but I don't think I explained it very well and I never followed it up, unfortunately.

woutdenolf commented 5 days ago

Proposal to rectify the wrongful @axes doc changes: https://github.com/nexusformat/definitions/pull/1396