nexusformat / definitions

Definitions of the NeXus Standard File Structure and Contents
https://manual.nexusformat.org/
Other
26 stars 56 forks source link

Allow dimensions in NXdata to be lists of strings #1246

Closed phyy-nx closed 8 months ago

phyy-nx commented 1 year ago

Two changes here:

With this change, part of the channel use case proposed by @soph-dec in #940 can be supported. Quoting the documentation for the current FileWriter2 proposal by Dectris with slight modification (the dimensions of data) as an example:

data: NXdata
  @signal = "data"
  @axes = ["image_id", "channel", ".", "."]
  @image_id_indices = 0
  @channel_indices = 1
  image_id = [1, ..., nP]
  channel = ["threshold_1", "threshold_2", "difference"]
  data = uint[nP, nC, i, j]

Where nP is the number of images and nC is the number of channels. Normally the channel field, aka AXISNAME, could only be of type NX_NUMBER, similar to the image_id field, but a set of English names is preferred to identify the channels.

Note, while the change in AXISNAME was approved by vote in nexusformat/NIAC#97, I do not know what effect this has on existing plotting software. For example the DIALS image viewer would need to know what channel to display by default. Also, this could affect the implementation of plot viewing in NeXpy.

We have a Telco tomorrow, so I hope we can discuss those issues then.

Closes #945 and nexusformat/NIAC#97. Partially addresses #940.

phyy-nx commented 1 year ago

Anyone know how to fix the warning that's being generated? :)

prjemian commented 1 year ago

Anyone know how to fix the warning that's being generated? :)

It's a new data type included in the PR.

prjemian commented 1 year ago

As stated in #945:

This needs a branch and pull request since the NIAC has already approved.

phyy-nx commented 1 year ago

Hi all, I had a meeting today about this PR and we have some changes (thanks @graeme-winter, @yayahjb, @soph-dec, and @kalcutter).

First, for 2D images, we note that if the data has the dimensions listed below, then existing general case NeXus readers/plotters should work:

Note the number of dimensions above: 2, 3, 3, and 4. Distinguishing between the two degenerate cases is done by the presence of the NXdetector_module class, which has the fields data_origin and data_size that can be expanded from 2 to three dimensions to handle the multi-panel case.

OK! So here are the four new use-cases that are not compatible with existing general-case plot making tools:

With the number of dimensions 3,4,4,5. Now we have another layer of degeneracy and no way to know how to plot the 2D image.

So the new proposal is to add an attribute @default_slice. You can see the docstring I've proposed in this PR, but briefly, it allows the user to reduce the channel degeneracy, or any degeneracy where there are more dimensions that the plotter can handle, by specifying which subset of the data to show.

In the below example, we combine both using strings as array indices and default_slice:

data: NXdata
  @signal = "data"
  @axes = ["image_id", "channel", ".", "."]
  @image_id_indices = 0
  @channel_indices = 1
  @default_slice = [".", "difference", ".", "."]
  image_id = [1, ..., nP]
  channel = ["threshold_1", "threshold_2", "difference"]
  data = uint[nP, nC, i, j]

So, what do folks think of this alternative?

Note to @mkoennecke, some feedback we received during this meeting was that since the proposed use case of multi-channel data already breaks existing viewers, there's no harm in extending the data type of AXISNAME. Existing software will still read data that doesn't use this feature just fine, but any software that wants to use this feature will need to understand that indices can be strings and which slice to show by default anyway.

phyy-nx commented 1 year ago

Any comments on this PR? Including @default_slice?

Also, I'm afraid I don't know enough about the build system to fix the current build error. Can someone lend a hand? Thanks!

phyy-nx commented 1 year ago

(updated above comment to indicate it's the data type of AXISNAME that is being extended to include strings, not AXISNAME_indices)

soph-dec commented 1 year ago

Also, I'm afraid I don't know enough about the build system to fix the current build error. Can someone lend a hand? Thanks!

Adding nxdl:NX_CHAR_OR_NUMBER to the members of primitiveType solved the build error for me locally.

phyy-nx commented 1 year ago

Thanks @soph-dec! Checks now pass.

woutdenolf commented 11 months ago

Is this correct?

Should be an array of strings of length equal to the number of dimensions in the data, with the following possible values:

    * ".": All the data in this dimension should be included
    * Integer: Only this slice should be used.
    * String: Only this slice should be used. Use if ``AXISNAME`` is a string array.

Isn't the idea to always have integers but saved as NX_CHAR because we need to support "."? Why does it depend on the type of AXISNAME? That won't work btw because you can have multiple axes that span a single data dimension, and they could be of type NX_NUMBER and NX_CHAR.

Edit: maybe we should use -1 instead of "." so @default_slice can be a list of integers?

phyy-nx commented 11 months ago

@woutdenolf I've added the type NXCHAR_OR_NUMBER to @default_slice which I think resolves your concern. This matches the type of of AXISNAME. Note I think hdf5 not allow you to mix ints and chars in an attribute array, but I don't think we need to call that out here.

woutdenolf commented 11 months ago

Ok thanks for the clarification. So basically default_slice has as many values as data dimensions and each value is either

phyy-nx commented 11 months ago

Ok thanks for the clarification. So basically default_slice has as many values as data dimensions and each value is either

  • "."
  • "<str>" (for example "difference")
  • "<int>" (for example "2")
  • <int> (for example 2)

Yup, that's my understanding. Thanks!

phyy-nx commented 9 months ago

Conflicts fixed, mostly trivial, with one re-word. @woutdenolf can you re-check this? Thanks!

phyy-nx commented 9 months ago

This PR is ready for a NIAC vote. Please vote with an emoji on this comment. Options include thumbs up for yes, thumbs down for no, and anything else to abstain. Thank you. Voting will close in two weeks.

paulmillar commented 9 months ago

Sorry, at the risk of exposing my ignorance, I have a few questions.

Should be an array of length equal to the number of dimensions in the data

This statement seems somewhat ambiguous to me.

Does the "should" refer to an expectation that @default_slice is an array, or does "should" refer to an expectation that if @default_slice is an array then that array has a length equal to the number of dimensions in the data?

What does the word "should" actually mean? Are we following RFC 2119 semantics, or is this meant as a more emphatic statement?

@default_slice is defined as having type NX_CHAR_OR_NUMBER. Does this mean it is valid for @default_slice to be an array of integers (rather than an array of strings)?

Given @default_slice as an array of strings, how are numerical value distinguished from an axis names? For example,

What are the processing expectations if AXISNAME is an array of NX_CHAR that contains an element "." and the @default_slice array contains the element "."?

What are the processing expectations if AXISNAME is an array of NX_CHAR, the first element is "2" and the @default_slice array contains the element "2"?

What are the processing expectations if AXISNAME is an array of NX_CHAR with two elements having the same value and that value is specified in @default_slice ?

phyy-nx commented 9 months ago

Thank you @paulmillar! Note that at this point in we are in the middle of a vote so it is too late to make changes to this PR, otherwise it would be unfair to those that have already voted. That has been the protocol at live NIAC meetings, where once a vote has been called for and seconded it has to finish. I don't see why it wouldn't be the case for an online vote as well. Indeed this PR has already been discussed and reviewed so I think that's fair.

THAT SAID, your questions are good, and I think you've identified one change I would have made in this PR. However, I think the change is clarification, and not substance, so I think it would be fine as a separate PR that doesn't need a vote. More below.

What does the word "should" actually mean? Are we following RFC 2119 semantics, or is this meant as a more emphatic statement?

Should is a requirement. I had never seen RFC 2119 before, thanks for pointing it out! Just to check, I reviewed NXmx to find usages of the word "should" and found 10. Each could equally be considered ambiguous based on RFC 2119. I'd prefer a separate review of the usage of imperatives outside of this PR.

Is it valid for @default_slice to be something other than an array? (e.g., a simple value, if data has only one dimension).

Yes to your specific example, a single value if the data has only one dimension. My understanding of NeXus is that this doesn't generally need to be specified, but I could be wrong here.

Is it valid for @default_slice to have length different from the number of dimensions in the data?

No. See "should" above.

@default_slice is defined as having type NX_CHAR_OR_NUMBER. Does this mean it is valid for @default_slice to be an array of integers (rather than an array of strings)?

NX_CHAR_OR_NUMBER is a new term defined here as "Any valid character string or NeXus number representation". So a list of integers is inherently included. Technically this should be NXCHAR_OR_INT I suppose...

Given @default_slice as an array of strings, how are numerical value distinguished from an axis names?

This is a good question. The second example lists @default_slice = [".", "2", ".", "."] as a possibility. I had presumed someone writing software to support this would attempt a lookup for "2" in the AXISNAME channel, not find it, and fall back to see if "2" can be parsed as a number. But this could be more clear in the documentation. Note you cannot mix integers and strings in an HDF5 array, so using "." in an array with integers for @default_slice has to be done this way.

What are the processing expectations if AXISNAME is an array of NX_CHAR that contains an element "." and the @default_slice array contains the element "."?

"." in @default_slice has a reserved meaning so the elements of AXISNAMEdon't matter. If you have "." as a value in AXISNAME, you won't be able specify it in @default_slice.

What are the processing expectations if AXISNAME is an array of NX_CHAR, the first element is "2" and the @default_slice array contains the element "2"?

Order of operations states that "2" will be searched for in AXISNAME and the index of "2" in AXISNAME will then be the slice used for @default_slice

What are the processing expectations if AXISNAME is an array of NX_CHAR with two elements having the same value and that value is specified in @default_slice?

This is undefined.

I really appreciate these comments. Of these, I'd like to add "order of operations" to @default_slice, to clarify that AXISNAME should be searched before a CHAR array is attempted to be parsed. However, I'd like to do that in a separate PR, as I think it's clarification, and not really a change that needs a vote. Do you agree?

For the rest, I think we have to be careful to avoid overdefining terms. I don't think we need to add these clarifications to the NXdata spec unless someone asks for them. Do you agree?

prjemian commented 9 months ago

Since "2" is not a valid name for a group or field, that logic could be included and the search for such an HDF5 address avoided.

prjemian commented 9 months ago

Same for ".".

phyy-nx commented 9 months ago

Ah good stuff @prjemian. Thanks.

phyy-nx commented 8 months ago

Vote has passed, thanks all.

phyy-nx commented 8 months ago

@paulmillar I believe @prjemian answered your questions best when he pointed out that "2" and "." are not legal AXISNAMEs. Is there anything in your comment you think should be made into an issue?