nexusformat / definitions

Definitions of the NeXus Standard File Structure and Contents
https://manual.nexusformat.org/

add base class for X-ray element and edge #1293

Open newville opened 1 year ago

newville commented 1 year ago

This adds a base class for X-ray element and edge, as will be used in modifications to the NXxas definition (coming soon).

newville commented 1 year ago

This PR now includes a proposed change to the NXxas definition, partly discussed in #1011.

I realize that this is not the final definition and that tying it to the addition of NXxrayedge may need re-basing and/or changing this PR.

I am not very confident in whether the XAS definition here matches the NeXuS expectations, but perhaps it is enough to re-kindle the discussion and review process.

phyy-nx commented 1 year ago

Thanks @newville

@nexusformat/developers who on the NIAC knows about XAS and can contribute a review here?

woutdenolf commented 1 year ago

I can review, I believe @benajamin knows this stuff as well.

woutdenolf commented 1 year ago

@newville FYI to syntax check and build the docs locally, you can do this

make install
make local

Could you also rebase your branch on the main?

woutdenolf commented 1 year ago

Thanks @newville for taking the initiative on this. My 5 cents:

Data fields

This is what we had originally:

/ENTRY/INSTRUMENT/monochromator/energy
/ENTRY/INSTRUMENT/incoming_beam/data
/ENTRY/INSTRUMENT/absorbed_beam/data
/ENTRY/MONITOR/data

/ENTRY/DATA/energy -> link
/ENTRY/DATA/absorbed_beam -> link

This is the proposal:

/ENTRY/INSTRUMENT/monochromator/energy
/ENTRY/INSTRUMENT/i0/data
/ENTRY/INSTRUMENT/itrans/data
/ENTRY/INSTRUMENT/ifluor/data
/ENTRY/INSTRUMENT/irefer/data
/ENTRY/SCAN/data
/ENTRY/SCAN/column_labels

/ENTRY/DATA/energy -> link
/ENTRY/DATA/i0 -> link
/ENTRY/DATA/itrans -> link
/ENTRY/DATA/ifluor -> link
/ENTRY/DATA/irefer -> link
/ENTRY/DATA/rawdata -> link

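Some thoughts

All of these fields are currently required (already the case in the original NXxas)
Mixing the 2D signal rawdata with 1D signals i0, it, ... in one NXdata is not going to work
The 2D table data rawdata[nP,nCol] with column names column_labels[nCol] is unusual for NeXus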

Data table

Concerning the 2D data table, suppose you have

rawdata[1000,6]
column_labels[6] = ["energy", "itrans", "i0", "irefer", "mycounter1", "mycounter2"]

Instead I would do this:

data:NXdata
   @axes = ["energy"]
   @signal = "itrans"
   @auxiliary_signals = ["i0", "irefer", "mycounter1", "mycounter2"]
   energy: float[1000]
   i0: float[1000]
   irefer: float[1000]
   itrans: float[1000]
   mycounter1: float[1000]
   mycounter2: float[1000]

In other words, the NXdata group already allows you to add signals like mycounter1 which are not explicitly described in the NXxas application definition. Of course this also means that any data processing software has no way of using those signals other than plotting (because their meaning is not explicitly provided).

Side note: the original NXxas definition says all NXdata fields have to be links. I see this in other application definitions as well. Imo there is no need to specify whether this is a link to some NXdetector data or whether the data is provided directly (no link).

Further discussion

I'm fine with breaking the current NXxas application definition (e.g. incoming_beam -> i0). However, this gives us the opportunity to consider supporting many different types of XAS (XANES, EXAFS, XES, RXES, RIXS, IXS, ...) and their different modes (emission vs. absorption, etc.). For example the names i0, itrans and irefer (traditional XANES/EXAFS) may not be appropriate if we want to cover several XAS techniques.

I will bring this up with our spectroscopy beamlines and make a list of techniques and the data you need for data processing.

woutdenolf commented 1 year ago

Ping @mretegan @maurov

Related issues: #1011

woutdenolf commented 1 year ago

@hgoerzig is working on RIXS. Perhaps it can be merged with the NXxas refactoring?

PeterC-DLS commented 1 year ago

Having done data reduction for RIXS, it is a very different technique (at I21, Diamond and ID32, ESRF). So its application definition probably does not overlap with XAS.

PeterC-DLS commented 1 year ago

I've asked some colleagues at Diamond to have a read through this PR too.

jacobfilik commented 1 year ago

I am happy to have a chat with the spectroscopy beamlines at DLS, but I do have a comment first:

The treatment of XAS data to me highlights the same issues with NeXus I have always had - The primary data the user wants to see is actually processed data, and to reflect the provenance of the data, there should be an NXprocess describing how the absorbance or fluorescence XANES/EXAFS was generated and which of the raw datasets (which may or may not have their own NXdata, depending how interesting they are to the user) were used, and what corrections have been applied.

I don't know how you are supposed to use NeXus to describe how this "signal" came from "processing" data from this/these "detectors".

I can see this becoming more important when list/event mode detectors are used for fluorescence and the actual structure of the raw data doesn't strongly correlate with the XAS spectrum.

woutdenolf commented 1 year ago

@jacobfilik True, the data under the NXdetector groups could actually be calculated. Perhaps these data fields can link to fields under an NXprocess group that lives under the NXxas entry. From the application definition POV you don't care whether these NXdetectors are actual physical detector signals or not. And from the data provenance POV you have the NXprocess group which contains information on the origin of these "detector" signals.

newville commented 1 year ago

@woutdenolf @jacobfilik @PeterC-DLS Thanks -- happy to discuss in detail. I'll try responding to comments in a few messages. Sorry this one is so long.

From @woutdenolf comments:

Some thoughts
All of these fields are currently required (already the case in the original NXxas)

I agree with your table of "current" and "proposed" fields, but I do not think that the old definition did actually specify all fields that are in the new definition. Well, unless you mean that one could implicitly add anything else, which doesn't seem very helpful. ;)

Mixing the 2D signal rawdata with 1D signals i0, it, ... in one NXdata is not going to work

This is why I have an NXscan instance with the 2D data table, as part of the data set which otherwise has a limited selection of 1D signals. That has to be allowed, no?

The 2D table data rawdata[nP,nCol] with column names column_labels[nCol] is unusual for NeXus

OK. Beamlines that collect scans of 1-D data (XAFS and related, essentially all Spec-running beamlines) all collect a data table with "nP" data points -- say X-ray energy, and "nCol" columns/channels/datastreams. Those columns may number in the 100s. They may vary from scan to scan. Sometimes column 19 is really, really the one you want.

Yes, we can reduce and/or curate data to select a few channels that become "i0", "itrans", "ifluor", "irefer", but that is often done in post-processing.

That's the data we have. We want to share it with other people. Having the raw data table is important to allow re-processing, perhaps changing dead-time corrections or throwing out 1 of 16 fluorescence channels.

Concerning the 2D data table, suppose you have

rawdata[1000,6]
column_labels[6] = ["energy", "itrans", "i0", "irefer", "mycounter1", "mycounter2"]

Instead I would do this:

data:NXdata
   @axes = ["energy"]
   @signal = "itrans"
   @auxiliary_signals = ["i0", "irefer", "mycounter1", "mycounter2"]
   energy: float[1000]
   i0: float[1000]
   irefer: float[1000]
   itrans: float[1000]
   mycounter1: float[1000]
   mycounter2: float[1000]

I would be -1 on this. First, it scales well to 8 columns, but not so well to 100. Second, any processing tool would really want to extract and use the table. And tools can read the string labels and show the corresponding columns to users to add, do basic math manipulations while reading and pre-processing steps. We have tools for this that deal with random ASCII data files -- using this format would be no problem at all ;).

Side note: the original NXxas definition says all NXdata fields have to be links. I see
this in other application definitions as well. Imo there is no need to specify whether this
is a link to some NXdetector data or whether the data is provided directly (no link).

I was under the impression that the NXdetector for, say "i0", has a "data" field -- that's the data array. We also want the NXxas to have a field "i0" that is that data array. That's either a link or a copy, no?

Further discussion
I'm fine with breaking the current NXxas application definition (e.g. incoming_beam -> i0).

Yeah, "incoming_beam" is weird, but "absorbed_beam" is completely baffling. Really, I do not know what that even means.

However, this gives us the opportunity to consider supporting many different types of XAS
(XANES, EXAFS, XES, RXES, RIXS, IXS, ...) and their different modes (emission vs. absorption, etc.).
For example the names i0, itrans and irefer (traditional XANES/EXAFS) may not be appropriate
if we want to cover several XAS techniques.

Sure. FWIW, we could definitely use "ifluor" to mean any measurement of fluorescence or emission, including HERFD or XES. Or electron-yield. Or optical luminescence. We tend to use "fluorescence" to be equivalent to "emission".

For some of these, one might want to include other dimensions of data too. For the HERFD data that I collect, it would be totally reasonable to add an area detector image for each energy point. RIXS is typically a 2D dataset, not 1D - I don't know if that sort of extensibility is easily accommodated.

And, some people do measure XAFS in full-field imaging mode, more similar to STXM.

I will bring this up with our spectroscopy beamlines and make a list of techniques and the data you need for data processing.

For you and @PeterC-DLS, I will also be bringing this up for conversation at a spectroscopy community meeting at an IUCr satellite meeting in August.

newville commented 1 year ago

@jacobfilik

The treatment of XAS data to me highlights the same issues with NeXus I have always had - The primary data the user wants to see is actually processed data, and to reflect the provenance of the data, there should be an NXprocess describing how the absorbance or fluorescence XANES/EXAFS was generated and which of the raw datasets (which may or may not have their own NXdata, depending how interesting they are to the user) were used, and what corrections have been applied.

Yeah, I have to admit to being completely mystified by the focus on "plottable" or "primary" data with NeXuS. We have a bunch of signal chains and we want to visualize them. XAFS is relatively easy (my day job is running an X-ray microprobe, sometimes with simultaneous maps of XRF and XRD spectra: the "plottable data"? Yes, all of it), and yeah we want to plot either -log(Itrans/I0) or Ifluor/I0. So what? We are not going to use a data format to plot data. The goal is to store and read the data.

I don't know how you are supposed to use NeXus to describe how this "signal" came from "processing" data from this/these "detectors".

Well, a detector has a data array that holds the signal the detector measured. We will use that to construct a data set to analyze and/or visualize. So, yes that data, especially in "ifluor", could be data calculated from a series of raw-data channels (say "sum of dead-time-corrected channels").

I can see this becoming more important when list/event mode detectors are used for fluorescence and the actual structure of the raw data doesn't strongly correlate with the XAS spectrum.

I don't think the measurement mode needs to matter. I measure data in "continuous scan mode" -- in the end there is a list of energy points and the signals for each.

But, yeah if you want to save full multi-channel XRF in list mode, that would be a totally different layout for the raw data.

woutdenolf commented 1 year ago

I would be -1 on this. First, it scales well to 8 columns, but not so well to 100.

Why not? You can have thousands, no problem.

And tools can read the string labels and show the corresponding columns to users to add

column_names = axes + signal + auxiliary_signals

The splitting allows for a default plot but for the rest it's the same.
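
As a minimal sketch of that concatenation in Python, assuming the NXdata attributes have already been read into a plain dict (the helper name `nxdata_column_names` is hypothetical, and attribute values read from a real file may need decoding first):

```python
def nxdata_column_names(attrs):
    """Return every field name an NXdata group advertises, axes first."""
    axes = list(attrs.get("axes", []))
    signal = attrs.get("signal")
    auxiliary = list(attrs.get("auxiliary_signals", []))
    # @signal holds a single name, so wrap it in a list before concatenating.
    return axes + ([signal] if signal is not None else []) + auxiliary


# Attribute values copied from the NXdata layout proposed earlier in the thread.
attrs = {
    "axes": ["energy"],
    "signal": "itrans",
    "auxiliary_signals": ["i0", "irefer", "mycounter1", "mycounter2"],
}
column_names = nxdata_column_names(attrs)
print(column_names)
```

The only difference from a flat list of column labels is that the default signal and the axis are singled out.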

We want to share it with other people. Having the raw data table is important to allow re-processing, perhaps changing dead-time corrections or throwing out 1 of 16 fluorescence channels.

Absolutely. From a programming perspective it is much easier imo to have a dictionary of 1D arrays than a 2D array with a list of column names. I bet any software immediately converts that 2D array with associated column labels into a dictionary or an object that resembles a dictionary (like a pandas dataframe).

I agree with your table of "current" and "proposed" fields, but I do not think that the old definition did actually specify all fields that are in the new definition.

I just wanted to summarize for future reference what the current NXxas data fields are and what the proposed NXxas data fields are.

Yeah, "incoming_beam" is weird, but "absorbed_beam" is completely baffling. Really, I do not know what that even means.

Absolutely. I would like to further discuss this though, especially when we think of supporting beyond the standard XANES/EXAFS data. Do i0, itrans, ifluor and irefer make sense? Could we do something more generic? Ultimately we have different mu-related signals and associated monitor signals. In the traditional I0/I1/I2 setup you have two mu-related signals {I1, mon=I0} and {I2, mon=I1}. In the case of fluorescence with multiple elements or fullfield, you can have thousands of mu-related signals (not necessarily thousands of HDF5 datasets if we allow >1D datasets, which we should allow imo).

maurov commented 1 year ago

@woutdenolf thanks for including me in this very interesting and important conversation. Unfortunately, I am busy on the beamline now and I will not be able to review this soon.

jacobfilik commented 1 year ago

Thank you for the replies and especially @newville thanks for starting this process.

Potentially it might be worth making a list of "things we expect this application definition to enable", to understand if we have a sensible structure with all the required information.

Largely "plotting a sensible thing" is handled by the base classes, so as far as I am concerned it comes for free. Whether the I0 or Iref data should be viewed as "interesting plottable data" and reside in an NXdata or "debug information if things don't look right" and should be somewhere in the Instrument specified by the application definition is a different call.

Reprocessing of data, like re-applying the deadtime correction, re-windowing/re-fitting the fluorescence is a good one, as is rejection of channels.

We frequently take a large collection of scans, possibly do some outlier rejection/glitch correction, average repetitions, run them through larch in a jupyter notebook to normalise/chi(k) transform, plot the edge position as a function of either (a) repetition - to check the sample is not changing with beam damage, or (b) time/environmental parameter - to see if something is happening to the sample during the experiment or compare different samples. Maybe check the reference foil to make sure there is no drift in energy, run PCA/factor analysis etc on the flattened normalised data etc.

What I would like from an application definition is to be able to run the same notebook on the 3 beamlines that do this, but currently all call their NXdata groups different names. I would also like to be able to make a UI in which, when I have derived some data (like the edge position, or a PCA score value), I can easily navigate from that specific data point, through the normalised data to the raw data that generated it (without relying on the raw hdf5 dataset ids to connect NXdata groups to NXdetectors). Obviously it would be brilliant if our data "just worked" with other people's software and vice versa.

I am happy to get permission to share some of the files generated by the different XAS measurements here at Diamond and do some work to "repackage" them to match potential application definition suggestions - I think the scientists would be keen on this, especially before Q2XAFS 2023 - but are these things often designed in the comments of a pull request?

newville commented 1 year ago

@woutdenolf

Thanks. Again, sorry this is so long...

I would be -1 on this. First, it scales well to 8 columns, but not so well to 100.

Why not? You can have thousands, no problem.

Well, you then have 100 or 1000 HDF5 datasets in a group with unpredictable names instead of one group called "data" that had a corresponding "column labels" string array to identify the columns. That is, you wouldn't have to look for groups by name with names like "feka_mca1", "feka_mca2", ..., you would find the indices of those strings in the column label array and use the indices to access the columns.

Like, a data collection program could create a folder for each scan and write 1000 text files each holding 1 array for 1 data channel. That would be acceptable. But the alignment of the data points is then only implied (and maybe even made less clear). A single file with multiple columns appears to have been the choice of a majority of people doing this ;).

And tools can read the string labels and show the corresponding columns to users to add

column_names = axes + signal + auxiliary_signals

The splitting allows for a default plot but for the rest it's the same.

OK, I don't have a strong preference. The whole "default plot" seems not that important to me.

We want to share it with other people. Having the raw data table is important to allow re-processing, perhaps changing dead-time corrections or throwing out 1 of 16 fluorescence channels.

Absolutely. From a programming perspective it is much easier imo to have a dictionary of 1D arrays than a 2D array with a list of column names. I bet any software immediately converts that 2D array with associated column labels into a dictionary or an object that resembles a dictionary (like a pandas dataframe).

Well, that's one way to do it. But you can construct that either from a 2D table with columns names and indices or by storing many 1D arrays. If a dictionary was an option, I agree that would be better. With HDF5 storage it is not an option and one would have to loop over the datasets within a group, check that they have data (because some datasets in "Scan Data" might be strings or scalars or other things), and then extract the data into that dictionary. With a single HDF5 dataset that has a consistent name (say, "data") and a single array of strings, also with a consistent name like "column_labels", then that dictionary can be constructed just as easily.
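
A minimal sketch of that construction, assuming the 2D table and its labels have already been read from the file into plain Python lists (the helper name `table_to_columns` and the sample numbers are hypothetical):

```python
def table_to_columns(data, column_labels):
    """Map each column label to its list of values from a row-major table."""
    if len(set(column_labels)) != len(column_labels):
        raise ValueError("column labels must be unique")
    for row in data:
        if len(row) != len(column_labels):
            raise ValueError("row width does not match the number of labels")
    # zip(*data) transposes the rows into columns.
    return {label: list(col) for label, col in zip(column_labels, zip(*data))}


# Hypothetical 3-point scan, shaped like data[nP, nCol] with column_labels[nCol].
column_labels = ["energy", "itrans", "i0"]
data = [
    [7100.0, 50.0, 100.0],
    [7110.0, 40.0, 100.0],
    [7120.0, 20.0, 100.0],
]
columns = table_to_columns(data, column_labels)
itrans = columns["itrans"]
```

Because every row has the same width, the alignment of the data points is guaranteed by the table itself rather than checked per dataset.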

In fact, I have done both approaches (for XRF ROI images). The 2D data method is actually significantly more efficient. And for "raw scan data" that is sort of in the category of "someone might want to go back and reprocess this", I just do not see why you want 100 datasets that are not explicitly spelled out.

I would say that a pandas dataframe, or anything like a CSV table, maps better to a 2D array with labels than to a series of 1D arrays, where column alignment is only implied and might not be obvious to everyone.

I agree with your table of "current" and "proposed" fields, but I do not think that the old definition did actually specify all fields that are in the new definition.

I just wanted to summarize for future reference what the current NXxas data fields are and what the proposed NXxas data fields are.

OK.

Yeah, "incoming_beam" is weird, but "absorbed_beam" is completely baffling. Really, I do not know what that even means.

Absolutely. I would like to further discuss this though, especially when we think of supporting beyond the standard XANES/EXAFS data. Do i0, itrans, ifluor and irefer make sense?

Well, there is precedent for these. Here, I tried to follow the "XAS Data Interchange" format, an ASCII format mostly defined by Bruce Ravel (@bruceravel), but maybe with a bit of discussion and help from others. This was discussed in a 2012 publication (https://doi.org/10.1107/S0909049512036886) that came out of an XAFS workshop, and then in a bit more depth and final form in a 2016 conference proceeding (https://doi.org/10.1088%2F1742-6596%2F712%2F1%2F012148).

I used names based on the names in XDI (see https://github.com/XraySpectroscopy/XAS-Data-Interchange/blob/master/specification/dictionary.md#defined-items-in-the-column-namespace)

In fact, I would say that easy conversion between XDI and NeXuS NXxas would be a top priority. If we can get to a point where facilities and beamlines are sharing data in either of these formats, we would then be pretty confident that the data will be usable by downstream applications.

FWIW, the XDI definition also makes clear that mono d-spacing is really, really useful and, if not exactly required, at least indicated (so that historical data marked only as "Si(111)" was acceptable).

Could we do something more generic? Ultimately we have different mu-related signals and associated monitor signals. In the traditional I0/I1/I2 setup you have two mu-related signals {I1, mon=I0} and {I2, mon=I1}. In the case of fluorescence with multiple elements or fullfield, you can have thousands of mu-related signals (not necessarily thousands of HDF5 datasets if we allow >1D datasets, which we should allow imo).

Yeah, XDI uses i0 / itrans / irefer instead of i0 / i1 / i2 partly because it allows ifluor and is a bit clearer on intent. Also, some people will record a reference signal using scatter and so the reference mu would not be -log(irefer/itrans) but (typically) irefer/i0. That can be commented (or made an enumeration).
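
To make those conventions concrete, here is a minimal sketch of the three calculations in Python. The function names and the `mode` values "transmission" and "scatter" are hypothetical stand-ins for whatever enumeration a definition might adopt, and the numbers are made up:

```python
import math


def mu_transmission(i0, itrans):
    """Transmission mode: mu = -log(itrans / i0) at each energy point."""
    return [-math.log(t / i) for i, t in zip(i0, itrans)]


def mu_fluorescence(i0, ifluor):
    """Fluorescence (or emission) mode: mu is proportional to ifluor / i0."""
    return [f / i for i, f in zip(i0, ifluor)]


def mu_reference(i0, itrans, irefer, mode="transmission"):
    """Reference signal, measured behind the sample or from scatter."""
    if mode == "transmission":
        return [-math.log(r / t) for t, r in zip(itrans, irefer)]
    if mode == "scatter":
        return [r / i for i, r in zip(i0, irefer)]
    raise ValueError(f"unknown reference mode: {mode!r}")


# Made-up intensities for a two-point scan.
i0 = [100.0, 100.0]
itrans = [50.0, 25.0]
irefer = [25.0, 5.0]
ifluor = [10.0, 20.0]

mu_t = mu_transmission(i0, itrans)
mu_f = mu_fluorescence(i0, ifluor)
mu_r = mu_reference(i0, itrans, irefer, mode="transmission")
mu_s = mu_reference(i0, itrans, irefer, mode="scatter")
```

The point of recording the reference mode is that the same `irefer` array leads to different spectra depending on which formula applies.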

I do get that "mon" is more common for NeXuS, and it would be fine to make "i0" and "mon" or "monitor" be aliases of one another.

For multi-element fluorescence / full-field / related spectroscopies, let me separate them a bit:

Multi-element fluorescence is extremely common, typically more common than all other modes combined. Here, "ifluor" would be intended to hold the sum of appropriate ROIs, ideally dead-time corrected. But the raw data would have the individual signals for ROIs (possibly multiple of these per detector channel) and data for dead-time corrections (a few different ways to encode). It is pretty common for XAFS software to need to be able to help the user redo the corrections and sums (say, throwing out a bad channel). In the end, you end up with one fluorescence signal, which can go into "ifluor".
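
A minimal sketch of that reduction, assuming the dead-time correction is available as a multiplicative factor per channel and per energy point (only one of the possible encodings; the channel names, the helper `sum_fluorescence`, and the numbers are all hypothetical):

```python
def sum_fluorescence(roi_counts, correction_factors, bad_channels=()):
    """Sum dead-time-corrected ROI counts over all channels not marked bad.

    roi_counts and correction_factors map a channel name to a list with one
    value per energy point.
    """
    good = [name for name in roi_counts if name not in bad_channels]
    if not good:
        raise ValueError("no fluorescence channels left after rejection")
    npts = len(roi_counts[good[0]])
    ifluor = [0.0] * npts
    for name in good:
        counts = roi_counts[name]
        factors = correction_factors[name]
        if len(counts) != npts or len(factors) != npts:
            raise ValueError(f"channel {name!r} has the wrong number of points")
        for k in range(npts):
            ifluor[k] += counts[k] * factors[k]
    return ifluor


# Made-up raw data: three detector channels, two energy points.
roi_counts = {
    "mca1": [10.0, 20.0],
    "mca2": [12.0, 22.0],
    "mca3": [0.0, 900.0],  # a misbehaving channel
}
correction_factors = {
    "mca1": [1.25, 1.25],
    "mca2": [1.0, 1.0],
    "mca3": [1.0, 1.0],
}
ifluor_all = sum_fluorescence(roi_counts, correction_factors)
ifluor = sum_fluorescence(roi_counts, correction_factors, bad_channels=("mca3",))
```

Redoing the sum with a different `bad_channels` selection is only possible if the per-channel raw data is stored alongside the final "ifluor".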

Similarly, "HERFD" is becoming more common but also sort of is covered by having an "ifluor".

For full-field or some ways of measuring IXS data, you basically have a set of images per energy point. It might be OK to then allow "itrans" or "ifluor" to be a 2D array, but I sort of think that might be a different thing entirely, and more like STXM.

I'm OK with trying to be accommodating of those modes, but also a single XAS scan is sort of common enough (and potentially complicated enough) that having a definition that covers that well seems worth getting right.

Maybe an NXxas_fullfield definition would just be NXxas with itrans being a 2D array?

newville commented 1 year ago

@jacobfilik

Potentially it might be worth making a list of "things we expect this application definition to enable", to understand if we have a sensible structure with all the required information.

Largely "plotting a sensible thing" is handled by the base classes, so as far as I am concerned it comes for free. Whether the I0 or Iref data should be viewed as "interesting plottable data" and reside in an NXdata or "debug information if things don't look right" and should be somewhere in the Instrument specified by the application definition is a different call.

Reprocessing of data, like re-applying the deadtime correction, re-windowing/re-fitting the fluorescence is a good one, as is rejection of channels.

My concern is almost entirely "how to store the data so that someone far away and a decade from now can make sense of it".

I don't really understand the focus on plotting. NeXus stores data. So do CSV, netCDF, HDF5, and XML. None of them even consider the visual display of the stored data. It is mentioned all the time for NeXus.

We frequently take a large collection of scans, possibly do some outlier rejection/glitch correction, average repetitions, run them through larch in a jupyter notebook to normalise/chi(k) transform, plot the edge position as a function of either (a) repetition - to check the sample is not changing with beam damage, or (b) time/environmental parameter - to see if something is happening to the sample during the experiment or compare different samples. Maybe check the reference foil to make sure there is no drift in energy, run PCA/factor analysis etc on the flattened normalised data etc.

Yep. I think the goal here (at least, my goal) is to store data (~"raw data" for some definition of "raw") in a way that can be easily exchanged across time and space. With that, applications ought to be able to handle these files easily, and we will make sure that any Python application actually can.

What I would like from an application definition is to be able to run the same notebook on the 3 beamlines that do this, but currently all call their NXdata groups different names.

Do you have examples of XAS data in NeXus-like formats? I have asked a few people, but I don't think I have any that use the current NXxas definition. Can you send some?

I would also like to be able to make a UI in which, when I have derived some data (like the edge position, or a PCA score value), I can easily navigate from that specific data point, through the normalised data to the raw data that generated it (without relying on the raw hdf5 dataset ids to connect NXdata groups to NXdetectors). Obviously it would be brilliant if our data "just worked" with other people's software and vice versa.

A common definition of "NeXus-encoded XAS data" ought to help with that.

I am happy to get permission to share some of the files generated by the different XAS measurements here at Diamond and do some work to "repackage" them to match potential application definition suggestions - I think the scientists would be keen on this, especially before Q2XAFS 2023 - but are these things often designed in the comments of a pull request?

Yes, please do. We will discuss this at Q2XAFS 2023 in Melbourne. I expect that a definition derived from XDI like this would be acceptable to everyone there.

woutdenolf commented 1 year ago

Well, you then have 100 or 1000 HDF5 datasets in a group with unpredictable names instead of one group called "data" that had a corresponding "column labels" string array to identify the columns.

unpredictable names? they are provided in axes + signal + auxiliary_signals which is the same as column_names.

newville commented 1 year ago

Well, you then have 100 or 1000 HDF5 datasets in a group with unpredictable names instead of one group called "data" that had a corresponding "column labels" string array to identify the columns.

unpredictable names? they are provided in axes + signal + auxiliary_signals which is the same as column_names.

I meant that names might be "feka_mca1", "feka_mca2" in one file and then "Fe_mca1", "Fe_mca2" in another, and "Pt La ch1", "Pt La ch2" in another. Seeing that the data has type NXxas would mean that some names ("i0", "itrans", "irefer") can be expected to be present, but that many other datasets (quite possibly all datasets in "NXscan") would not have predictable names until reading some other portion of the file. With a 2D table, that would be "data" in the NXscan. We can also put in the file description: there may be a thing called "NXscan/data". That will be 2D, nP x nC and hold raw data arrays. Column names can be read from "NXscan/column_labels". That just seems simpler to me.

woutdenolf commented 1 year ago

I meant that names might be "feka_mca1", "feka_mca2" in one file and then "Fe_mca1", "Fe_mca2" in another, and "Pt La ch1", "Pt La ch2" in another

Ok I see now. Thanks for clarifying. Depending on the edge you are probing, the column names are different, typically in the fluorescence case.

I'm not sure however this is a good idea. The edge we are probing is already saved somewhere else. A data analysis software just needs to know which dataset(s) are the fluorescence signal (a single channel, 2 channels like in the example or many channels like in full-field). Allowing these dataset(s) to be named whatever will require a user selecting which columns are the fluorescence signals while the software should be able to infer this from the application definition. If we don't give explicit meaning to datasets (essentially how does this signal relate to the linear attenuation coefficient for traditional XANES/EXAFS), we don't need NXxas at all. We could just save the 2D table in an NXdata under an NXentry and be done with it.

There could be other ways to tag signals with meaning while still allowing random names. Nevertheless, adding meaning to signals is necessary imo. And it does not prevent you from adding other signals with unknown meaning, although this is not very useful for software that needs to process NXdata data autonomously.

It is pretty common for XAFS software to need to be able to help the user redo the corrections and sums

Indeed. So having an ifluor or random names like "feka_mca1", "feka_mca2" is not helping. For tomography we tag signals to be dark, flat, projection etc. We should do the same here (fluo, livetime, transmission, ...).
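
A minimal sketch of what such tagging buys a reader, assuming each field carries a role tag stored somewhere alongside its name. How the tag would be stored is not decided in this thread, the tag values "energy" and "monitor" are hypothetical additions to the ones listed above, and `fields_with_role` is a hypothetical helper:

```python
def fields_with_role(roles, wanted):
    """Return the field names tagged with the wanted role, in stored order."""
    return [name for name, role in roles.items() if role == wanted]


# Two hypothetical files that name their fluorescence channels differently
# but tag them identically.
roles_file_a = {
    "energy": "energy",
    "i0": "monitor",
    "feka_mca1": "fluo",
    "feka_mca2": "fluo",
}
roles_file_b = {
    "energy": "energy",
    "i0": "monitor",
    "Pt La ch1": "fluo",
    "Pt La ch2": "fluo",
}

fluo_a = fields_with_role(roles_file_a, "fluo")
fluo_b = fields_with_role(roles_file_b, "fluo")
```

With the tag in place, software can find the fluorescence channels in both files without a user having to pick columns by name.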

Side note: NXfluo could benefit from the same refactoring. It is just as useless as the current NXxas.

So I see two separate but perhaps related questions:

  1. How do we give meaning to all signals (by giving them fixed names or by other means)
  2. How do we divide signals over datasets or fields (dataset is the HDF5 term, field is the NeXus term)

As for point 1, there are two proposals for the NXdata group in NXxas (not talking about the NXdetector's here). The first one is a 2D table with string-valued axes along the second dimension (requires https://github.com/nexusformat/definitions/pull/1246 to be accepted first)

data:NXdata
   @axes = [".", "column_names"]
   @signal = "raw_data"
   column_names = ["energy", "i0", "itrans", "feka_mca1", "feka_mca2"]
   raw_data: float[1000, 5]

The second one is a more traditional NeXus approach

data:NXdata
   @axes = ["energy"]
   @signal = "itrans"
   @auxiliary_signals = ["i0", "feka_mca1", "feka_mca2"]
   energy: float[1000]
   i0: float[1000]
   itrans: float[1000]
   feka_mca1: float[1000]
   feka_mca2: float[1000]

In terms of reading performance, approach 1 is probably faster if you read everything in memory at once. However since signals are typically saved in individual datasets by the acquisition system, raw_data will have to be a virtual dataset in which case I'm not so sure it will be faster than approach 2. Approach 2 is probably faster if you only read the columns the user is interested in. No need for slicing when reading, especially tricky when keeping chunk size, chunk caching and compression in mind.

woutdenolf commented 1 year ago

@newville @jacobfilik @PeterC-DLS @hgoerzig @benajamin Very interesting discussions but github comments are perhaps too limited for this purpose. We should find another way to organize this. The scope is XAS, which means NXxas and potentially other (new) application definitions (like the IXS family). I would even include NXfluo as well in this effort.

I propose this: we each take some time to structure some actual datasets from our beamlines as we or our scientists would intuitively want to do it. XANES, EXAFS, XES, RXES, RIXS, IXS, ... fullfield or not ... whatever we think is related enough. This will take time (several months for sure) but then we can have zoom meeting(s) and start from concrete examples. Everyone explains what information can/needs to be inferred from their application definition by machines (i.e. without human interference). Other considerations are ergonomics for readers and writers, read/write performance, etc.

woutdenolf commented 1 year ago

If a dictionary was an option, I agree that would be better. With HDF5 storage it is not an option and one would have to loop over the datasets within a group, check that they have data (because some datasets in "Scan Data" might be strings or scalars or other things), and then extract the data into that dictionary.

When using h5py, an HDF5 group behaves like a dictionary. Just to be clear, you don't need to loop over the datasets to discover them. This information is provided by the NXdata attributes.
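A minimal sketch of that lookup with h5py, assuming the group follows the usual NXdata convention of `@signal`, `@auxiliary_signals` and `@axes` attributes. The helper names `_as_list` and `read_nxdata` and the small in-memory file are hypothetical, written only for this illustration.

```python
import h5py
import numpy as np


def _as_list(value):
    """Normalize a string or string-array attribute to a list of str."""
    items = np.atleast_1d(value).tolist()
    return [s.decode() if isinstance(s, bytes) else str(s) for s in items]


def read_nxdata(group):
    """Return (signals, axes) as dicts, using only the NXdata attributes."""
    signal_names = _as_list(group.attrs["signal"])
    signal_names += _as_list(group.attrs.get("auxiliary_signals", []))
    # "." marks a dimension without an axis dataset, so it is skipped.
    axis_names = [a for a in _as_list(group.attrs.get("axes", [])) if a != "."]
    signals = {name: group[name][()] for name in signal_names}
    axes = {name: group[name][()] for name in axis_names}
    return signals, axes


str_dt = h5py.string_dtype()
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    data = f.create_group("entry/data")
    data.attrs["NX_class"] = "NXdata"
    data.attrs["signal"] = "itrans"
    data.attrs["auxiliary_signals"] = np.array(["i0"], dtype=str_dt)
    data.attrs["axes"] = np.array(["energy"], dtype=str_dt)
    data["energy"] = np.linspace(7000.0, 7200.0, 5)
    data["i0"] = np.full(5, 2.0)
    data["itrans"] = np.full(5, 1.0)
    data["comment"] = "a string dataset that is not a signal"

    # The group can be indexed like a dict, but no loop over its members
    # is needed: the attributes name the datasets that hold data.
    signals, axes = read_nxdata(data)
```

The `comment` dataset is never opened, because it is not listed in any of the attributes.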

newville commented 1 year ago

@woutdenolf Thanks, yeah I am finding this sort of exhausting. I am trying to help the XAS community out by fixing the NXxas definition so that it is not absolutely useless.

I meant that names might be "feka_mca1", "feka_mca2" in one file and then "Fe_mca1", "Fe_mca2" in another, and "Pt La ch1", "Pt La ch2" in another

Ok I see now. Thanks for clarifying. Depending on the edge you are probing, the column names are different, typically in the fluorescence case.

No, I mean that depending on the beamline, the time of day, the edge, or the person who set up the scan, the column names might be different. "FeKa_mca1", "Fe chan1", "Fe Ka chan1", "Fe K_alpha1 ch1", ... can all mean "The Fe Kalpha ROI for detector channel 1 out of 16". The number of variations is endless. They will all be used. You will not have a machine reading these.

I'm not sure however this is a good idea.

So, I am finding this all very confusing. What, specifically, do you find is not a good idea? What I'm trying to say is

The raw data table should be saved and the column labels used should be saved.

You seemed to be weirdly focused on that and sort of misunderstanding the intent.

The edge we are probing is already saved somewhere else.

So? These are the labels for the columns of raw data. Some of them may be names of fluorescence ROI channels. Some of those might include the name of the edge. That is often how data is collected.

A data analysis software just needs to know which dataset(s) are the fluorescence signal (a single channel, 2 channels like in the example or many channels like in full-field). Allowing these dataset(s) to be named whatever will require a user selecting which columns are the fluorescence signals while the software should be able to infer this from the application definition.

Ugh. These are labeling the columns of the data table for the raw data. That might be used to reprocess the data. The user or downstream apps might want to do that.

If we don't give explicit meaning to datasets (essentially how does this signal relate to the linear attenuation coefficient for traditional XANES/EXAFS), we don't need NXxas at all.

All the datasets have explicit meaning. There are already other datasets explicitly labeled "i0", "itrans", "ifluor", and "irefer" with clear meanings. The 2D data table here gives the full, original scan data table, in case that might be useful.

We could just save the 2D table in an NXdata under an NXentry and be done with it.

Um, yeah, that is sort of what we're doing. Putting a 2D table under NXscan -- the raw data table for the scan.

There could be other ways to tag signals with meaning while still allowing random names. Nevertheless, adding meaning to signals is necessary imo. And it does not prevent you from adding other signals with unknown meaning, although this is not very useful for software that needs to process NXdata data autonomously.

The data table has meaning, and the columns are labeled. I think I may be out of different ways to explain this.

Indeed. So having an ifluor or random names like "feka_mca1", "feka_mca2" is not helping.

That is completely untrue. Having "ifluor" and/or "itrans" is completely clear. It is the definition.

The raw data table may also consist of columns labeled with names that are not so easy to parse and use. Those are not meant to be the primary data but could be very helpful if the user or downstream app wants to re-process some of the data, especially the data that went into "ifluor".

For tomography we tag signals to be dark, flat, projection etc. We should do the same here (fluo, livetime, transmission, ...).

.... yeah, except a) we already know that we want to call those "ifluor" and "itrans" because we have existing definitions we are trying to follow b) we know that there is no standard name or even way to express dead-time information like "live time". It can be "live time" and "real time", it can be "output count rate" and "input count rate", some detectors have a second "fast live/deadtime", some write only "output count rate" and a scalar value of "tau", and some detectors simply write out a "deadtime correction factor".
Some people do not use any of these, but scale by the intensity of some other ROI. Really, all of these are used and not wrong. This is why we want a raw data table with the names of the columns so that the knowledgeable downstream user is able to re-do or at least assess how this was done. This will not be the normal process (that would be just using the "ifluor" channel saved elsewhere), but the raw data table will allow that.

I think you may need to trust that we have given this some thought ;).

Side note: NXfluo could benefit from the same refactoring. It is just as useless as the current NXxas.

Yep, I agree. As it turns out, I don't know of a NeXus definition that I would use for any data I take ;)

So I see two separate but perhaps related questions:

  1. How do we give meaning to all signals (by giving them fixed names or by other means)

All signals have meaning. All have fixed names.

  2. How do we divide signals over datasets or fields (dataset is the HDF5 term, field is the NeXus term)

As for point 1, there are two proposals for the NXdata group in NXxas (not talking about the NXdetectors here). The first one is a 2D table with string-valued axes along the second dimension (requires #1246 to be accepted first)

Do you mean NeXus does not support a list of strings? Sheesh!

data:NXdata
   @axes = [".", "column_names"]
   @signal = "raw_data"
   column_names = ["energy", "i0", "itrans", "feka_mca1", "feka_mca2"]
   raw_data: float[1000, 5]

Yes, definitely this. Well, except it could be called "NXscan/data" and is only explaining the raw scan data, not the other data arrays provided in the file.

The second one is a more traditional NeXus approach

data:NXdata
   @axes = ["energy"]
   @signal = "itrans"
   @auxiliary_signals = ["energy", "i0", "feka_mca1", "feka_mca2"]
   raw_data: float[1000]
   i0: float[1000]
   itrans: float[1000]
   feka_mca1: float[1000]
   feka_mca2: float[1000]

Well, I don't really understand what you would do with the names for the other 50 to 100 columns. Or why someone would want that. We have a small number of 1D data arrays that we definitely want available: "energy", "i0", "itrans", "ifluor", and "irefer" (and, just to be clear: really those names). We also have a 2D set of "raw data" that we want to save. I will not be able to say this any other way.

In terms of reading performance, approach 1 is probably faster if you read everything in memory at once.

Probably.

However since signals are typically saved in individual datasets by the acquisition system, raw_data will have to be a virtual dataset in which case I'm not so sure it will be faster than approach 2.

What acquisition system are you talking about? Data for scans like this are typically held in memory and/or streamed to disk (could be any number of types of databases) until the scan is complete when a file is then written (or if being streamed, closed). Individual channels will not be written separately, they will be read and/or saved and/or streamed on a row-by-row basis for either step or slew scans, and the integrity and synchronization of the row will be of paramount importance as data is collected and sent to disk.

I would definitely suggest that the raw scan data will be a real dataset. It could be saved row by row (think spec scan) or at once as a 2d table. Anyway, write speed is really not important here.
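A minimal sketch of the row-by-row option with h5py, using a dataset whose first axis can grow. The group name `scan`, the three column labels, and the fake readout values are placeholders chosen for this illustration, not part of the proposed definition.

```python
import h5py
import numpy as np

column_labels = ["energy", "i0", "itrans"]
ncols = len(column_labels)

# In-memory file so that the example leaves nothing on disk.
with h5py.File("scan.h5", "w", driver="core", backing_store=False) as f:
    scan = f.create_group("scan")
    scan.create_dataset(
        "column_labels", data=np.array(column_labels, dtype=h5py.string_dtype())
    )
    # Start with zero rows. maxshape=(None, ncols) leaves the first axis
    # unlimited, and h5py then stores the dataset in chunks automatically.
    table = scan.create_dataset(
        "data", shape=(0, ncols), maxshape=(None, ncols), dtype="f8"
    )

    for step in range(10):
        # Stand-in for one synchronized readout of all channels at one point.
        row = np.array([7100.0 + step, 1000.0, 500.0 - step])
        n = table.shape[0]
        table.resize(n + 1, axis=0)  # grow by one row
        table[n, :] = row            # write the whole row at once

    nrows = table.shape[0]
    last_row = table[nrows - 1, :]
```

Writing the whole table at once after the scan finishes would be a single `create_dataset("data", data=array_2d)` call instead of the loop.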

Approach 2 is probably faster if you only read the columns the user is interested in: there is no need for slicing when reading, which is especially tricky when you have to keep chunk size, chunk caching and compression in mind.

The raw data table for a scan will rarely exceed 200 columns and 1000 data points. That's 1.6 MB. Go ahead and increase it by 10x in each direction: the raw data table will still fit in memory, uncompressed.
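The arithmetic behind that estimate, as a small sketch. It assumes every value is stored as a 64-bit float (8 bytes), which the comment does not state explicitly; `table_size_mb` is a helper named here for illustration.

```python
BYTES_PER_VALUE = 8  # assumption: 64-bit floats


def table_size_mb(n_columns, n_points, bytes_per_value=BYTES_PER_VALUE):
    """Uncompressed size of a 2D table in megabytes (1 MB = 10**6 bytes)."""
    return n_columns * n_points * bytes_per_value / 1e6


typical = table_size_mb(200, 1000)    # 200 columns x 1000 points
scaled = table_size_mb(2000, 10000)   # 10x more in each direction

print(f"typical: {typical} MB, scaled: {scaled} MB")
```

With these assumptions the typical table is 1.6 MB and the scaled-up one is 160 MB.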

newville commented 1 year ago

I would like to come back to this discussion, partly because I will be presenting this effort at an XAFS workshop (https://www.ansto.gov.au/whats-on/q2xafs-2023-international-workshop-on-improving-data-quality-and-quantity-xafs) at the end of next week. From the point of view of "how to encode XAS data into NeXus, and is that a good idea?", I wrote some discussion at https://millenia.cars.aps.anl.gov/nxxas/, with worked examples of XAS data using the NeXus schema proposed here at https://millenia.cars.aps.anl.gov/nxxas/nexus_xas.html#worked-example-of-xdi-and-nexus-formatted-xas-data. Comments or suggestions on that (or on this PR) would be most welcome. I will also ask for input from the wider XAS community. I should say that although I have been somewhat skeptical for a long time, I have come to believe that using NeXus/HDF5 is actually the best way to share XAS data in databases, supplemental materials, online catalogs of data, and so on.

It is not clear to me what "merged" or "accepted" here means. Is the idea that some downstream software would validate against these definitions? Or are the definitions meant to be "canonical" but not necessarily the only acceptable definitions in use?

phyy-nx commented 1 year ago

@newville for merged and accepted, are you referring to getting this PR merged in? I think two things need to happen first. First, @woutdenolf and yourself should probably meet by Zoom to resolve the concerns listed above. I just suggest that as I bet it would be faster than back and forth here. Second, looking at your post above, I think getting some feedback from your community makes a lot of sense. From my end, I'm happy to see this PR take as long as needed to get it as right as possible.

Is the idea that some downstream software would validate against these definitions?

Yep, that's one of the central tenets of NeXus: we provide definitions and validation of NeXus files against those definitions.

Or are the definitions meant to be "canonical" but not necessarily the only acceptable definitions in use?

Canonical is a fine term, and "not necessarily the only definitions in use" is ok too. Facility-specific metadata can be added at will; NeXus is supposed to be flexible in that regard. The idea is that if there are broadly used terms that users would like added, they can be once they are widely adopted.

woutdenolf commented 6 months ago

After several discussions in the NXxas working group, we decided to start in a new PR https://github.com/nexusformat/definitions/pull/1347