Add support for the new EDAX .hd5 format

jlaehne commented 2 years ago

As brought up on gitter by @TommasoCostanzo, it would be nice to support the new EDAX .h5 (hdf5) file format for EDX measurements.

The main challenge is that one file can contain quite a number of scans. Hierarchically, the file is divided into samples, then areas, and each area can contain multiple spectra, linescans and/or maps. So we would need a good mechanism for choosing whether to import one or several elements from the file, and which ones exactly, and then for correctly transforming them into hyperspy object(s).
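A minimal sketch of such a selection mechanism, under stated assumptions: the hierarchy is modelled with nested dicts, and every group name and the `kind` labels are invented for illustration (they are not taken from a real EDAX file). h5py groups expose a similar dict-like interface, so the same traversal idea should carry over, but that is not tested here.

```python
from collections.abc import Mapping

# Hypothetical stand-in for an opened file: sample -> area -> measurement.
# All names and "kind" labels are invented for illustration.
EXAMPLE_FILE = {
    "Sample 1": {
        "Area 1": {"Spot 1": "spectrum", "Line 1": "linescan", "Map 1": "map"},
        "Area 2": {"Map 1": "map"},
    },
    "Sample 2": {"Area 1": {"Spot 1": "spectrum"}},
}


def iter_measurements(root, prefix=""):
    """Yield (path, kind) for every leaf below ``root``, depth first."""
    for name, node in root.items():
        path = f"{prefix}/{name}"
        if isinstance(node, Mapping):
            # Samples and areas are containers: descend into them.
            yield from iter_measurements(node, path)
        else:
            # Leaves are the individual measurements.
            yield path, node


def select_measurements(root, kinds=None, paths=None):
    """Return the paths to import, optionally filtered by kind and/or path."""
    selected = []
    for path, kind in iter_measurements(root):
        if kinds is not None and kind not in kinds:
            continue
        if paths is not None and path not in paths:
            continue
        selected.append(path)
    return selected


# List everything, or only the maps.
all_paths = select_measurements(EXAMPLE_FILE)
map_paths = select_measurements(EXAMPLE_FILE, kinds={"map"})
```

Listing all paths first and letting the user pass back a subset keeps the "what is in the file" step separate from the "convert to signals" step.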

As APEX-EBSD is also using .h5 (though the image files are saved separately in .up2 format), kikuchipy (@hakonanes) already seems to have some support for the new EDAX format? The hierarchy should also be similar between EDX and EBSD files.

hakonanes commented 2 years ago

The relevant lines in kikuchipy are https://github.com/pyxem/kikuchipy/blob/develop/kikuchipy/io/plugins/h5ebsd.py#L570.

The design of our reader isn't the best, since we have one H5ebsd reader, which can read Bruker's, EDAX' and kikuchipy's own HDF5 files (there aren't many commonalities between the formats other than that they are HDF5 files). It arguably should be three different classes inheriting commonalities from a fourth private class... We support returning multiple EBSD signals from one EDAX HDF5 file, if the user passes the scan group names within the file. Otherwise, the first scan encountered will be returned.
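The selection behaviour described above (explicit scan group names, otherwise the first scan) can be sketched as follows. This is not kikuchipy's actual code: the function name `choose_scans` and the example group names are hypothetical.

```python
def choose_scans(available, requested=None):
    """Return the scan group names to read.

    available: scan group names in the order they appear in the file.
    requested: names passed by the user, or None for the default.
    """
    if not available:
        raise ValueError("file contains no scans")
    if requested is None:
        # Default: only the first scan encountered is returned.
        return [available[0]]
    missing = [name for name in requested if name not in available]
    if missing:
        raise KeyError(f"scan groups not found in file: {missing}")
    return list(requested)


# Hypothetical group names, for illustration only.
scans_in_file = ["Scan 1", "Scan 2", "Scan 3"]
default_choice = choose_scans(scans_in_file)
explicit_choice = choose_scans(scans_in_file, ["Scan 3", "Scan 1"])
```

Failing loudly on unknown names, rather than silently skipping them, makes typos in group names visible to the user.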

I've stolen many of the HDF5 read/write functions from HyperSpy (thanks). Feel free to nick whatever you find useful.

reetuelzajoseph commented 11 months ago

The latest extension is .edaxh5. The structure for EBSD looks like this, similar to what @jlaehne described above:

For the EDS dataset the structure looks like this:

jlaehne commented 11 months ago

@mkuehbach is also working on reading the edaxh5 format for EDX and EBSD into Python in the context of https://github.com/FAIRmat-NFDI

mkuehbach commented 11 months ago

I have implemented Python code which parses all of the data nodes in the above-mentioned EDAXH5 HDF5 file(s). The key question is how to share such code and make it useful for hyperspy, @ericpre. A hyperspy APEX reader is, in my opinion, the most inclusive approach, so that all other projects can then just use it via hyperspy's parsing capabilities for such files. Sure, if projects would like to have fewer dependencies, one could go for an h5py/numpy-only parser, but I think that is niche. I rather think electron microscopists also need other functionality for their Python code, like the X-ray spectra database for EDS, which is why it makes much more sense to me to pay the price of having hyperspy as a dependency in one's tool. For this reason we currently have hyperspy as a dependency in pynxtools https://github.com/FAIRmat-NFDI/pynxtools for the NOMAD project https://nomad-lab.eu/prod/v1/staging/docs/developers.html

From what I currently see in the German National Research Data Infrastructure consortia landscape, for the majority of use cases there is likely not much interest in going too deep technically, i.e. in parsing every field and piece of information. Rather, people keep the EDAXH5 files for their records and copy over e.g. spectra or some metadata. EDAXH5 files should always be compressed: at least the examples which I have seen have a low information entropy. EDAX does not do in-place compression right now, which is also why files are huge.

Internally, EDAX mainly stores what I assume are raw C/C++ structures from APEX, dumped into HDF5 compound data types. De facto, this is one key reason one can say that EDAX does document their software in a very practical and useful way. But what do certain fields mean? This is not explained anywhere inside the HDF5 file.

The much more important practical question, though, is what the many terms in these fields mean. The complete glossary, based on examples, is parsed out here: https://github.com/FAIRmat-NFDI/em4nfdi Any community feedback to resolve terms in the glossary, via a simple text or Excel file, is appreciated, especially to clarify units etc. Ideally this should be merged back into a glossary of the EDAXH5 formats or, easier, representatives from EDAX could help here to avoid such reverse engineering.

@ericpre Do you think there is a better place where such a reverse-engineered Rosetta stone for EDAXH5 could be stored? If so, we should have a meeting with @reetuelzajoseph to make this happen.

Many fields and attributes are low-level details and parameters of algorithms or hardware details, some with unclear provenance as to how they end up in the EDAXH5 file when one uses certain versions of APEX. This raises the question of how trustworthy it is to have such data/metadata consumed by, or displayed in, research data management systems, knowledge graph implementations, or tools like hyperspy.

Three things need to happen next:

[x] Technically read EDAXH5: I have code for this, so it is a solved problem.
[ ] Identify what fields and terms mean conceptually. This is either a trivial task for EDAX representatives or an incomplete community-based reverse engineering effort. In case of the latter, feel free to open issues here https://github.com/FAIRmat-NFDI/em4nfdi
[ ] In parallel to identifying the meaning of terms, the following should happen for hyperspy: decide which data and metadata are relevant for hyperspy and pull them over into dictionaries.

@reetuelzajoseph and team: feel invited to contribute here. FYI: @sanbrock

hakonanes commented 11 months ago

@reetuelzajoseph and @mkuehbach, great that you want to push for parsers of EBSD and orientation data in RosettaSciIO. It would be good if this work was done in conjunction with moving parsers from kikuchipy to RosettaSciIO (which I plan to start on this autumn or, alternatively, can help people with moving them). This is to avoid duplication of effort and so that as many people as possible benefit from the developments.

We have readers for EBSD data (not orientation data) acquired on square grids (hexagonal unsupported) in the following EDAX formats in kikuchipy:

Binary .up1 (8-bit), .up2 (16-bit). The reader is based on work by @drowenhorst-nrl for PyEBSDIndex
HDF5 (reader using the HDF5 reader base class). This reader is based mostly on files I've received from colleagues, such as @CiosG.

These are established readers, published to PyPI, which are actively in use and have been patched a couple of times (they are supported).

> The key question is how to share such code and make it useful for hyperspy

My suggestion is to implement a reader here in RosettaSciIO.

> [...] following should happen for hyperspy: answer which data and metadata are relevant for hyperspy and pull them over into dictionaries

kikuchipy is HyperSpy's extension for EBSD data. The current reader of EDAX' HDF5 files in kikuchipy reads the patterns, detector information (the pattern centers, camera tilt, sample tilt, camera azimuthal angle [about the vertical]), the map step sizes, working distance, and magnification. I'd start with this.

kikuchipy and HyperSpy's extension for diffraction in TEM, pyxem, both use orix for handling of orientation data. There are many readers of orientation data in orix, however, a reader for orientation data from EDAX HDF5 files is missing. RosettaSciIO so far does not have any readers of orientation data. If it is decided that reading orientation data is within the scope of RosettaSciIO, we should move IO from orix to RosettaSciIO. Again, coordinating efforts would be best. Readers in orix should return dictionaries of arrays and such, but as of now they return a CrystalMap class instance, which is similar to MTEX' EBSD class. A CrystalMap expects this information. I'd start with this.

CSSFrancis commented 11 months ago

We probably need to think a little bit more about what happens when a signal doesn't fit nicely into a hyperspy signal. I think that we should support every file format that people are willing to add, but we probably need to just add a flag that this file format isn't supported by hyperspy (and maybe a suggestion of what software to use).
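One way such a flag could look, as a sketch only: the `ReaderInfo` class, its field names and the `describe_support` helper are hypothetical and do not correspond to anything that exists in hyperspy or rosettasciio.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReaderInfo:
    """Hypothetical per-format record; none of these fields exist upstream."""

    name: str
    maps_to_hyperspy_signal: bool
    suggested_package: Optional[str] = None


def describe_support(info):
    """Build the message shown to a user asking to load this format."""
    if info.maps_to_hyperspy_signal:
        return f"{info.name}: can be loaded as a hyperspy signal"
    message = f"{info.name}: does not map onto a hyperspy signal"
    if info.suggested_package:
        message += f"; consider using {info.suggested_package}"
    return message


# Illustrative entries only; the support status shown is an assumption.
eds_map = ReaderInfo("EDS map", maps_to_hyperspy_signal=True)
orientations = ReaderInfo(
    "orientation data", maps_to_hyperspy_signal=False, suggested_package="orix"
)
```

Keeping the flag next to the reader description means the format can still be read into plain dictionaries, while hyperspy can decline politely and point elsewhere.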

@jlaehne Do you have any thoughts on that?

hakonanes commented 11 months ago

> We probably need to think a little bit more about what happens when a signal doesn't fit nicely into a hyperspy signal

This discussion probably deserves its own issue?

ericpre commented 11 months ago

Thank you @reetuelzajoseph and @mkuehbach for your comments.

> I have implemented Python code which parses all of the data nodes in the above-mentioned EDAXH5 HDF5 file(s). The key question is how to share such code and make it useful for hyperspy @ericpre

@mkuehbach, as already mentioned by @hakonanes, rosettasciio would be a very good place to share such a reader, and it would be readily available to hyperspy users or other libraries using hyperspy. rosettasciio will start to be used in hyperspy 2.0, which still needs to be released, but there is good progress on that front! If you are interested in doing that, that would be great, and please provide feedback: as most of the readers have grown organically over the years, which overall has been very useful and successful, there is quite a lot of room for improvement in the consistency between readers, the documentation, the way things are being done, etc.

> The much more important practical question though is what do the many terms in these fields mean: The complete glossary as based on examples is parsed out here: https://github.com/FAIRmat-NFDI/em4nfdi Every community feedback to resolve terms in the glossary via a simple text or Excel file feedback is appreciated, especially to clarify units etc. Ideally this should be merged back into a glossary of the EDAXH5 formats, or easier, representatives from EDAX could help here to avoid such reverse engineering.

> @ericpre Do you think there is a better place where such a collection of reverse engineered Rosetta stone for EDAXH5 could be stored? If so we should have a meeting with @reetuelzajoseph to make this happen.

We started to collect known information about file formats/specifications to make it more accessible, but little effort has gone into this, and it is something that we should enforce, particularly when implementing new formats/functionalities. Quite often, something is implemented using reference material available online and, at some point, this material is not available anymore... so we should make sure that we don't lose this information. Of course, there is a lot of information within the readers themselves, but this is not easy to read or is poorly documented, and we should have documentation of the file format itself as part of the documentation. This would help a lot with maintenance! As this should be done at the same time as new functionalities/formats are implemented, I think that it would make sense to add it to the rosettasciio documentation, and this is something that the community could help with during review, etc.

To give more context on rosettasciio:

The IO code was recently split out of hyperspy with the aim of making it more accessible to other libraries/frameworks.
In terms of metadata handling, the current approach is pragmatic, in the sense that we try to read all possible metadata as they come into original_metadata and parse some of them into metadata, a tree structure which has been defined in an ad-hoc fashion. This works, but not very well, and it is something that will need improvement sooner rather than later, because it is still very hyperspy-centric and hyperspy has its own set of metadata that is used for various functionalities. Hopefully, at some point, we can start to use an established standard, and ongoing efforts by other initiatives should help with this! 😃
The public API is currently very minimal, but as the community grows and we gain better understanding on what API would be useful, it is likely that the public API will be extended.
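The original_metadata to metadata step described above can be sketched as follows. This is an illustration under assumptions, not rosettasciio code: the vendor-side keys (`Header.kV`, `Header.WorkingDistance`) are invented, the target keys merely follow hyperspy-style naming, and the helper names are hypothetical.

```python
def get_nested(tree, dotted_key, default=None):
    """Look up 'a.b.c' in nested dicts, returning ``default`` if absent."""
    node = tree
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


def set_nested(tree, dotted_key, value):
    """Set 'a.b.c' in nested dicts, creating intermediate levels as needed."""
    parts = dotted_key.split(".")
    node = tree
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value


# Left: invented vendor keys. Right: hyperspy-style targets (assumed naming).
MAPPING = {
    "Header.kV": "Acquisition_instrument.SEM.beam_energy",
    "Header.WorkingDistance": "Acquisition_instrument.SEM.working_distance",
}


def build_metadata(original_metadata, mapping=MAPPING):
    """Copy the mapped subset of original_metadata into a new metadata tree."""
    metadata = {}
    for source, target in mapping.items():
        value = get_nested(original_metadata, source)
        if value is not None:
            set_nested(metadata, target, value)
    return metadata


# Everything is kept in original_metadata; only mapped keys are promoted.
original = {"Header": {"kV": 20.0, "UnknownVendorField": 42}}
parsed = build_metadata(original)
```

Keeping the mapping in one table makes it easy to review which vendor fields are trusted enough to be promoted, while everything else stays untouched in original_metadata.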

ericpre commented 11 months ago

> We probably need to think a little bit more about what happens when a signal doesn't fit nicely into a hyperspy signal

> This discussion probably deserves its own issue?

Yes, can you please open a separate issue? You will have a better idea than me of how to describe the needs, whether it would make sense or not, etc. A few readers read simple processed data (for example, an EDS map), but orientation data may have a structure which is more complicated than that - I don't know! 😉

mkuehbach commented 11 months ago

In the most general case, orientation data are de facto a point cloud with mark data. For each point in R^2 or R^3 (provided some serial-sectioning reconstruction was performed) there are associated mark data, that is: phase (an id from a dictionary of phases against which one indexed, or e.g. a marker notIndexed), an orientation value tuple (depending on the parameterization), and pattern and, possibly, indexing quality descriptors (bc, ci, etc.). In that sense EBSD data are different, although for simplicity they are typically regridded to regular 2D or 3D grids to support stencil operations.
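The "point cloud with mark data" view, and the regridding step, can be sketched as below. This is a simplified illustration and not NXem_ebsd or orix code: the class `OrientationPoint`, the convention that phase id 0 means notIndexed, and the nearest-node regridding are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class OrientationPoint:
    """One measurement position with its associated mark data."""

    position: Tuple[float, ...]       # (x, y) or (x, y, z)
    phase_id: int                     # assumption: 0 stands for notIndexed
    orientation: Tuple[float, ...]    # e.g. Euler angles or a quaternion
    quality: Dict[str, float] = field(default_factory=dict)  # e.g. {"ci": 0.9}


def to_regular_grid(points, step, shape):
    """Snap 2D points to the nearest node of a (rows, cols) grid.

    Nodes without a measurement stay None; points outside the grid are
    dropped. If two points land on one node, the later one wins.
    """
    rows, cols = shape
    grid = [[None] * cols for _ in range(rows)]
    for point in points:
        col = round(point.position[0] / step)
        row = round(point.position[1] / step)
        if 0 <= row < rows and 0 <= col < cols:
            grid[row][col] = point
    return grid


# Three scattered beam positions placed on a 2 x 2 grid with unit step.
cloud = [
    OrientationPoint((0.0, 0.0), 1, (0.0, 0.0, 0.0), {"ci": 0.9}),
    OrientationPoint((1.0, 0.0), 0, (0.0, 0.0, 0.0)),
    OrientationPoint((0.1, 0.9), 1, (0.1, 0.2, 0.3)),
]
grid = to_regular_grid(cloud, step=1.0, shape=(2, 2))
```

Storing the point cloud as the primary representation and deriving the grid from it keeps randomly sampled beam positions and simple square grids under one data model.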

From what I've seen in the literature, https://github.com/FAIRmat-Experimental/nexus-fairmat-proposal/blob/main/9636feecb79bb32b828b1a9804269573256d7696/_sources/classes/contributed_definitions/NXem_ebsd.rst.txt is the most complete and systematic approach.

With the newer AI/ML-based indexing solutions, comparing simulated with measured patterns (i.e. indexing) may yield probabilistic/fuzzy descriptors for the quality of the match. This differs from traditional approaches, e.g. the confidence index, which essentially evaluate a point as indexed or not against a threshold. When I wrote NXem_ebsd, I drafted it generally, with the aim that most people can use it irrespective of whether they have simple grids or randomly sampled beam positions with measurements. Ignoring for now all existence constraints mentioned in NXem_ebsd, the taxonomy within NXem_ebsd gives quite a complete view of how EBSD data are pumped out by software (open source or technology partner).

@ericpre thanks for your feedback, I will follow up. I am currently implementing something else. What is the timeline of the hspy 2.0 release, so that we can plan accordingly and decide how to prioritize implementing this?

ericpre commented 11 months ago

@mkuehbach, this should be released between mid- and end of September.

hyperspy / rosettasciio

Add support for the new EDAX .hd5 format #16