hyperspy / rosettasciio

Python library for reading and writing scientific data formats
https://hyperspy.org/rosettasciio
GNU General Public License v3.0
Improve metadata handling #89

Open jlaehne opened 1 year ago

jlaehne commented 1 year ago

Describe the functionality you would like to see.

As brought up by @francisco-dlp in https://github.com/LumiSpy/lumispy/issues/53#issuecomment-814772350, it would be desirable to have more universal metadata handling. Currently, every file_reader independently maps metadata from original_metadata following the HyperSpy conventions. If other packages want to build on RosettaSciIO, this is not the most convenient, and it also involves a lot of redundant code. Instead, we could, for example, use something like yaml files to define the mapping. Each folder could then include a hyperspy.yaml, but potentially also other mapping files for other applications.

Of course, metadata mapping is not always 1:1 (a node from one tree mapped directly to a position in the other metadata tree), which could be done using a basic dictionary. The mapping definition would need to cover several extra situations:

The developers of the https://github.com/nomad-coe/nomad repository/ELN have implemented similar functionality based on what they call "schemas". Maybe we can team up with them (@markus1978, @haltugyildirim) to implement such a mapping in RosettaSciIO. In turn, the ability to read a number of (partly binary) data formats provided by RosettaSciIO should be valuable to Nomad, both to support a broader range of experiments and to integrate processing via e.g. HyperSpy.

Additional information

Should not hold back an initial release, but should be on the roadmap.

francisco-dlp commented 1 year ago

Thanks @jlaehne for bringing back this important topic.

Indeed, RosettaSciIO maps all metadata to HyperSpy's metadata specification. This has the advantage that it can translate all mapped metadata across formats (hence the link with the Rosetta Stone), but it is an overhead when this is not required. Therefore, it should be an optional feature (task 1).

As you rightly point out, the mapping to HyperSpy's metadata specification is not done very smartly. Ideally, one should be able to specify the mapping using an easy-to-maintain mapping specification file, e.g. in yaml (task 2). The task is far from trivial, and it is of interest beyond RosettaSciIO, so ideally it should be performed by an independent tool. Nomad's schemas seem like a good candidate.

Finally, HyperSpy's metadata specification is defined in the User Guide. It would be better to define the metadata using e.g. Nexus' specifications or simply switch to Nexus' electron microscopy format (task 3).

ericpre commented 1 year ago

Now that there is a Nexus definition for electron microscopy, it would be great to use it and provide feedback on its usability.

jat255 commented 1 year ago

I wanted to share a few links to maybe push this discussion along (I think this is a great idea and would be interested in helping work on it, as interoperability is a critical part of a mature data ecosystem):

CSSFrancis commented 1 year ago

@jat255 These are all great resources. It does seem like there is a fair bit of duplication of effort occurring in the community, and it would be good to get ahead of that. Is there any way we can bring more people into the fold or integrate packages?

Developer time in the microscopy community seems to be very limited, so anything we can do to reduce duplication is very valuable!

Maybe a meeting with all interested parties would help to get the ball rolling.

CSSFrancis commented 1 year ago

@jat255 It seems like it might also be worthwhile to send someone to a MaDRA meeting. I can attend, but I don't know if I am the most qualified person to represent rosettasciio.