usnistgov / PyHyperScattering

Tools for hyperspectral x-ray and neutron scattering data loading, reduction, slicing, and visualization.

refactor: xr.Dataset as primary data structure #62

Open martintb opened 1 year ago

martintb commented 1 year ago

Many of the pain points of working with multi-indices could be mitigated by reworking pyhyper to use xr.Dataset rather than xr.DataArray as its primary data structure. One example would be avoiding memory-intensive unstack() calls.

Datasets would also allow easy switching between different coordinate systems and storage of non-coordinate data associated with the experiment.

More details and examples to follow.

pbeaucage commented 1 year ago

related to #39

This is a good idea and not hard to implement, but it is a large API break.

pdudenas commented 1 year ago

Any thoughts on how we should structure the fundamental dataset? Going off of the xarray documentation example, they structure their data like this:
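(The original inline example was lost in extraction; here is a hedged sketch in the spirit of the xarray documentation's canonical example, with two data variables sharing dims and auxiliary non-dimension coordinates. Values are illustrative only.)

```python
import numpy as np
import xarray as xr

# Two data variables on shared dims (x, y, time), plus 2-D lat/lon
# coordinates that are not themselves dimensions.
temperature = 15 + 8 * np.random.randn(2, 3, 4)
precipitation = 10 * np.random.rand(2, 3, 4)

ds = xr.Dataset(
    data_vars={
        "temperature": (["x", "y", "time"], temperature),
        "precipitation": (["x", "y", "time"], precipitation),
    },
    coords={
        "lon": (["x", "y"], [[-99.83, -99.32, -99.79], [-99.23, -99.61, -99.42]]),
        "lat": (["x", "y"], [[42.25, 42.21, 42.63], [42.63, 42.59, 42.41]]),
        "time": np.arange(4),
    },
)
```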

For us an abstract example could be something like this:
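(This inline example was also lost in extraction; one possible shape, with hypothetical variable and coordinate names, might be intensity data indexed by scan and q, with per-scan metadata riding along as non-dimension coordinates.)

```python
import numpy as np
import xarray as xr

# Hypothetical sketch: scan_id and q are dimensions; edge,
# polarization, and temperature are per-scan metadata stored as
# non-dimension coordinates along scan_id.
n_scan, n_q = 3, 100
ds = xr.Dataset(
    data_vars={
        "intensity": (["scan_id", "q"], np.random.rand(n_scan, n_q)),
        "intensity_uncertainty": (["scan_id", "q"], np.random.rand(n_scan, n_q)),
    },
    coords={
        "scan_id": [1001, 1002, 1003],
        "q": np.linspace(0.01, 0.5, n_q),
        "edge": ("scan_id", ["C", "C", "N"]),
        "polarization": ("scan_id", [0, 90, 0]),
        "temperature": ("scan_id", [25.0, 25.0, 80.0]),
    },
)
```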

Do we name each data_vars entry by its scan name?

And does moving to a dataset inherently solve #39 or do we need to be careful in how we structure the dataset to avoid those same issues?

pbeaucage commented 1 year ago

I would make scan_id and related things like edge, polarization, temperature, etc coordinates of the dataset.

The data variables would then be a standard set of terms like scattering intensity, intensity uncertainty, incident intensity i0, transmitted intensity it, sample drain current, and possibly (where supported) instrument specific terms or secondary measurements.

The major change is from single scan or single experiment (thermal anneal, shear series) being the primary structure to the primary structure being a whole experiment as a series of samples. This is sort of like loadSeries in SST1RSoXSDB. The major challenge here will be performance, I believe. Dask might help offset that some.

pbeaucage commented 1 year ago

One bit of context that @martintb understands better than I do: I think something throwaway like scan_id can be the dimension, and data variables like edge/temperature/etc can be promoted to or demoted from coordinate status at will, in a fairly performant way (compared with MultiIndexes).
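A minimal sketch of that pattern, assuming illustrative names: scan_id is the only dimension, and non-dimension coordinates can be used for selection or swapped into the indexing role without ever building a MultiIndex.

```python
import numpy as np
import xarray as xr

# scan_id is the only dimension; edge and temperature ride along
# as non-dimension coordinates on that dimension.
ds = xr.Dataset(
    {"intensity": ("scan_id", np.random.rand(4))},
    coords={
        "scan_id": [1, 2, 3, 4],
        "edge": ("scan_id", ["C", "C", "N", "N"]),
        "temperature": ("scan_id", [25.0, 50.0, 75.0, 100.0]),
    },
)

# Filter on a non-dimension coordinate -- no MultiIndex, no unstack():
carbon = ds.where(ds.edge == "C", drop=True)

# Promote temperature to the dimension coordinate (demoting scan_id):
by_temp = ds.swap_dims({"scan_id": "temperature"})
warm = by_temp.sel(temperature=75.0)
```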

pbeaucage commented 1 year ago

#52 is an example of something like this, where data can be labeled with pix_x/pix_y and q_x/q_y at the same time. In that example we pop the data back out of being a Dataset after the coordinate swap, but if the data were already a Dataset we wouldn't have to.
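A sketch of how both pixel and q labels could coexist on the same dims (names illustrative, not the actual code from that PR):

```python
import numpy as np
import xarray as xr

# A detector image whose dims carry both pixel indices and q values;
# q_x/q_y are 1-D coordinates attached to the pix_x/pix_y dims.
img = xr.DataArray(
    np.random.rand(8, 8),
    dims=["pix_y", "pix_x"],
    coords={
        "pix_x": np.arange(8),
        "pix_y": np.arange(8),
        "q_x": ("pix_x", np.linspace(-0.1, 0.1, 8)),
        "q_y": ("pix_y", np.linspace(-0.1, 0.1, 8)),
    },
)
ds = img.to_dataset(name="intensity")

# Swap to q-space indexing; the pixel labels are kept as
# non-dimension coordinates rather than discarded.
in_q = ds.swap_dims({"pix_x": "q_x", "pix_y": "q_y"})
```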

pdudenas commented 1 year ago

So you'd do something more like this?
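(The inline example here was lost in extraction; one way the stacking could look, sketched with a hypothetical load_scan helper, is concatenating per-scan Datasets along scan_id so that scalar metadata like temperature is promoted to a per-scan coordinate.)

```python
import numpy as np
import xarray as xr

# Hypothetical loader: each scan is a small Dataset with scalar
# scan_id and temperature coordinates.
def load_scan(scan_id, temperature):
    return xr.Dataset(
        {"intensity": ("q", np.random.rand(100))},
        coords={
            "q": np.linspace(0.01, 0.5, 100),
            "scan_id": scan_id,
            "temperature": temperature,
        },
    )

scans = [load_scan(i, t) for i, t in [(1001, 25.0), (1002, 50.0)]]

# concat promotes the scalar scan_id coords into a new dimension;
# temperature becomes a per-scan coordinate along it.
experiment = xr.concat(scans, dim="scan_id")
```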

Repeatedly stacking scans like that could be a source of performance issues, like you mentioned.