usnistgov / PyHyperScattering

Tools for hyperspectral x-ray and neutron scattering data loading, reduction, slicing, and visualization.

Export Function for (Integrated) Datasets? #97

Open BijalBPatel opened 11 months ago

BijalBPatel commented 11 months ago

Occasionally I find it useful to export the scattering dataset after integration, but the built-in xarray to_netcdf() method raises clunky errors on datetime.datetime attributes and on attributes that contain nested dicts.

Would it be useful to build in export/load functions? Where should it go?

pbeaucage commented 11 months ago

Already exists in some form: look at PyHyperScattering.util.FileIO and the methods therein (saveNexus, savePickle, loadNexus, loadPickle).

A function in FileIO that sanitized the attributes to allow NetCDF serialization would be very useful, as would documentation improvements around the existing save/load functionality.
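
Before sanitizing, it can help to see which attributes are likely to trip the netCDF writer. The sketch below is only illustrative: `find_problem_attrs` is a hypothetical helper (not part of PyHyperScattering or xarray), and the allowed types are an assumption based on the types xarray reports as valid in its error message (str, numbers, lists, tuples; numpy arrays are left out here to keep the example stdlib-only). The exact set depends on the xarray version and backend.

```python
import datetime
import numbers

# Assumed set of attribute types that serialize to netCDF without help.
# This is an approximation and may differ by xarray version and backend.
ALLOWED_TYPES = (str, numbers.Number, list, tuple)


def find_problem_attrs(attrs: dict) -> dict:
    """Map each attribute name to its type name if it is likely to fail on export."""
    return {
        key: type(value).__name__
        for key, value in attrs.items()
        if not isinstance(value, ALLOWED_TYPES)
    }


# Stand-in for the .attrs of an integrated DataArray
example_attrs = {
    "energy": 285.0,
    "sample_name": "example",
    "start_time": datetime.datetime(2023, 1, 1),
    "motors": {"sample_x": 1.5},
}
problems = find_problem_attrs(example_attrs)
```

For a real dataset the call would be `find_problem_attrs(data.attrs)`, where `data` is the integrated DataArray.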

BijalBPatel commented 11 months ago

I have a messy stub for netCDF; I can take this on during the hackathon. Pardon the formatting below:

import copy
import datetime
import json

import xarray as xr


def saveScan(int_scans: xr.DataArray, outPath: str):
    """Saves an xr.DataArray containing scattering data to a netCDF file.

    Converts datetime attributes to strings (one-way conversion) and uses json.dumps()
    to convert nested dicts to str (reversed on load with loadScan).

    Parameters
    ----------
    int_scans : xr.DataArray
        xarray DataArray containing scattering data
    outPath : str
        target output path (containing filename and extension)
    """
    # Work on a deep copy so the caller's attributes are left untouched
    int_scans_out = copy.deepcopy(int_scans)

    # Convert problematic (non-serializable) attributes
    keys = list(int_scans_out.attrs.keys())
    for attr in keys:
        # Convert datetime to str
        if isinstance(int_scans_out.attrs[attr], datetime.datetime):
            int_scans_out.attrs[attr] = str(int_scans_out.attrs[attr])
        # Serialize dicts
        if isinstance(int_scans_out.attrs[attr], dict):
            # Mark the attribute as JSON-encoded by prefixing its name
            newKey = "json_" + attr
            # TODO: handle errors on unserializable keys/values; for now just tries to convert to str
            int_scans_out.attrs[newKey] = json.dumps(int_scans_out.attrs[attr], default=str)
            del int_scans_out.attrs[attr]

    # Save integrated data
    int_scans_out.to_netcdf(outPath)


def loadScan(inPath: str):
    """Loads an xr.DataArray from a netCDF file generated by saveScan().

    Attempts to revert JSON-encoded nested dict attributes. Note: probably doesn't preserve data types.

    Parameters
    ----------
    inPath : str
        input path (containing filename and extension)

    Returns
    -------
    xr.DataArray containing scattering data
    """
    # Load from file
    scans_in = xr.load_dataarray(inPath)

    # Revert JSON-encoded attributes
    keys = list(scans_in.attrs.keys())
    for attr in keys:
        # Only keys that start with the "json_" prefix were encoded by saveScan
        if isinstance(attr, str) and attr.startswith("json_"):
            scans_in.attrs[attr[len("json_"):]] = json.loads(scans_in.attrs[attr])
            del scans_in.attrs[attr]

    # Return loaded data
    return scans_in

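
To make the "doesn't preserve data types" caveat concrete, here is a minimal sketch that applies the same conversions to a plain dict, so it runs without xarray or a netCDF backend. `sanitize_attrs` and `restore_attrs` are hypothetical names used only for this illustration; they mirror the attribute handling in the stub but are not existing PyHyperScattering functions.

```python
import datetime
import json

JSON_PREFIX = "json_"


def sanitize_attrs(attrs: dict) -> dict:
    """Return a copy with datetimes stringified and nested dicts JSON-encoded."""
    out = {}
    for key, value in attrs.items():
        if isinstance(value, datetime.datetime):
            out[key] = str(value)  # one-way: comes back as a string
        elif isinstance(value, dict):
            # default=str also stringifies anything json cannot encode
            out[JSON_PREFIX + key] = json.dumps(value, default=str)
        else:
            out[key] = value
    return out


def restore_attrs(attrs: dict) -> dict:
    """Decode attributes whose names carry the JSON prefix."""
    out = {}
    for key, value in attrs.items():
        if isinstance(key, str) and key.startswith(JSON_PREFIX):
            out[key[len(JSON_PREFIX):]] = json.loads(value)
        else:
            out[key] = value
    return out


stamp = datetime.datetime(2023, 1, 2, 3, 4, 5)
attrs = {
    "start_time": stamp,
    "motors": {"sample_x": 1.5, "moved_at": stamp},
    "energy": 285.0,
}
restored = restore_attrs(sanitize_attrs(attrs))
```

After the round trip the nested dict is a dict again and plain numbers survive, but both datetimes come back as strings. If the original types matter, they would need to be re-parsed on load.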
pbeaucage commented 11 months ago

Looks good!

Worth considering orjson (https://github.com/ijl/orjson) which correctly serializes numpy types.
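
A rough sketch of what swapping in orjson could look like for the dict attributes. It assumes orjson is installed and uses its `OPT_SERIALIZE_NUMPY` option. `dumps_attr` is a hypothetical helper name, and the stdlib fallback (including the `tolist()` duck-typing for numpy objects) is an addition for illustration, not something proposed in this thread.

```python
import datetime
import json

try:
    import orjson  # third-party; assumed to be installed
except ImportError:
    orjson = None


def _fallback_default(obj):
    """Best-effort encoder used only when orjson is unavailable."""
    if isinstance(obj, (datetime.datetime, datetime.date)):
        return obj.isoformat()
    if hasattr(obj, "tolist"):  # numpy arrays and scalars expose tolist()
        return obj.tolist()
    return str(obj)


def dumps_attr(value) -> str:
    """Encode an attribute value as a JSON string suitable for a netCDF attribute."""
    if orjson is not None:
        # orjson returns bytes, so decode to str before storing as an attribute
        return orjson.dumps(value, option=orjson.OPT_SERIALIZE_NUMPY).decode("utf-8")
    return json.dumps(value, default=_fallback_default)


encoded = dumps_attr(
    {"moved_at": datetime.datetime(2023, 1, 2, 3, 4, 5), "limits": {"x": [1, 2]}}
)
decoded = json.loads(encoded)
```

Either path yields a str that the standard json module can read back, so the loading side would not need to change. Datetimes still come back as ISO-format strings rather than datetime objects.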