pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.62k stars 1.08k forks source link

Writing complex numbers to netCDF #9246

Closed ZedThree closed 1 month ago

ZedThree commented 4 months ago

Is your feature request related to a problem?

Currently, the only supported method for saving complex numbers to netCDF is to use the h5netcdf engine with invalid_netcdf=True.

The latest release of netCDF4, 1.7.1, now has support for complex numbers. Also, because it's built on the up-coming netCDF 4.9.3, it includes support for reading files written using h5netcdf with invalid_netcdf=True (and I think for pretty much all other types that fall under this, including enums and bools).

This means that there are now more options for users wanting to read/write complex numbers, although this depends on the version of netCDF4 in use and the flags used to open files.

Just to note that the complex number support in netCDF4 is done via converting to/from a compound datatype (or a complex dimension for netCDF-3 files), and is completely standard netCDF -- it's basically just a wrapper in the Python API.

Describe the solution you'd like

Either expose the auto_complex keyword for netCDF4.Dataset to the backend or set this automatically. This will either require using netCDF4>=1.7.1 or detecting the version.

Additionally, invalid_netcdf will no longer be strictly necessary for bools and complex numbers for h5netcdf, although again this will be dependent on the version of netCDF used -- this is sort of outside of xarray though.

I'm happy to implement this as it's pretty trivial, but is it better

Describe alternatives you've considered

No response

Additional context

There is currently an issue mixing netCDF4==1.7.1 and h5py==3.11.0 in the same python process, see: https://github.com/Unidata/netcdf4-python/issues/1343 and https://github.com/h5py/h5py/issues/2453. This might need to be resolved before xarray can require netCDF4==1.7.1

Any insights on how to resolve this would be much appreciated!

shoyer commented 4 months ago

Very exciting to see progress on this! Thanks for pushing this upstream.

Just to confirm, what is the minimum version of netCDF-C required for this functionality?

ZedThree commented 4 months ago

Just to confirm, what is the minimum version of netCDF-C required for this functionality?

For complex written via netCDF4: no minimum version, uses standard netCDF functionality (does depend on which netCDF-3/4 file format you use though)

For complex written via h5netcdf/h5py: 4.9.3 (unreleased, current main branch on github, but release expected imminently)

shoyer commented 4 months ago

What about support for tools that read netCDF via the netCDF-C library without using Python, e.g., ncdump?

I ask because I want to clarify that support for complex numbers is "official" for netCDF, which as far as I understand basically means that it's in the netCDF-C library. Right now, the documentation for netCDF4-Python suggests that complex number support requires the nc-complexhttps://github.com/PlasmaFAIR/nc-complex).

ZedThree commented 4 months ago

Ah sorry, to be clear, support for complex numbers is still not "official" for netCDF and not available directly in the netCDF-C library. The official Python interface has some helper functionality through the auto_complex keyword. The representation on disk uses existing netCDF-C functionality, with nc-complex handling the conversions between complex types and netCDF datatypes. This means that using nc-complex to write complex numbers doesn't affect other tools ability to read those files.

netCDF4.Dataset has auto_complex=False by default to emphasise that this is a helper utility.

ncdump will show the underlying representation:

import netCDF4

complex_array = np.array([0 + 0j, 1 + 0j, 0 + 1j, 1 + 1j, 0.25 + 0.75j])

with netCDF4.Dataset("complex.nc", "w", auto_complex=True) as nc:
    nc.createDimension("x", size=len(complex_array))
    var = nc.createVariable("data", "c16", ("x", ))
    var[:] = complex_array

With the current netCDF-C release, 4.9.2:

$ ncdump complex.nc 
netcdf complex {
types:
  compound _PFNC_DOUBLE_COMPLEX_TYPE {
    double r ;
    double i ;
  }; // _PFNC_DOUBLE_COMPLEX_TYPE
dimensions:
        x = 5 ;
variables:
        _PFNC_DOUBLE_COMPLEX_TYPE data(x) ;
data:

 data = {0, 0}, {1, 0}, {0, 1}, {1, 1}, {0.25, 0.75} ;
}

There's nothing special about this file, and it should be readable by any tool that can read standard netCDF-4 files, but they'll get a compound datatype:

with netCDF4.Dataset("complex.nc", "r") as nc:
    print(nc["data"])
    print(nc["data"][:])

# <class 'netCDF4._netCDF4.Variable'>
# compound data(x)
# compound data type: {'names': ['r', 'i'], 'formats': ['<f8', '<f8'], 'offsets': [0, 8], 'itemsize': 16, 'aligned': True}
# unlimited dimensions: 
# current shape = (5,)
# [(0.  , 0.  ) (1.  , 0.  ) (0.  , 1.  ) (1.  , 1.  ) (0.25, 0.75)]

with netCDF4.Dataset("complex.nc", "r", auto_complex=True) as nc:
    print(nc["data"])
    print(nc["data"][:])

# <class 'netCDF4._netCDF4.Variable'>
# compound data(x)
# compound data type: complex128
# unlimited dimensions: 
# current shape = (5,)
# [0.  +0.j   1.  +0.j   0.  +1.j   1.  +1.j   0.25+0.75j]

Creating the same file with h5netcdf and invalid_netcdf=True, and using the development version of netCDF-C's ncdump:

netcdf complex {
types:
  compound _AnonymousCompound1 {
    double r ;
    double i ;
  }; // _AnonymousCompound1
dimensions:
        x = 5 ;
variables:
        _AnonymousCompound1 data(x) ;
data:

 data = {0, 0}, {1, 0}, {0, 1}, {1, 1}, {0.25, 0.75} ;
}

Literally the only difference here is netCDF4/nc-complex creates a named type, while h5py uses an unnamed type that the netCDF-C library gives a temporary name to. Any tool that uses this version of netCDF-C (or uses HDF5) will be able to read this file.

(In principle, h5py could commit this datatype to the file, and then current versions of netCDF would be able to read it too)


Lastly, HDF5 is now adding native support for complex numbers that will also be able to read the {r, i} compound datatype used by h5py and nc-complex. I believe this will be in 1.15. After that, we might be able to get native support into netCDF-C, but I suspect that will take some time.

Sorry for the info dump, there's multiple libraries and versions thereof, so I just wanted to be really clear about what all the different capabilities are!

shoyer commented 4 months ago

(In principle, h5py could commit this datatype to the file, and then current versions of netCDF would be able to read it too)

This would be very nice!

Overall, it seems that things have changed for the better since when I first added support for writing invalid netCDF into h5netcdf, because netCDF-C can now read the data associated with these HDF5 enum types.

At this point, I would say complex number support could be considered a "convention" for writing netCDF files. That means we could add direct support for reading/writing complex numbers in Xarray, without requiring invalid_netcdf=True.

ZedThree commented 4 months ago

Ok, I'll have a stab at a PR for this then!