informatics-lab / tiledb_netcdf

An adapter to convert NetCDF files to TileDB arrays
GNU General Public License v3.0

some design questions for netcdf compatibility #49

Open rabernat opened 4 years ago

rabernat commented 4 years ago

I love this library and what it aims to accomplish. Kudos for working on it and sharing with the community! I hope you don't mind if I share some thoughts and feedback.

I think it's worth discussing some broader design issues about how to integrate new storage formats (tiledb and zarr) with the netcdf data model. @jhamman and I implemented the Zarr backend for Xarray, so we have some experience here. In that implementation, we made what I consider to be a rookie mistake: we wrote an implementation before defining a standard. We basically made some ad hoc choices about how to store the netCDF data model in Zarr in order to make it work quickly. In doing so, we effectively defined a new convention whose only documentation was the xarray source code. (We have retroactively documented it here: http://xarray.pydata.org/en/latest/internals.html#zarr-encoding-specification.) This was a bad approach because it made it hard for other libraries / languages to parse the "netCDF" (in scare quotes) data that was written by Xarray.

Instead, we should have first defined a convention, socialized it with other people (including Unidata; cc @wardf), documented it, and then written an implementation. It's not too late to do that for tiledb! This way, Julia or R users could read / write netcdf-compliant tileDB data in a reliable way.
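To make "a convention" concrete: the xarray Zarr encoding linked above records each array's dimension names in an attribute called `_ARRAY_DIMENSIONS`. The following is only a minimal sketch of that idea, using a plain dictionary and JSON rather than any real Zarr or TileDB API. The store layout (`/.attrs`, `/.shape` keys) and the helpers `write_variable` and `read_dimensions` are hypothetical and exist only for illustration.

```python
import json

DIMS_KEY = "_ARRAY_DIMENSIONS"  # attribute name used by xarray's Zarr encoding


def write_variable(store, name, shape, dims, attrs=None):
    """Record a variable's metadata in a dict-like store (hypothetical layout)."""
    if len(shape) != len(dims):
        raise ValueError("need exactly one dimension name per axis")
    metadata = dict(attrs or {})
    metadata[DIMS_KEY] = list(dims)
    # Metadata is serialized as JSON so that any language can parse it.
    store[name + "/.attrs"] = json.dumps(metadata)
    store[name + "/.shape"] = json.dumps(list(shape))


def read_dimensions(store, name):
    """Rebuild the {dimension name: size} mapping using only the convention."""
    metadata = json.loads(store[name + "/.attrs"])
    shape = json.loads(store[name + "/.shape"])
    return dict(zip(metadata[DIMS_KEY], shape))


store = {}
write_variable(store, "temperature", (12, 180, 360), ("time", "lat", "lon"),
               attrs={"units": "K"})
dims = read_dimensions(store, "temperature")
print(dims)
```

The reader function never imports the writer's library; it needs only the documented key name and layout. That is the property that lets a Julia or R reader be written independently of the Python writer.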

A second choice we made, which is not so obviously a mistake but definitely something to consider, is how we structured the compatibility layer. We chose to implement a standalone backend within xarray. Contrast this with @shoyer's approach in creating h5netcdf. In that library, he essentially re-implemented the netcdf4-python API, but without the netcdf dependency, using only h5py plus the established convention for storing netCDF in hdf5. The choice to re-use the existing netcdf4-python API was very smart, because then all downstream code built for using netcdf4-python (including xarray) required minimal changes to work with h5netcdf. tiledb_netcdf could follow that model. Among other advantages, this would give us an easy path to adding a tiledb backend to xarray. Instead, in its current form, tiledb_netcdf appears to have defined a new API, which might be harder to maintain in the long term.
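As a rough illustration of why re-using an existing API helps, here is a toy sketch. The method names `createDimension`, `createVariable`, `setncattr`, `getncattr` and `ncattrs` mirror ones that exist on netcdf4-python's `Dataset` and `Variable`. Everything else (the classes `InMemoryDataset` and `Variable`, the helper `describe`, and the in-memory storage) is hypothetical and greatly simplified. It is not how h5netcdf or tiledb_netcdf is implemented.

```python
class Variable:
    """Toy variable: holds dimension names, a dtype string, and attributes."""

    def __init__(self, name, datatype, dimensions):
        self.name = name
        self.dtype = datatype
        self.dimensions = tuple(dimensions)
        self._attrs = {}

    def setncattr(self, key, value):
        self._attrs[key] = value

    def getncattr(self, key):
        return self._attrs[key]

    def ncattrs(self):
        return list(self._attrs)


class InMemoryDataset:
    """Hypothetical backend exposing a small subset of the Dataset methods."""

    def __init__(self):
        # Simplification: maps name -> size (the real API stores Dimension objects).
        self.dimensions = {}
        self.variables = {}

    def createDimension(self, dimname, size=None):
        self.dimensions[dimname] = size
        return size

    def createVariable(self, varname, datatype, dimensions=()):
        missing = [d for d in dimensions if d not in self.dimensions]
        if missing:
            raise ValueError(f"undefined dimensions: {missing}")
        var = Variable(varname, datatype, dimensions)
        self.variables[varname] = var
        return var


def describe(ds):
    """Downstream code: relies only on the shared method and attribute names."""
    return {name: var.dimensions for name, var in ds.variables.items()}


ds = InMemoryDataset()
ds.createDimension("time", 12)
ds.createDimension("lat", 180)
temp = ds.createVariable("temperature", "f4", ("time", "lat"))
temp.setncattr("units", "K")
summary = describe(ds)
print(summary)
```

`describe` stands in for downstream code: it touches only `variables` and `dimensions`, so in principle it would work with any backend that exposes the same names, whatever the storage underneath.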

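To summarize, my recommendations (take them or leave them) are to:

- define, socialize, and document a convention for storing the netCDF data model in tiledb before going further with the implementation
- consider re-using the existing netcdf4-python API, as h5netcdf does, rather than defining a new one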

cc @normanb, who I know is interested in this.

normanb commented 4 years ago

Thanks @rabernat, all great points! Defining a convention is important as TileDB has multiple language and tool integrations already.

One important feature on our roadmap (with feedback we’ve gotten from the folks at the Informatics Lab) is pushing “axes labels” down to the core C++ library, which will help to simplify the mapping of netCDF conventions into a TileDB Array.

https://feedback.tiledb.com/tiledb-core/p/support-axes-labels

To define a standard for representing netCDF data as TileDB arrays we will need to work with the community to experiment and integrate with existing libraries. We have working code for https://gdal.org/user/multidim_raster_data_model.html and we will also patch the existing netcdf4 python libraries as per h5netcdf to read a TileDB array directly. From there we will be able to draft a standard for review with the Informatics Lab, Pangeo Data WG and other interested parties such as the users of our R and Java communities.

xarray integration is important and we will be testing this as part of the work.

This is quite high in our priority list, so please stay tuned!

DPeterK commented 4 years ago

Hi @rabernat - thanks very much for your interest in this library! It's probably worth just clarifying why this library primarily exists and therefore what it does and doesn't offer: as part of the Informatics Lab's project Cloud-Ready Data we needed a quick adaptor to go between NetCDF and TileDB - and then on from TileDB to Iris/Xarray, because no such adaptors existed at the time. So in many ways, this is a library of time-bound necessity rather than particularly defining a canonical approach for anything.

As such, it indeed falls into the same trap you describe about the Zarr backend for Xarray: it defines an implementation more than thinking about a standard for the implementation. Given my background with Iris, however, the implementation is CF-like, or perhaps more accurately, CF-lite. Again, this is on account of needing a solution in a time-bound situation.

Certainly I think that having a standard for representing NetCDF data (and maybe more broadly earth system data regardless of originating format) in TileDB is really important - so that we don't end up with one representation per implementation, and a whole lot of incompatible data! That wouldn't help at all with providing cloud-first, analysis-ready data. Very happy therefore to continue engaging with TileDB and the NetCDF community on this.