recipy / recipy

Effortless method to record provenance in Python
https://recipy.readthedocs.io
Apache License 2.0

add support for xarray patching #176

Open rabernat opened 7 years ago

rabernat commented 7 years ago

This looks like a fantastic project with great potential to enhance scientific reproducibility. Thanks to the developers for all of your efforts.

I'm opening this issue to suggest adding patch support for xarray: https://github.com/pydata/xarray

Xarray is complementary to pandas and provides an interface for loading, analyzing, visualizing, and outputting labeled multi-dimensional array data. Its adoption is increasing rapidly in physical sciences, finance, and other fields.

jvdzwaan commented 6 years ago

@rabernat do you maybe have example files that can be opened using open_mfdataset, open_rasterio, open_zarr, and open_dataarray? Preferably small files, that I'm allowed to add to recipy as test data.

rabernat commented 6 years ago

Thanks for looking into this!

A good way to proceed would be to use the xarray tutorial datasets, which live in their own repository: https://github.com/pydata/xarray-data

These can be opened via the xarray.tutorial module, as shown in the xarray docs: http://xarray.pydata.org/en/latest/examples/multidimensional-coords.html. Or you can just open the files directly.

There are unfortunately no zarr or rasterio (i.e. geotiff) datasets there yet. I recommend you get started with netCDF files, which represent 95% of xarray use cases. In the meantime, I will work on providing examples of the other formats.

jvdzwaan commented 6 years ago

@rabernat Thanks! In the meantime, I found sufficiently small netCDF data for testing the patch of open_mfdataset. For open_rasterio, I just used a standard tiff file, since we are only interested in whether the input is logged, not in whether it really is a geotiff.
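To illustrate what "determining whether the input is logged" means, here is a minimal stdlib-only sketch. It is not recipy's actual patching code: `fake_lib`, `log_inputs`, and `logged_inputs` are hypothetical names, and the lambda only stands in for a real opener such as open_rasterio.

```python
import functools
import types

# Hypothetical stand-in for a library that exposes a file-opening function.
fake_lib = types.SimpleNamespace(
    open_rasterio=lambda path: f"<data from {path}>"
)

logged_inputs = []  # hypothetical provenance log


def log_inputs(func):
    """Wrap func so that its first argument is recorded as an input file."""
    @functools.wraps(func)
    def wrapper(path, *args, **kwargs):
        logged_inputs.append(str(path))
        return func(path, *args, **kwargs)
    return wrapper


# Patch: replace the library attribute with the logging wrapper.
fake_lib.open_rasterio = log_inputs(fake_lib.open_rasterio)

# The test only cares that the path was recorded, not what the file contains.
result = fake_lib.open_rasterio("plain.tiff")
```

Because the wrapper records the path before delegating to the original function, any openable file is enough to exercise the logging.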

So, now I'm just looking for zarr data.

rabernat commented 6 years ago

I can prepare a zarr file for you.

When you say "small files", can you be more specific?

Note that zarr can read from a wide range of stores (see xarray docs and zarr docs). How important is it to cover all of these different cases?

rabernat commented 6 years ago

Also, since zarr datasets can be opened directly by the zarr library, you might want to consider a dedicated patch for zarr (without xarray). I anticipate that zarr will grow in popularity over the coming years.

jvdzwaan commented 6 years ago

It would be great if you could prepare a zarr file! By small I mean kilobytes (the netCDF files I found are 67 kB). The data does not have to make sense, but it must be valid.

I'll look into the different zarr storage types later. I'd like to cover as many file-based storage types as possible.

And correct me if I'm wrong, but all data loading/saving methods in xarray seem to come from other libraries (I started with a patch for netcdf4). So maybe we don't need a patch for xarray at all 😄 (I'll finish it anyway)
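A toy sketch of that idea, assuming the high-level library looks up the low-level function on its module at call time. `backend` and `frontend` are hypothetical stand-ins here, not the real netCDF4 or xarray modules.

```python
import types

# Hypothetical low-level library (stands in for something like netCDF4).
backend = types.ModuleType("backend")
backend.read_file = lambda path: {"source": path}

# Hypothetical high-level library that delegates loading to the backend.
frontend = types.ModuleType("frontend")


def open_dataset(path):
    # backend.read_file is looked up at call time, so a later patch is seen.
    return backend.read_file(path)


frontend.open_dataset = open_dataset

seen = []  # hypothetical input log
_original = backend.read_file


def logging_read_file(path):
    """Record the path, then call the unpatched backend function."""
    seen.append(path)
    return _original(path)


backend.read_file = logging_read_file  # patch only the backend

# Calling the frontend still goes through the patched backend function.
data = frontend.open_dataset("example.nc")
```

One caveat: if the high-level library had bound the function by name before the patch was applied (for example with `from backend import read_file`), it would keep calling the unpatched version, so a dedicated patch could still be needed.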

rabernat commented 5 years ago

Is this waiting for me?

jvdzwaan commented 5 years ago

No, don't worry about it! The work of patching xarray is more or less done (if you could prepare a small zarr file for testing that would still be helpful). I plan to merge this feature soon, but there are some things that need to happen first.