pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

xarray contrib module #1850

Closed shoyer closed 4 years ago

shoyer commented 6 years ago

Over in #1288 @nbren12 wrote:

Overall, I think the xarray community could really benefit from some kind of centralized contrib package which has a low barrier to entry for these kinds of functions.

Yes, I agree that we should explore this. There are a lot of interesting projects building on xarray now but not great ways to discover them.

Are there other open source projects with a good model we should copy here?

Two different models come to mind: sklearn's separately maintained contrib packages and TensorFlow's bundled tensorflow.contrib submodule. The first, "separate repository" model might be easier and more flexible from a maintenance perspective. Any preferences or thoughts?

There's also some nice overlap with the Pangeo project.

nbren12 commented 6 years ago

Thanks for starting this issue @shoyer. One thing I would be interested to know is how sklearn and tensorflow balance code-quality and API consistency with low barrier to entry. For instance, most of the sklearn contrib packages provide classes which inherit from sklearn's Transformer, BaseEstimator, or Regressor classes, which ensures that all the contrib packages share a common interface.
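For reference, the shared-interface convention nbren12 describes can be illustrated in a few lines. This is a from-scratch sketch of the pattern (the `BaseTransformer` stand-in and `Centerer` example are invented for illustration; real sklearn-contrib packages subclass `sklearn.base.BaseEstimator` and `TransformerMixin` instead):

```python
# Sketch of the sklearn-contrib convention: estimators expose fit/transform
# with a shared signature, so contrib packages stay interchangeable.
# Illustrated here without importing sklearn itself; real contrib packages
# would subclass sklearn.base.BaseEstimator and TransformerMixin rather
# than this stand-in base class.

class BaseTransformer:
    """Stand-in for sklearn's BaseEstimator + TransformerMixin."""

    def fit(self, X, y=None):
        return self  # fitting is a no-op by default

    def transform(self, X):
        raise NotImplementedError

    def fit_transform(self, X, y=None):
        # provided by TransformerMixin in real sklearn
        return self.fit(X, y).transform(X)


class Centerer(BaseTransformer):
    """Hypothetical contrib transformer: subtract the column means."""

    def fit(self, X, y=None):
        self.means_ = [sum(col) / len(col) for col in zip(*X)]
        return self

    def transform(self, X):
        return [[x - m for x, m in zip(row, self.means_)] for row in X]


X = [[1.0, 2.0], [3.0, 4.0]]
print(Centerer().fit_transform(X))  # columns now have zero mean
```

Because every transformer shares the same `fit`/`transform`/`fit_transform` surface, a contrib package can be dropped into existing pipelines without the caller caring where it came from.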

benbovy commented 6 years ago

I like the idea of regrouping contrib projects.

I'd be +1 for the "separate repository" model, which indeed looks easier from a maintenance perspective. However, with this model it would probably be a good idea to also follow some package naming convention (see #1447 for discussion) so that we could easily identify contrib projects in, e.g., import statements or with package managers. I don't have a strong opinion on this, though. Maybe it is too restrictive...

... which ensures that all the contrib packages share a common interface.

I'd see xarray contrib packages mainly providing Dataset or DataArray accessors that are too domain-specific to be added as "core" methods.

benbovy commented 6 years ago

Some additional thoughts:

One thing that I like about contrib modules "protected" within the xarray namespace is that it would really help us choose module names that are short, relevant, and ideally the same as the Dataset or DataArray accessors they provide.

However, it is likely that contrib modules will need domain-specific dependencies beyond the ones used in xarray "core". With the xarray.contrib model we may end up with a lot of optional dependencies, which may be annoying, e.g., for CI or packaging with conda-forge. To me it would be too restrictive not to allow such specific dependencies in contrib projects.

shoyer commented 6 years ago

I think domain specific dependencies are a pretty decisive argument in favor of the separate repository model.

TensorFlow doesn't relax its code quality standards for contrib packages -- it's more about reducing guarantees of API stability or maintenance. That works OK for TensorFlow in part because the authors of most contrib packages are Google software engineers.

gajomi commented 6 years ago

I don't have any strong opinion about separate repos or contrib submodules, so long as there is some way to improve discoverability of methods.

Having said that, many of the methods mentioned in #1288 are in the numpy namespace, and at least naively applicable to all domains. Would you consider numpy methods with semantics compatible with DataArrays and/or Datasets as appropriate to contribute to core xarray?
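On gajomi's point: many NumPy functions already work on xarray objects with labels preserved, which is part of why they feel "naively applicable" across domains. A quick check:

```python
# NumPy ufuncs applied to a DataArray return a DataArray with dims and
# coords intact, which is why "numpy methods with compatible semantics"
# map so cleanly onto xarray objects.
import numpy as np
import xarray as xr

da = xr.DataArray(np.array([0.0, np.pi / 2]), dims="x",
                  coords={"x": [10, 20]})
out = np.sin(da)
print(type(out).__name__, out.dims)  # still a labeled DataArray
```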

nbren12 commented 6 years ago

I agree that the separate repository model is probably best. However, should it be in just one repository or in many?

Using many repos would solve the domain-specific dependency problem, but the sklearn-contrib packages are not that discoverable IMO. I found two of them via google on separate occasions before realizing that they were part of the same github organization.

benbovy commented 6 years ago

should it be in just one repository or in many?

One repository for all contrib projects would be hard to maintain if we allow very specific projects, like a little xarray extension to work with the 'xyz' GCM model (which seems to be a common case for extensions). That said, it doesn't prevent us from adding bigger, generic repositories like xarray-scipy.

but the sklearn-contrib packages are not that discoverable IMO.

Hence the suggestion to choose some convention for package naming, e.g., something similar to dask related packages: dask-learn, dask-glm, dask-xgboost, etc.

benbovy commented 6 years ago

To make methods even more discoverable, we might also add an x prefix to DataArray or Dataset accessors. This would work quite well with auto-completion, even though x alone is very often used as a coordinate. As suggested in #1447, we could have something like

$ conda install xarray-scipy -c conda-forge
>>> import xarray as xr
>>> import xscipy
>>> da = xr.DataArray(...)
>>> da.xscipy.method()

But maybe that's too much x...
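A contrib package along these lines would hook into xarray through its public accessor-registration decorators. A minimal sketch, assuming a hypothetical `xscipy` accessor with an invented `demean` method:

```python
# Sketch of how a hypothetical contrib package ("xscipy", the name from
# the example above) could attach itself to DataArray objects using
# xarray's public accessor registry. The accessor name and the `demean`
# method are illustrative, not a real API.
import numpy as np
import xarray as xr


@xr.register_dataarray_accessor("xscipy")
class XScipyAccessor:
    def __init__(self, da):
        self._da = da

    def demean(self, dim):
        # toy stand-in for a real scipy-backed method
        return self._da - self._da.mean(dim)


da = xr.DataArray(np.array([1.0, 2.0, 3.0]), dims="x")
print(da.xscipy.demean("x").values)
```

Importing the contrib package would be enough to make `da.xscipy` available on every DataArray, which is what makes the naming convention visible at the call site.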

jhamman commented 6 years ago

My 2 cents: I think we could consider setting up an xarray-contrib organization. I don't see how an xr.contrib namespace buys us all that much, except for some additional book-keeping in the core xarray package. My thought would be to let individual projects decide 1) whether they want to reside inside the xarray-contrib organization, and 2) whether or not to use the accessor API now available in xarray. We could easily add a page to the xarray docs that points to a collection of projects.

Side note, we don't have to use it but I did grab the xarray-contrib organization name just in case.

max-sixty commented 6 years ago

Re the comment from @benbovy

Even before this, let's put a list of projects that are closely integrated with xarray somewhere?

nbren12 commented 6 years ago

@maxim-lian There is a very short list of such packages hidden in the xarray documentation (http://xarray.pydata.org/en/stable/internals.html?highlight=xgcm#extending-xarray).

In general, there are a ton of these awesome-... repos floating around the internet, which just list useful tools and libraries related to a given topic. For example, there are repos out there like awesome-python and awesome-bash. Maybe someone could start an awesome-xarray repo.

shoyer commented 6 years ago

Personally I'd rather have "awesome xarray" listed somewhere prominently in the xarray docs, along with inline mentions anywhere they are particularly relevant. The very short list that is currently there is based on a handful of projects that I knew about a few years ago, but it's definitely woefully out of date now.


rabernat commented 5 years ago

FYI, we have started https://github.com/pangeo-data/awesome-open-climate-science. It is not xarray specific, but contains many xarray-related packages. Please contribute!

nbren12 commented 5 years ago

Thanks @rabernat that awesome list looks pretty awesome.

However, I would still advocate for a more centralized approach to this problem. For instance, NCL has a huge library of contributed functions that it distributes along with the code. By now, I am sure that xarray users have reimplemented equivalents of essentially all of these functions, but without a centralized home it is still too difficult to find or contribute new code.

For instance, I have a useful wrapper for scipy.ndimage that I use all the time, but it seems like overkill to release and support a whole package for this one module. I would be much more likely to contribute a PR to a community-run repository, and I am also much more likely to use such a repo.
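nbren12's actual wrapper isn't shown in the thread, but the general pattern (applying a scipy.ndimage function along a named dimension while keeping labels) might look roughly like this; the `gaussian_filter` helper here is a hypothetical sketch, not his code:

```python
# Hedged sketch of the kind of scipy.ndimage wrapper described above
# (the actual code is not shown in the thread): apply a Gaussian filter
# along one named dimension while preserving dims and coordinates.
import numpy as np
import scipy.ndimage
import xarray as xr


def gaussian_filter(da, dim, sigma):
    """Filter `da` along `dim`, keeping its labels intact."""
    axis = da.get_axis_num(dim)  # translate the dimension name to an axis
    return xr.apply_ufunc(
        scipy.ndimage.gaussian_filter1d, da,
        kwargs={"sigma": sigma, "axis": axis},
    )


da = xr.DataArray(np.random.rand(10), dims="t",
                  coords={"t": np.arange(10)})
smoothed = gaussian_filter(da, "t", sigma=2.0)
print(smoothed.dims)
```

A handful of such thin wrappers is exactly the kind of code that feels too small for its own package but would fit naturally in a shared contrib repository.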

I would be more than willing to volunteer for such an effort, but I think it needs to involve multiple people. Various individuals have tried to make such repos on their own, but none seem to have reached critical mass; for example, https://github.com/crusaderky/xarray_extras and https://github.com/fujiisoup/xr-scipy. I think there should be multiple maintainers, so that if one person drops out, there is still activity on the repo.

rabernat commented 5 years ago

Just to add to the mix, we have our own package for spectra! https://xrft.readthedocs.io/en/latest/


teoliphant commented 5 years ago

A few comments:

1) You need to have separately managed repos, as you don't want the natural limits of group organization to bottleneck and limit the growth of the ecosystem (there is a reason SciPy broke up into scikits, and it hasn't gone far enough).

2) xarray should reify its API as soon as possible and own it (you may already be too late to pull back on ill-advised APIs).

3) A simple list like awesome xarray in a GitHub repo that is referenced by the xarray docs goes a long way toward a discoverable set of packages and helping people find each other. A namespace like xscipy would also work (but see the next comment).

4) We are working on producing scipy-like libraries that can work on arbitrary arrays (we call this informally uscipy). Perhaps uscipy and xscipy can join forces and define interfaces that assume labels may exist.

5) xarray can be a nice intermediate between up-stream scipy-like libraries, and implementation details like NumPy or xnd (or even dask). I'm quite sure that xarray could be backed by the low-level libraries in xnd (and it is a goal of xnd to support projects like xarray).

6) Long term, we at Quansight Labs are working on getting an array protocol into Python core itself. I suspect we should get labels put into that definition from the beginning, and will need feedback from this community to make that happen. Timing for this is a PEP by end of 2019. If someone is eager to work on this now, it could go faster.

shoyer commented 5 years ago

For what it's worth, TensorFlow has decided that bundling contrib modules into TensorFlow as tensorflow.contrib was a big mistake. It helped with discoverability, but resulted in a lot of confusion about what is a supported API and what isn't.

shoyer commented 5 years ago

@teoliphant thanks for sharing your thoughts!

I would be very happy to collaborate on what a protocol for labeled arrays in Python could look like. Xarray is one useful implementation of labeled arrays, but it's definitely not the only one.

nbren12 commented 5 years ago

I'd also like to thank @teoliphant for weighing in!

Bearing in mind the history of SciPy, I agree that the xarray community doesn't need 100% centralization, but there should be some consolidation. IMO, the current situation of "one graduate student/postdoc per package" is not sustainable.

rabernat commented 5 years ago

The approach we have been taking is to develop "micro-packages". We currently have three.

These packages share some common design principles. In particular, they are all fully lazy and dask-friendly, meaning that we can apply them to very large datasets (which is the main focus in our group). By keeping the packages small, they are more maintainable. Xgcm and Xrft probably have O(3) active contributors, primarily myself and grad students in my group. Small, but significantly different from 1. We use these packages heavily in everyday scientific work, so I know they are useful.

I would love to combine forces on a larger effort, but we have limited time and effort. For now, this situation doesn't seem too bad; it's kind of compatible with what @teoliphant was suggesting in his comment 1 above. I'm not sure that some mega xarray-contrib package would have the critical mass to be sustainable either.

nbren12 commented 5 years ago

To be clear, I think there is some optimal middle ground between the "mega xarray-contrib" package and the current situation. I think the "micro-package" approach works when the collection of micro-packages is maintained by an active, permanent entity (e.g., Ryan's research group). On the other hand, postdocs and grad students are very likely to leave the field entirely within a few years, at which point they will probably stop maintaining their "micro-packages".

rabernat commented 5 years ago

@nbren12 - the key difference for our micro-packages is that the primary maintainer is me, not my grad students, and I'm not going anywhere for now. 😉

I still agree that there is probably a better way to organize all of this. Just trying to share our perspective as an xarray-centric small research group.

andersy005 commented 4 years ago

The gentlest of bumps on this. Any updates or progress here? :smile: A couple of us at @NCAR (cc @kmpaul, @matt-long) are interested in the outcome of this issue.

dcherian commented 4 years ago

@andersy005 what kind of update are you looking for? I assume you are about to implement some general functionality but want to know where to put it?

andersy005 commented 4 years ago

I assume you are about to implement some general functionality but want to know where to put it?

This is correct.

One of the things we've been exploring is a "general resample utility" that would both enable fluid translation between data at different temporal intervals (this is one of the use cases) and be aware of things like time-boundary variables. The fundamental concepts here are analogous to those of ESMF's regridder.

We have a general, low-level prototype in https://github.com/coderepocenter/AxisUtilities. We think it would be beneficial to have this functionality in xarray instead of residing in yet another xarray-related package.
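The conservative-remapping idea behind such a utility can be sketched in a few lines of NumPy. This is an invented toy (`conservative_remap` is not the AxisUtilities API): data defined on one set of 1-D interval bounds is averaged onto another, weighting each source cell by its overlap with the target cell, just as a conservative regridder does in space:

```python
# Toy illustration of the idea behind the prototype described above:
# conservative (overlap-weighted) remapping of data from one set of 1-D
# axis intervals onto another. A from-scratch sketch, not the
# AxisUtilities API.
import numpy as np


def conservative_remap(src_bounds, dst_bounds, values):
    """Average `values` (defined on src intervals) onto dst intervals,
    weighting each source cell by its overlap with the target cell."""
    out = np.zeros(len(dst_bounds) - 1)
    for i in range(len(dst_bounds) - 1):
        lo, hi = dst_bounds[i], dst_bounds[i + 1]
        # overlap of every source interval with target interval i
        overlap = (np.minimum(src_bounds[1:], hi)
                   - np.maximum(src_bounds[:-1], lo)).clip(min=0.0)
        out[i] = np.sum(overlap * values) / np.sum(overlap)
    return out


# remap four daily cells (bounds 0..4) onto two 2-day means
src = np.arange(5.0)
vals = np.array([1.0, 3.0, 5.0, 7.0])
print(conservative_remap(src, np.array([0.0, 2.0, 4.0]), vals))
```

Because the weighting uses cell bounds rather than point coordinates, the result is conservative even when the source and target intervals are misaligned, which is the property a time-boundary-aware resampler needs.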

For the time being, my main question is: where (in xarray) would something like this reside?

Note:

I am happy to open a separate issue to discuss the merits of having this functionality in xarray.

Cc @maboualidev