pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

support for units #525

Open p3trus opened 9 years ago

p3trus commented 9 years ago

One thing that bothers me a lot is that pandas lacks good unit support (e.g. quantities, pint, ...).

Is there a chance that xray will support it?

jthielen commented 5 years ago

Oh, okay, having the fallback like that was how I thought about implementing it. (I'm sorry that I didn't describe that in my initial comment.)

So would the way forward be to implement DataArray.units_convert()/DataArray.units_to() and DataArray.units as you described right now, but wait for and/or delegate IO integration? Or, should there also be a fully-backwards-compatible IO integration implemented now (such as an optional kwarg on open_dataset and to_netcdf)?

shoyer commented 5 years ago

That said, I agree that the dedicated accessor is a pretty good interface, especially if you want more methods/properties than units and to. Even then, array.pint.units and array.pint.to() are pretty readable. This could make sense if we are concerned about different interfaces for different unit libraries.

The accessor route is also definitely more conservative than putting first-class unit support in xarray proper, which I like.

As for where to put it, that's sort of up to you. I think it's probably going to get complicated enough that it should go into a library that can be installed, rather than being a boilerplate example in xarray's docs. It could live in xarray if it's going to be very minimal (e.g., only one method + one property), but if you want more than that -- and judging by pint-pandas I would guess you do -- then it should probably go into pint or a dedicated package, e.g., pint-xarray.

I would not write a general purpose xarray+units library unless you're particularly excited about doing that for some reason -- it's much better to start with a particular integration that works well than to make a half-hearted effort at something general purpose without well-motivated use-cases.
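For concreteness, an accessor along those lines could be registered through xarray's documented extension mechanism. The following is only a rough sketch, not the actual pint-xarray code; the accessor name "pint" and the method names are placeholders, and it assumes the DataArray already wraps a pint Quantity:

import pint
import xarray as xr

ureg = pint.UnitRegistry()

@xr.register_dataarray_accessor("pint")
class PintAccessor:
    def __init__(self, da):
        self._da = da

    @property
    def units(self):
        # assumes the DataArray's data is already a pint Quantity
        return self._da.data.units

    def to(self, units):
        # convert the wrapped Quantity and return a new DataArray
        return self._da.copy(data=self._da.data.to(units))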

jthielen commented 5 years ago

@shoyer I agree, the accessor interface makes a lot of sense for this: it's more conservative on the xarray side, while also giving the most flexibility for the pint + xarray integration.

Based on your feedback and what I'd hope to see out of the pint + xarray integration, I'm thinking a pint-adjacent package like pint-xarray may be the best route forward. I'll create an issue on pint to inquire about that possibility. See https://github.com/hgrecco/pint/issues/849.

shoyer commented 5 years ago

The other virtue of the separate package is a faster release cycle. We can (and should!) still put a full example in the xarray docs.

For IO integration, I think the simplest choice would be to write utility functions that convert unit-free arrays with a string "units" entry in attrs into pint arrays (and back). That will compose easily with xarray's existing IO functionality.
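A minimal sketch of what such utility functions could look like; the names quantify/dequantify are placeholders rather than an agreed-upon API, and only the simple attrs["units"] convention described above is handled:

import pint

ureg = pint.UnitRegistry()

def quantify(da):
    # wrap the DataArray's data in a pint Quantity based on attrs["units"]
    units = da.attrs.get("units")
    if units is None:
        return da
    new = da.copy(data=ureg.Quantity(da.data, units))
    new.attrs.pop("units", None)
    return new

def dequantify(da):
    # move the units back into attrs so the result can be written to netCDF
    if not isinstance(da.data, ureg.Quantity):
        return da
    new = da.copy(data=da.data.magnitude)
    new.attrs["units"] = str(da.data.units)
    return new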

jthielen commented 5 years ago

Based on the points raised by @crusaderky in https://github.com/hgrecco/pint/issues/878#issue-492678605 about how much special-case handling xarray has for dask arrays, I was thinking recently about what it might take for the xarray > pint > dask.array wrapping discussed here and elsewhere to work as fluidly as xarray > dask.array currently does.

Would it help for this integration to have pint Quantities implement the dask custom collections interface when they wrap a dask array? I would think that this would allow a pint Quantity to behave in a "dask-array-like" way rather than just an "array-like" way. Then, instead of xarray checking for isinstance(dask_array_type), could it check for "duck dask arrays" (e.g., those with both __array_function__ and __dask_graph__)? There are almost certainly some subtle implementation details that would need to be worked out, but I'm guessing that this could take care of the bulk of the integration.

Also, if I'm incorrect with this line of thought, or there is a better way forward for implementing this wrapping pattern, please do let me know!
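For illustration, the check described above could look something like the following; the helper name is made up, and the real xarray code may differ:

def is_duck_dask_array(x):
    # "duck dask arrays": objects that behave like arrays (NEP-18) and also
    # expose a dask graph, e.g. a pint Quantity wrapping a dask array
    return hasattr(x, "__array_function__") and hasattr(x, "__dask_graph__")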

shoyer commented 5 years ago

Would it help for this integration to have pint Quantities implement the dask custom collections interface when they wrap a dask array? ... Then, instead of xarray checking for isinstance(dask_array_type), could it check for "duck dask arrays" (e.g., those with both __array_function__ and __dask_graph__)?

Yes, great idea!

ngoldbaum commented 4 years ago

Hi all, I was just pointed at this by someone who went to the NumFOCUS summit. I'm the lead developer for unyt, which I see has come up a little bit. If anyone wants to chat with me about adding support for unyt along with Pint I'd be happy to discuss.

keewis commented 4 years ago

sorry for the late reply, @ngoldbaum. For xarray to support unyt, unyt would have to implement NEP-18 (i.e. __array_function__), which I don't think it does yet? I would expect unyt support to come for free once pint support works, so I would focus on that first (see #3594 for the current progress). Extending the tests to also cover unyt would be possible, but I'm thinking it would be better to put that into the accessor package (as discussed above for a possible pint accessor)?

There were a few unresolved design questions about how unit libraries should implement certain numpy functions (e.g. how where should behave when receiving array(nan) and array(0), which xarray uses to implement nanops, or which unit the result of full_like should have): see hgrecco/pint#905. Any supported unit library would have to behave the same way, so I think it would be good to coordinate that. Thoughts?
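For context, NEP-18 support boils down to the wrapping class defining __array_function__. A toy illustration of the protocol, not how unyt or pint actually implement it, might look like this; a real unit library would dispatch to unit-aware implementations of each numpy function rather than stripping units:

import numpy as np

class UnitArray:
    def __init__(self, data, units):
        self.data = np.asarray(data)
        self.units = units

    def __array_function__(self, func, types, args, kwargs):
        # strip the units and delegate; e.g. np.mean(UnitArray([1.0, 2.0], "m"))
        # ends up calling np.mean on the plain ndarray
        stripped = [a.data if isinstance(a, UnitArray) else a for a in args]
        return func(*stripped, **kwargs)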

StanczakDominik commented 3 years ago

Hi! I'm just popping in as a very interested user of both xarray and unit packages to ask: since there's been some awesome progress made here and pint-xarray is now enough of A Thing to have documentation (though obviously experimental), how much work would you expect a corresponding package for astropy's Quantities to take, given the current state of things?

Are there any limitations that would prevent that? I saw the discussion above about Quantities being more problematic due to taking the subclass-from-numpy-arrays route, but I'm not sure how much of a roadblock that still is. I would suspect the API could be shared with pint-xarray (which, obviously, is experimental for now).

keewis commented 3 years ago

I saw the discussion above about Quantities being more problematic

I would expect astropy quantities to work just fine as long as they are duck arrays. Also, the only limitation I would expect them to have is that they can't "wrap" anything other than numpy arrays. dask arrays are an exception since they can wrap quantities using the _meta attribute (which is something we try to avoid with pint, though).

For reference, the currently remaining issues for all duck arrays (except obviously dask) are documented here

keewis commented 3 years ago

I would expect astropy quantities to work just fine as long as they are duck arrays

actually, that turns out to be wrong. Since isinstance(data, np.ndarray) returns True for astropy.units.Quantity, it is cast to ndarray using np.asarray: https://github.com/pydata/xarray/blob/7dbbdcafed7f796ab77039ff797bcd31d9185903/xarray/core/variable.py#L231-L245

Adding ~~or issubclass(type(data), np.ndarray)~~ or type(data) != np.ndarray does allow wrapping an astropy.units quantity in Dataset / DataArray objects ~~but it breaks a few tests~~. Also, unless we modify the testsuite in xarray/tests/test_units.py to run with astropy.units instead of pint, I can't really tell which features of xarray strip the units (in addition to the ones documented for pint). For that, we would probably need to somehow generalize the tests for duck arrays.
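In other words, the idea is to only call np.asarray on plain ndarrays and keep ndarray subclasses intact. A simplified sketch of that check, not the actual code in variable.py:

import numpy as np

def maybe_convert(data):
    if isinstance(data, np.ndarray) and type(data) is not np.ndarray:
        # ndarray subclass such as astropy.units.Quantity: keep it as-is
        return data
    return np.asarray(data)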

riley-brady commented 2 years ago

I believe this can be closed via https://github.com/pydata/xarray/pull/5734 @andersy005?

tien-vo commented 1 week ago

https://github.com/pydata/xarray/blob/5fd50c501f4ff03067b5a87429033597e7efa02e/xarray/core/variable.py#L294-L331

@keewis it looks like adding an if block before line 323 would allow astropy quantities to be wrapped as you said, i.e.,

# assumes `import astropy.units as u`; `cast` comes from typing and
# "T_DuckArray" is xarray's duck-array type alias
if isinstance(data, u.Quantity):
    return cast("T_DuckArray", data)

What else is needed to make this happen? I'd love to make some progress towards xarray wrapping astropy.units. Should #7799 be picked up again, since @dstansby can't work on it anymore?

keewis commented 1 week ago

We've since made some progress with the automatic generation of tests for arbitrary duck arrays in xarray-array-testing, so it might be better to work on that instead of trying to generalize the massive (and probably stale) testsuite for testing unit support with pint.

That said, that's for testing the support, not the support itself. Implementing it should be fairly easy; we just need to make sure to keep banning numpy.matrix and numpy.ma.masked_array (I think those were the ones we explicitly wanted to convert to numpy.ndarray in #2956). Either way, feel free to open a dedicated issue to track astropy.units support.

Edit: actually, numpy.ma.MaskedArray is special-cased immediately above the block for duck arrays, so I believe the only thing we have to make sure of is that we don't pass numpy.matrix instances through.
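Putting the pieces together, the conversion behaviour being described would look roughly like this simplified sketch; it is not the actual code in variable.py, and the function name is made up:

import numpy as np

def as_compatible_data_sketch(data):
    if isinstance(data, np.ma.MaskedArray):
        # masked arrays are special-cased: fill masked values with NaN
        return np.ma.filled(data.astype(float), np.nan)
    if isinstance(data, np.matrix):
        # np.matrix does not behave like an ndarray, so force a plain array
        return np.asarray(data)
    if hasattr(data, "__array_function__") and type(data) is not np.ndarray:
        # duck arrays (pint, astropy.units, dask, ...) are passed through
        return data
    return np.asarray(data)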