pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.61k stars 1.08k forks source link

NamedArray tracking issue #8238

Open dcherian opened 1 year ago

dcherian commented 1 year ago

@andersy005 I think it would be good to keep a running list of NamedArray tasks. I'll start with a rough sketch, please update/edit as you like.

xref #3981

headtr1ck commented 1 year ago

For the operators we will run into the following issue:

Currently Variable + DataArray will return a DataArray because Variable will return NotImplemented and then python calls DataArray.__radd__. This is hard coded.

If you want to make NamedArray independent from xarray, you need to find a way to solve this. I assume numpy has the same issues, maybe check over there how it works?

dcherian commented 1 year ago

@headtr1ck or @Illviljan would one of you be able to handle moving over unary and binary ops and maybe even nanops. I think that might involve a bunch of typing wrangling that you two would be best suited to fix

headtr1ck commented 1 year ago

I can have a look at it :) By the way: do we want to get rid of the fastpath argument to the __init__ and replace it with a custom classmethod? I think it is confusing to have this in a public interface.

Illviljan commented 1 year ago

How much should NamedArray conform to the array api standard?

For example, the array api does not implement the methods, .imag, .real, .astype. https://data-apis.org/array-api/latest/API_specification/index.html

I'm leaning towards being strict with the array api and to remove them because they are causing some truly difficult typing issues (#8281, #8294). If all these shape/dim/dtype-changing methods were moved to functions instead the type hinting would be much more straightforward.

dcherian commented 1 year ago

@TomNicholas do you have time to move over parallelcompat.py to namedarray/? It'll be a lot quicker if you did it ;)

dcherian commented 1 year ago

How much should NamedArray conform to the array api standard?

The API is certainly going to be a superset of the array API (that doesn't even have NaN skipping reductions!)

I'm leaning towards being strict with the array api and to remove them because they are causing some truly difficult typing issues

I don't think typing should limit the API we offer.

For example, the array api does not implement the methods, .imag, .real, .astype. data-apis.org/array-api/latest/API_specification/index.html

Can we add these back please? As of now, our plan is to include them. https://github.com/pydata/xarray/blob/main/design_notes/named_array_design_doc.md#attributes-to-be-preserved-from-xarrayvariable

TomNicholas commented 1 year ago

The API is certainly going to be a superset of the array API (that doesn't even have NaN skipping reductions!)

We should also remember that the array API is not fixed in stone - it can and already has changed to include more functions. We should be raising issues there to get functions we want added. See https://github.com/data-apis/array-api/issues/187#issuecomment-1553615779 and https://github.com/data-apis/array-api/issues/669 for example.

I don't think typing should limit the API we offer.

+1

TomNicholas do you have time to move over parallelcompat.py to namedarray/? It'll be a lot quicker if you did it ;)

I'll have a look now!

headtr1ck commented 1 year ago

We should slow down with the PRs, haha. The merge conflicts will be a nightmare

kgryte commented 1 year ago

Hey all! Don't want to hijack this thread, but wanted to respond to a few of the comments above regarding the Array API.

For example, the array api does not implement the methods, .imag, .real, .astype. data-apis.org/array-api/latest/API_specification/index.html

@Illviljan In the Array API standard, we've preferred functional APIs over methods and have the equivalent of imag(), real(), and astype().

The API is certainly going to be a superset of the array API (that doesn't even have NaN skipping reductions!)

@dcherian There has been some discussion about adding support for NaN reductions to the specification (see https://github.com/data-apis/array-api/issues/621); however, there is still some debate as to what form that should take. E.g., whether we actually want nan* variants, a single NaN reduction wrapper, or kwarg support to existing reductions. You're encouraged to participate in that thread to help provide input regarding what makes the most sense from your POV.

We should also remember that the array API is not fixed in stone - it can and already has changed to include more functions. We should be raising issues there to get functions we want added.

@TomNicholas Indeed! May be good to list all the current Array API pain points (if not done already), so that we can potentially prioritize. We could also potentially arrange for xarray devs to present at one of the Consortium workgroup meetings.

Regardless, thanks for considering the Array API standard. We'd love to get additional feedback on the specification over on the specification repo. :)

maxrjones commented 11 months ago

@andersy005 will there be a function similar to as_compatible_data in NamedArray? I'm asking because this affects what tests would be expected to pass/fail in NamedArray vs. Variable (xref https://github.com/pydata/xarray/issues/8370). For example, as_compatible_data currently converts lists to numpy arrays such that https://github.com/pydata/xarray/blob/72abfdf9bd6aef99c0397f6dadc383c28abc6ce0/xarray/tests/test_namedarray.py#L73-L75 raises an AttributeError for NamedArray but not Variable. It would be helpful to better understand whether this will always be the case.

andersy005 commented 11 months ago

@andersy005 will there be a function similar to as_compatible_data in NamedArray? I'm asking because this affects what tests would be expected to pass/fail in NamedArray vs. Variable (xref #8370). For example, as_compatible_data currently converts lists to numpy arrays such that

@maxrjones, currently, the constructor of NamedArray.__init__() strictly requires passing an array-like object but we have had discussions regarding reintroducing the as_compatible_data function in the NamedArray.__init__() method, along with the NamedArray.from_array() method. .from_array() method would allow for skipping certain compatibility checks and normalization steps performed in the as_compatible_data() function: https://github.com/pydata/xarray/pull/8294#discussion_r1365843106.

maxrjones commented 11 months ago

@andersy005 will there be a function similar to as_compatible_data in NamedArray? I'm asking because this affects what tests would be expected to pass/fail in NamedArray vs. Variable (xref #8370). For example, as_compatible_data currently converts lists to numpy arrays such that

@maxrjones, currently, the constructor of NamedArray.__init__() strictly requires passing an array-like object but we have had discussions regarding reintroducing the as_compatible_data function in the NamedArray.__init__() method, along with the NamedArray.from_array() method. .from_array() method would allow for skipping certain compatibility checks and normalization steps performed in the as_compatible_data() function: #8294 (comment).

This is helpful, thank you! I will proceed by marking these as xfail for now, so it would be easy to test in NamedArray if added in the future.

maxrjones commented 1 month ago

@andersy005 @scottyhq @dcherian I'm at the NumFOCUS project summit and recently spoke with a masters student in physical oceanography who chose to use NumPy only rather than Xarray for her code because of the performance overhead for xarray. There may be computational patterns that would make Xarray workable, but this also could be a good use case to test out NamedArray on a real-work science case when it comes time to demonstrate the utility of NamedArray.

andersy005 commented 1 month ago

thank you for the update, @maxrjones! @dcherian and I have discussed putting together a write-up/post that provides a clear overview of the current state of NamedArray, and highlighting the next steps needed to bring it to completion. In the coming weeks, I’m going to take a stab at this post.

It would be extremely useful to connect with users, especially those interested in using NamedArray features outside of Xarray, which could shape the next phases of development.

TomNicholas commented 1 month ago

recently spoke with a masters student in physical oceanography who chose to use NumPy only rather than Xarray for her code because of the performance overhead for xarray.

FYI after speaking with this student I think that student's troubles were really a classic case of difficulty configuring dask for a larger-than-memory workload, nothing to do with overhead of xarray or NamedArray.

a write-up/post that provides a clear overview of the current state of NamedArray, and highlighting the next steps needed to bring it to completio

We do need something like this. I've had multiple conversations at this summit with people who are interested in named-array-like features, but I'm not sure what to tell them in terms of readiness. (see for example https://github.com/pydims/pydims)

TomNicholas commented 1 month ago

Also I don't know where the correct issue for this is but there is a strong case made here that a minimal implementation like NamedArray should not carry around arbitrary attrs.