pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.55k stars 1.06k forks source link

Provide protocols for creating structural subtypes of DataArray/Dataset #6462

Open rsokl opened 2 years ago

rsokl commented 2 years ago

Is your feature request related to a problem?

I frequently find myself wanting to annotate functions in terms of xarray objects that adhere to a particular schema. Given that a dataset's adherence to a schema is a matter of its structure/contents, it is unnatural to try to describe a schema as a subtype of xr.Dataset (or DataArray) (i.e. a type-checker ought not care that a dataset is an instance of a specific subclass of Dataset).

Describe the solution you'd like

Instead, it would be ideal to define a schema as a Protocol (structural subtype) of xr.Dataset. Unfortunately, one cannot subclass a normal class to create a protocol.

Thus, I am proposing that xarray provide Protocol-based descriptions of DataArray and Dataset so that users can describe schemas as structural subtypes of these classes. E.g.

from typing import Protocol

from xarray import DataArray
from xarray.typing import DatasetProtocol

class ClimateData(DatasetProtocol, Protocol):
    lat: DataArray
    lon: DataArray
    temp: DataArray
    precip: DataArray

def process_climate_data(ds: ClimateData):
    ds.banana  # type checker flags as unknown attribute
    ds.temp  # type checker sees "DataArray" (as informed by ClimateData)
    ds.sel(lat=1.0)  # type checker sees `Dataset` (as informed by `DatasetProtocol`)

The contents of DatasetProtocol would essentially look like a modified type stub for xarray.Dataset so the implementation details are relatively simple, I believe.

Describe alternatives you've considered

Creating a strict subtype of Dataset is not ideal for a few reasons:

  1. Static type checkers would then expect to see that datasets must derive from that particular subclass, which is generally not the case.
  2. The annotations / design of xarray.Dataset is too broad for describing a schema. E.g. the presence of __getattr__ prevents type checkers from flagging access to non-existent data variables and coordinates during static analysis. DatasetProtocol would need to be designed to be less permissive than this.

Additional context

Hopefully this could be leveraged by the likes of xarray-schema so that xarray schemas can be used to provide both runtime and static validation capabilities.

I'd love to get feedback on this, and would be happy to open a PR if xarray devs are willing to weigh in on the design of these protocols.

paiforsyth commented 2 years ago

I can't comment on the implementation, but this is certainly a feature I would appreciate. I have been using xarray a lot recently, and have found myself frequently writing functions that map Datasets to Datasets. The ability to encode the expected schemas in type annotations would be very valuable for correctness and documentation.

max-sixty commented 2 years ago

Sorry not to respond for a few days. I think this is a good idea if there's demand for it. It would require a small amount of maintenance as the dataset implementation changes, but very manageable.

FWIW I frequently find myself having a ds being passing around in an app, and it would be awesome to be able to define and statically check the data variables.

(We've discussed also being able to define and statically check the dimensions, that's a separate issue though)

chrissype commented 1 year ago

I'm considering leveraging the power of xarray to greatly simplify a codebase that has its own types that essentially implement a very poor version of xarray's functionality.

However to be able to justify integrating it into a large codebase with multiple developers, type hints for linting, autocomplete, and (possibly) static type checking are completely non-optional.

Adding this functionality to xarray would make it a shoo-in, and I believe the approach suggested by @rsokl is probably the best.

lukasbindreiter commented 11 months ago

Does this package fulfill this requirement?

https://pypi.org/project/xarray-dataclasses/

rsokl commented 11 months ago

Does this package fulfill this requirement?

This looked promising at a glance, but I think they have gotten the design wrong in very bad ways. Consider the following:

from dataclasses import dataclass
from xarray_dataclasses import AsDataArra

@dataclass
class Image(AsDataArray):
    """2D image as DataArray."""
    ...

def process_image(x: Image):
    ...

If you use Image's array-creation methods, you do not get an Image back in the eyes of a static type checker:

img = Image.ones((3, 3))

process_image(img)  # pyright error: "DataArray" is incompatible with "Image"

"No problem, I'll just assert isinstance(img, Image) to narrow the type and then pass it to my function 😄 "

>>> isinstance(img, Image)
False

WHAT!?

So... we have this nice, structured Image type, but we cannot use the type for static type checking or runtime type checking.

Also the very first example in their readme doesn't type check: Image.new([[0, 1], [2, 3]], x=[0, 1], y=[0, 1]) # type "list[list[int]]" cannot be assigned to parameter "data" of type "Data[tuple[X, Y], float]".

Ultimately AsDataArray needs its array-creation methods to return instances of itself (e.g. Image.new(...) should return an instance of Image), and it needs to be a subtype of xarray.DataArray so that these instances can be passed to generic xarray interfaces. xarray-datasets seems like a nice approach to designing these interfaces, but they need to think a lot more carefully about how they can facilitate workflows that rely on type checking.