pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Categorical Array #8463

Closed ilan-gold closed 7 months ago

ilan-gold commented 1 year ago

Is your feature request related to a problem?

We are looking to improve compatibility between AnnData and xarray (see https://github.com/scverse/anndata/issues/744), so categoricals are naturally on our roadmap. Thus, I think some sort of standard-use categorical array would be desirable. It seems something similar has come up with netCDF, although my knowledge there is limited, so this issue may be more distinct than I am aware. What comes of this issue may kill two birds with one stone, or it may work towards some common solution that at least helps both use-cases (AnnData and netCDF ENUM).

Describe the solution you'd like

The goal would be an xarray container of some sort for standard-use categorical data. I'm not sure what form this should take.

We have something functional here that inherits from ExplicitlyIndexedNDArrayMixin and returns pandas.CategoricalDtype. So let's say this implementation would be at least a conceptual starting point to work from (it also seems not dissimilar to what is done here for new CF types).

Some issues:

  1. I have no idea what a standard "return type" for an xarray categorical array should be (i.e., numpy with the categories applied, pandas, something custom, etc.). So I'm not sure if using pandas.CategoricalDtype is acceptable, as I do in the linked implementation. Relatedly...
  2. I don't think using pandas.CategoricalDtype really helps with the already-existing CF Enum need if you want the return type to be some sort of numpy array (although again, I'm not sure about the return type). As I understand it, though, the whole point of categoricals is to use integers as the base type and only expose "strings" outwardly (printing, the API for equality operations, accessors, etc.) while the internals remain integer-based. So I'm not really sure numpy is even an option here. Maybe we roll our own solution?
  3. I am not sure this is the right level at which to implement this (maybe it should be a Variable? I don't think so, but I am just a beginner here 😄 )
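The code/label split described in point 2 can be seen directly in pandas, where a Categorical stores small integer codes alongside a categories index and only materializes labels outwardly:

```python
import pandas as pd

# pandas stores a categorical as integer codes plus a categories index;
# the string labels are only produced for display, comparisons, etc.
cat = pd.Categorical(['a', 'b', 'a', 'b', 'c'])
print(cat.codes.tolist())    # [0, 1, 0, 1, 2]
print(list(cat.categories))  # ['a', 'b', 'c']
```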

It seems you may want, in addition to the array container, some sort of i/o functionality for this feature (so maybe some on-disk specification?).

Describe alternatives you've considered

I think there is some route via VariableCoder as hinted here i.e., using encode/decode. This would probably be more general purpose as we could encode directly to other data types if using pandas is not desirable. Maybe this would be a way to support both netCDF and returning a pandas.CategoricalDtype (again, not sure what the netCDF return type should be for ENUM).
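To make the encode/decode idea concrete, here is a standalone sketch (not xarray's actual VariableCoder API; the "enum" attrs key is illustrative): decoding maps integer codes to labels using a mapping stored in attrs, and encoding inverts it:

```python
import numpy as np

def decode(codes, attrs):
    # invert the label -> code mapping and look up each code
    enum = attrs["enum"]                      # e.g. {"clear": 0, "cloudy": 1}
    inverse = {v: k for k, v in enum.items()}
    return np.array([inverse[c] for c in codes])

def encode(labels, attrs):
    # map each label back to its integer code
    enum = attrs["enum"]
    return np.array([enum[l] for l in labels], dtype=np.int64)

attrs = {"enum": {"clear": 0, "cloudy": 1}}
codes = np.array([1, 0, 1])
labels = decode(codes, attrs)
print(labels.tolist())                        # ['cloudy', 'clear', 'cloudy']
print(encode(labels, attrs).tolist())         # [1, 0, 1]
```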

Additional context

So just for reference, the current behavior of to_xarray with pandas.CategoricalDtype is to produce a numpy array of object dtype:

import pandas as pd
df = pd.DataFrame({'cat': ['a', 'b', 'a', 'b', 'c']})
df['cat'] = df['cat'].astype('category')
df.to_xarray()['cat']
# <xarray.DataArray 'cat' (index: 5)>
# array(['a', 'b', 'a', 'b', 'c'], dtype=object)
# Coordinates:
#   * index    (index) int64 0 1 2 3 4

And as stated in the netCDF issue, for that use-case, the information about ENUM is lost (from what I can read).

Apologies if I'm missing something here! Feedback welcome! Sorry if this is a bit chaotic, just trying to cover my bases.

welcome[bot] commented 1 year ago

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

TomNicholas commented 11 months ago

@pydata/xarray Ilan's question seems interesting and potentially impactful if we can support his use case, but I don't know enough about categorical arrays to weigh in. Can anyone provide guidance?

rabernat commented 11 months ago

It would be great to reuse the Arrow Dictionary concept here somehow.

ilan-gold commented 11 months ago

@rabernat / @TomNicholas thanks for the replies :) @rabernat What concept here are you referring to?

dcherian commented 11 months ago

Alternatively, could you write an array-api compliant container instead? This would wrap a (potentially lazy) numpy-like array of "codes" with dtype=np.intXX and a CategoricalDtype to preserve the mapping from codes to category labels.

We could "decode" netcdf Enum dtypes to this new container and also allow converting from CF-style "Flag Variables".
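A minimal sketch of that shape (names are illustrative, not actual xarray API): the container holds an integer codes array plus the code-to-label mapping, and decodes to labels only on indexing:

```python
import numpy as np

class CategoricalArray:
    """Hypothetical wrapper: integer codes + mapping, decoded lazily.

    Assumes codes are dense, i.e. 0..n-1 matching the mapping's keys.
    """

    def __init__(self, codes, categories):
        self.codes = np.asarray(codes)            # e.g. dtype=np.int8
        self.categories = categories              # {0: "clear", 1: "cloudy"}
        self._labels = np.array([categories[k] for k in sorted(categories)])

    @property
    def shape(self):
        return self.codes.shape

    @property
    def dtype(self):
        return self._labels.dtype

    def __getitem__(self, key):
        # decode codes to labels only when indexed
        return self._labels[self.codes[key]]

arr = CategoricalArray([1, 0, 1], {0: "clear", 1: "cloudy"})
print(arr[0])            # cloudy
print(arr[:].tolist())   # ['cloudy', 'clear', 'cloudy']
```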

benbovy commented 11 months ago

Supporting both netCDF enum and pandas.Categorical seems sensible to me, although I don't know if it will be easy to do so via a single array wrapper.

On a related note, I'm wondering if we shouldn't start thinking about how to generalize the approach based on ExplicitlyIndexedNDArrayMixin that @ilan-gold describes above:

  1. maybe refactor the indexing classes in xarray.core.indexing and expose (some of) them as public API such that these could be subclassed in 3rd-party libraries (related to #5081).
  2. provide built-in helpers to facilitate and uniformize integration of any pandas.ExtensionArray and/or pyarrow.ExtensionArray with Xarray.

I guess we will need to address this if we want to keep Xarray up to date with Pandas switching backends from Numpy to Arrow? The question about how to work with Arrow arrays in Xarray has also been asked multiple times (AFAIK in Xarray geospatial extensions).

ilan-gold commented 11 months ago

Alternatively, could you write an array-api compliant container instead? This would wrap a (potentially lazy) numpy-like array of "codes" with dtype=np.intXX and a CategoricalDtype to preserve the mapping from codes to category labels.

We could "decode" netcdf Enum dtypes to this new container and also allow converting from CF-style "Flag Variables".

Yes, I will try to do that. If I run into any road bumps, I will post here.

On a related note, I'm wondering if we shouldn't start thinking about how to generalize the approach based on ExplicitlyIndexedNDArrayMixin that @ilan-gold describe above:

I do think this would be good. I think what I have makes sense, but making it more explicit/public to end-users would be cool.

ilan-gold commented 11 months ago

This is not the place for this, but it is related: how do I encode the necessary information for encoding/decoding? From what I understand, this belongs in attrs, but that seems to be reserved for the returned value of ncattrs.

In the docs, it seems that attrs is the place that decode should read from, and then move to encoding: "Variable attributes no more applicable after the decoding, are dropped and stored in the Variable.encoding to make them available to the encode method, which performs the inverse transformation."

ilan-gold commented 11 months ago

Relatedly, how strictly should the information in attrs or encoding follow the netCDF convention for encoding the enum type information? netCDF stores this enum mapping as a dictionary both on-disk and in-memory. But one thing I think about, for example in AnnData, is that your categories might be stored in a zarr array (i.e., an array of strings where the mapping values are just the positional indices) and you don't want to have to read that in on load. So a dict for the enum type would not be ideal, and we would want something like:

categories = zarr.open_array('unique_categories')
encoding = {
    ...
    'enum_dict': {
        'from': list(range(len(categories))),
        'to': categories,
    },
}

instead of

encoding = {
    ...
    'enum_dict': dict(zip(categories, range(len(categories))))
}

which is what netCDF has (i.e., an on-disk dictionary instead of a stored array) and would force the user to read in all the categories data as well as the codes data. But this might be too big of an ask.

ilan-gold commented 10 months ago

So I have started working on a branch, but have come across a bit of an issue. When the Dataset is created, it loses any reference to the CategoricalArray. This is a problem for serialization, because you need a reference to the underlying integer (or whatever dtype) arrays to write out. For example, xr_ds here

import netCDF4 as nc
import xarray as xr
import numpy as np

ds = nc.Dataset("mre.nc", "w", format="NETCDF4")
cloud_type_enum = ds.createEnumType(np.int64, "cloud_type", {"clear": 0, "cloudy": 1})
# {'cloud_type': <class 'netCDF4._netCDF4.EnumType'>: name = 'cloud_type', numpy dtype = int64, fields/values ={'clear': 0, 'cloudy': 1}}
ds.createDimension("time", size=10)
x = np.arange(10)
ds.createVariable("x", np.int32, dimensions=("time",))
ds.variables["x"][:] = x
ds.createVariable("cloud", cloud_type_enum, dimensions=("time",))
ds["cloud"][:] = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
ds.close()

# -- Open dataset with xarray
xr_ds = xr.open_dataset("./mre.nc")

on my branch lacks any reference to CategoricalArray (which is similar to, e.g., BoolTypeArray), even though the Dataset was initiated with it.

This was quite surprising, actually: there were four or more reconstructions of the Variable object containing the codes, and each time the data got more complicated (although it still contained some reference to CategoricalArray), until the final Variable completely lacked a way to recover the CategoricalArray. For example, calling xr.open_dataset("./mre.nc") constructs Variables holding CategoricalArray(codes=..., categories={0: 'clear', 1: 'cloudy'}, ordered=False), then LazilyIndexedArray(array=CategoricalArray(codes=..., categories={0: 'clear', 1: 'cloudy'}, ordered=False), key=BasicIndexer((slice(None, None, None),))) three more times, but xr_ds['cloud'].variable._data ends up as MemoryCachedArray(array=NumpyIndexingAdapter(array=array(['cloudy', 'clear', 'cloudy', 'clear', 'cloudy', 'clear', 'cloudy', 'clear', 'clear', 'cloudy'], dtype='<U6'))). This lacks any reference to the original data.

In terms of investigating, there seems to be something going on where backend_ds in this line has a reference to CategoricalArray via backend_ds['cloud'].variable._data (which isn't amazing, but it's something), but then loses it somehow once you step into the next function, _dataset_from_backend_dataset!


TL;DR: I will keep digging, but if anyone has quick suggestions for how to keep a reference to the underlying CategoricalArray object when constructing a Dataset for netCDF datasets, let me know. In writing this comment, I discovered that this is not a general problem, i.e., the following works well:

import numpy as np
import xarray as xr

codes = np.array([0, 1, 2, 1, 0])
categories = {0: 'foo', 1: 'jazz', 2: 'bar'}
cat_arr = xr.coding.variables.CategoricalArray(codes=codes, categories=categories)
v = xr.Variable(("time",), cat_arr, fastpath=True)
ds = xr.Dataset({'cloud': v})
ds['cloud']._variable._data
# CategoricalArray(codes=..., categories={0: 'foo', 1: 'jazz', 2: 'bar'}, ordered=False)

So perhaps it would make sense to release this as a general feature first instead of focusing on netCDF. Perhaps that is a good place to start for a PR.

kmuehlbauer commented 10 months ago

@ilan-gold Sorry, I just saw this popping up in my inbox. There is work to read/write netCDF4.EnumType in #8147. This PR is close to being finished.

The proposed solution is to add metadata to `encoding["dtype"]`:

if isinstance(var.datatype, netCDF4.EnumType):
    encoding["dtype"] = np.dtype(
        data.dtype,
        metadata={
            "enum": var.datatype.enum_dict,
            "enum_name": var.datatype.name,
        },
    )

This follows what h5py is doing, besides adding the enum_name. Could this be used here?
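For reference, metadata attached this way survives on the dtype and can be read back later (a quick sketch; the enum mapping here is illustrative):

```python
import numpy as np

# Attach enum information as dtype metadata, as in the snippet above,
# then read it back via dtype.metadata.
dt = np.dtype(
    np.uint8,
    metadata={"enum": {"clear": 0, "cloudy": 1}, "enum_name": "cloud_type"},
)
print(dict(dt.metadata["enum"]))  # {'clear': 0, 'cloudy': 1}
print(dt == np.uint8)             # True: metadata does not affect dtype equality
```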

dcherian commented 10 months ago

For background, these various arrays will all get lost on data load, which I think is what you're discovering.

I would instead work on a CategoricalArray that follows the array API. Once that works, we can discuss lazy loading and decoding to that array type. Potentially, this could be built as a numpy custom dtype, but I haven't followed the progress there.

ilan-gold commented 10 months ago

@dcherian That sounds like a plan to me. I will check that out.

ilan-gold commented 10 months ago

@kmuehlbauer Thanks for this, that is good to see.

ilan-gold commented 10 months ago

So I have spent some time looking into this. @kmuehlbauer Thanks for directing me to that PR, not sure how I missed it. This PR is interesting and contains some good stuff. It's definitely a step in the right direction.

@dcherian I don't think an array-api compliant categorical container is possible unless we implement array methods (e.g., xp.exp) directly on the codes, which seems undesirable. If we're not going to do that, creating an array-api compliant container for categoricals reduces to the problem of creating an array-api container for strings, since that is what __getitem__ on a categorical container should return. Creating an array-api compliant container for strings, though, seems out of scope (the thread below this comment should explain more).

Separately, it still seems strange to me that one can write code that declares a Variable object during the I/O process and then lose that object, or any reference to it or its constituent parts, by the end of the process. That being said, as I noted, the following works, in that using ds doesn't seem to error out in any horrifying way under normal usage:

import numpy as np
import xarray as xr

codes = np.array([0, 1, 2, 1, 0])
categories = {0: 'foo', 1: 'jazz', 2: 'bar'}
cat_arr = xr.coding.variables.CategoricalArray(codes=codes, categories=categories)
v = xr.Variable(("time",), cat_arr)
ds = xr.Dataset({'cloud': v})

# Some simple relevant operations, there are probably others but these are what I came up with quickly to test
ds['cloud']._variable._data # This is a CategoricalArray
ds.where(ds.cloud == 1) # all nans as expected even though this is the correct code
ds.where(ds.cloud == "jazz") # correct masking
ds.where(ds.cloud.isin(["jazz", "bar"])) # more correct masking, but more complicated
ds.groupby("cloud") # three groups as expected

The current status of CategoricalArray is that it inherits from ExplicitlyIndexedNDArrayMixin, which means that everything is kosher staying within xarray and normal development patterns (from what I can tell).

Would a PR that tests this array and its behavior with Variable/Dataset be acceptable? We can punt on i/o. This could give people the opportunity to work with the object before integrating into officially supported i/o.

Concretely, I could see a PR containing CategoricalArray alone (and tests), and then a separate PR that adds some sort of _writeable_data attribute to Variable or DataArray for i/o, which would allow us to store a reference to the codes there. The encoding for the categories could then be stored as in #8147 or something similar.

Does this seem feasible?

ilan-gold commented 10 months ago

P.S. Another step along this path could be adding #5287 for pandas extension array support, which would give us categorical support because pandas' categorical array is an extension array. But I'm not sure this really changes the roadmap I've laid out, beyond changing the return type of our categorical array wrapper. It's tough to say without seeing #5287 implemented.

dcherian commented 10 months ago

AFAICT you're subclassing from ExplicitlyIndexedNDArrayMixin to bypass our checks for what can go in an Xarray object. The purpose of these classes is to enable lazy indexing of data from disk and to abstract away gory indexing-support details for internal use.

If you satisfy this duck array check, you won't need to do that: https://github.com/pydata/xarray/blob/33d51c8da3acbdbe550e496a006184bda3de3beb/xarray/core/utils.py#L260-L270

I agree that Array API compliance probably makes no sense. Would it make sense to support __array_function__ and __array_ufunc__? IIUC you are allowed to return NotImplemented from these, so adding those two methods would let Xarray know that CategoricalArray is a real array.

I agree that writing a NEP-18 wrapper for Pandas extension arrays may be quite valuable. It would allow IntervalArray too, but perhaps only <=2D (?).
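A minimal sketch of that protocol idea (illustrative names, not xarray API): defining __array_ufunc__ and __array_function__ marks the object as array-like, while returning NotImplemented declines numeric operations that make no sense for categorical data:

```python
import numpy as np

class CategoricalArray:
    """Hypothetical array-like that declines numpy dispatch (NEP 13/18)."""

    def __init__(self, codes, categories):
        self.codes = np.asarray(codes)
        self.categories = categories

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # e.g. np.exp on category labels is meaningless, so decline
        return NotImplemented

    def __array_function__(self, func, types, args, kwargs):
        return NotImplemented

arr = CategoricalArray([0, 1, 0], {0: "foo", 1: "bar"})
try:
    np.exp(arr)  # numpy raises TypeError when dispatch is declined
except TypeError:
    print("np.exp declined, as intended")
```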

dcherian commented 10 months ago

We discussed this at a meeting today.

ilan-gold commented 10 months ago

Thanks @dcherian. I will look into these. I think 1+3 from your above points go together and form one concrete path forward. Point 4 is a separate path forward from what I can tell. I think I will look into the fourth point closely, then, because it seems to offer the best way forward IMO in terms of generalization, unless #5287 is not seen as an important issue. If point 4 also goes belly up, we will need to fall back to a combination of 1+3.

And I am strongly opposed to point 2 for a variety of reasons, especially now that I have looked into it.