Closed ilan-gold closed 7 months ago
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!
@pydata/xarray Ilan's question seems interesting and potentially impactful if we can support his use case, but I don't know enough about categorical arrays to weigh in. Can anyone provide guidance?
It would be great to reuse the Arrow Dictionary concept here somehow.
@rabernat / @TomNicholas thanks for the replies :) @rabernat What concept here are you referring to?
Alternatively, could you write an array-api compliant container instead? This would wrap a (potentially lazy) numpy-like array of "codes" with `dtype=np.intXX` and a `CategoricalDtype` to preserve the mapping from codes to category labels.
We could "decode" netcdf Enum dtypes to this new container and also allow converting from CF-style "Flag Variables".
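A rough sketch of what such a container might look like (all names here are hypothetical, not existing xarray API): lazy-ish integer codes plus a `pandas.CategoricalDtype`, decoding codes to labels only on access.

```python
import numpy as np
import pandas as pd


class CategoricalArray:
    """Hypothetical sketch: wraps integer codes plus a CategoricalDtype.

    The codes array could be any (potentially lazy) numpy-like array;
    labels are only materialized on access.
    """

    def __init__(self, codes, categories, ordered=False):
        self.codes = np.asarray(codes)  # stand-in for a lazy/duck array
        self.dtype = pd.CategoricalDtype(categories, ordered=ordered)

    @property
    def shape(self):
        return self.codes.shape

    def __getitem__(self, key):
        # decode codes -> labels only for the requested selection
        return pd.Categorical.from_codes(self.codes[key], dtype=self.dtype)


arr = CategoricalArray(codes=[0, 1, 1, 0], categories=["clear", "cloudy"])
print(list(arr[:2]))  # ['clear', 'cloudy']
```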
Supporting both netcdf enum and pandas.Categorical seem sensible to me, although I don't know if it will be easy to do so via a single array wrapper.
On a related note, I'm wondering if we shouldn't start thinking about how to generalize the approach based on `ExplicitlyIndexedNDArrayMixin` that @ilan-gold describes above: i.e., clean up the wrapper classes in `xarray.core.indexing` and expose (some of) them as public API such that these could be subclassed in 3rd-party libraries (related to #5081). This could also help with using `pandas.ExtensionArray` and/or `pyarrow.ExtensionArray` with Xarray. I guess we will need to address this if we want to keep Xarray up to date with Pandas switching backends from NumPy to Arrow? The question of how to work with Arrow arrays in Xarray has also been asked multiple times (AFAIK in Xarray geospatial extensions).
Alternatively, could you write an array-api compliant container instead? This would wrap a (potentially lazy) numpy-like array of "codes" with dtype=np.intXX and a CategoricalDtype to preserve the mapping from codes to category labels.
We could "decode" netcdf Enum dtypes to this new container and also allow converting from CF-style "Flag Variables".
Yes, I will try to do that. If I run into any road bumps, I will post here.
On a related note, I'm wondering if we shouldn't start thinking about how to generalize the approach based on `ExplicitlyIndexedNDArrayMixin` that @ilan-gold describes above:
I do think this would be good. I think what I have makes sense, but making it more explicit/public to end-users would be cool.
This is not the place for this, but it is related: how do I encode the necessary information for encoding/decoding? From what I understand, this belongs in `attrs`, but that seems to be reserved for the returned value of `ncattrs`. In the docs, it seems that `attrs` is the place that `decode` should read from, and then move to `encoding`: "`Variable` attributes no more applicable after the decoding, are dropped and stored in the `Variable.encoding` to make them available to the `encode` method, which performs the inverse transformation."
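As a concrete illustration of that docs passage (a hypothetical coder, not xarray's actual implementation), a decode step pops the consumed attribute out of `attrs` and stashes it in `encoding` so the inverse `encode` can find it later:

```python
import numpy as np
import xarray as xr

# Hypothetical sketch of the attrs -> encoding convention described above:
# an attribute consumed during decoding is moved into `encoding` so that
# `encode` can later perform the inverse transformation.
var = xr.Variable(
    ("x",),
    np.array([0, 1, 1]),
    attrs={"enum_dict": {"clear": 0, "cloudy": 1}},
)


def decode(variable):
    attrs = dict(variable.attrs)
    encoding = dict(variable.encoding)
    # no longer applicable after decoding: drop from attrs, keep in encoding
    encoding["enum_dict"] = attrs.pop("enum_dict")
    return xr.Variable(variable.dims, variable.data, attrs=attrs, encoding=encoding)


decoded = decode(var)
print(decoded.encoding)  # {'enum_dict': {'clear': 0, 'cloudy': 1}}
print(decoded.attrs)     # {}
```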
Relatedly, how strictly should the information in `attrs` or `encoding` follow the netCDF convention for encoding the enum type information? NetCDF stores this enum mapping as a dictionary, both on-disk and in-memory. But one thing I think about is that, for example in AnnData, your categories might be stored in a zarr array (i.e., an array of strings where the mapping values are just the positional indices), and you don't want to have to read that in on load. So a dict for the enum type would not be ideal, and we would want something like:
categories = zarr.open_array('unique_categories')
encoding = {
    ...
    'enum_dict': {
        'from': list(range(len(categories))),
        'to': categories
    }
}
instead of
encoding = {
    ...
    'enum_dict': dict(zip(zarr.open_array('unique_categories'), range(len(categories))))
}
which is what netCDF has (i.e., an on-disk dictionary, instead of a stored array) and would force the user to read in all the categories data as well as the code data. But this might be too big of an ask.
So I have started working on a branch, but have come across a bit of an issue. When the `Dataset` is created, it loses any reference to the `CategoricalArray`. This is a problem for serialization, because you need a reference to the underlying integer (or whatever dtype) arrays to write out. For example, `xr_ds` here
import netCDF4 as nc
import xarray as xr
import numpy as np

ds = nc.Dataset("mre.nc", "w", format="NETCDF4")
cloud_type_enum = ds.createEnumType(int, "cloud_type", {"clear": 0, "cloudy": 1})
# {'cloud_type': <class 'netCDF4._netCDF4.EnumType'>: name = 'cloud_type', numpy dtype = int64, fields/values ={'clear': 0, 'cloudy': 1}}
ds.createDimension("time", size=10)
x = np.arange(10)
ds.createVariable("x", np.int32, dimensions=("time",))
ds.variables["x"][:] = x
ds.createVariable("cloud", cloud_type_enum, dimensions=("time",))
ds["cloud"][:] = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
ds.close()

# -- Open dataset with xarray
xr_ds = xr.open_dataset("./mre.nc")
with my branch lacks any reference to `CategoricalArray` (which is similar to, e.g., `BoolTypeArray`), even though the `Dataset` was initialized with it.
This is something that was quite surprising, actually: there were 4 or more reconstructions of the `Variable` object containing the codes, and each time `data` got more complicated (although still containing some reference to `CategoricalArray`) until the final `Variable` completely lacks a way to recover the `CategoricalArray`. For example, calling `xr.open_dataset("./mre.nc")` has `Variable` constructions for `CategoricalArray(codes=..., categories={0: 'clear', 1: 'cloudy'}, ordered=False)`, then `LazilyIndexedArray(array=CategoricalArray(codes=..., categories={0: 'clear', 1: 'cloudy'}, ordered=False), key=BasicIndexer((slice(None, None, None),)))` three times, but `xr_ds['cloud'].variable._data` is `MemoryCachedArray(array=NumpyIndexingAdapter(array=array(['cloudy', 'clear', 'cloudy', 'clear', 'cloudy', 'clear', 'cloudy', 'clear', 'clear', 'cloudy'], dtype='<U6')))`. This lacks a reference to the original data.
In terms of investigating, there seems to be something going on where `backend_ds` in this line has a reference to `CategoricalArray` via `backend_ds['cloud'].variable._data` (which isn't amazing, but something), but then loses it somehow once you step into the next function, `_dataset_from_backend_dataset`!
TL;DR I will keep digging, but if anyone has any quick suggestions of how to keep a reference to the underlying `CategoricalArray` object when constructing a `Dataset` for netCDF datasets, let me know. In writing this comment, I discovered that this is not a general problem, i.e., the following works well:
import numpy as np
import xarray as xr
codes = np.array([0, 1, 2, 1, 0])
categories = {0: 'foo', 1: 'jazz', 2: 'bar'}
cat_arr = xr.coding.variables.CategoricalArray(codes=codes, categories=categories)
v = xr.Variable(("time",), cat_arr, fastpath=True)
ds = xr.Dataset({'cloud': v})
ds['cloud']._variable._data
# CategoricalArray(codes=..., categories={0: 'foo', 1: 'jazz', 2: 'bar'}, ordered=False)
So perhaps it would make sense to release this as a general feature first, instead of focusing on NetCDF. Perhaps that is a good place to start for a PR.
@ilan-gold Sorry, just saw this popping up in my inbox. There is work to read/write `netCDF4.EnumType` in #8147. This PR is close to being finished.
The proposed solution is to add metadata to `encoding["dtype"]`:
if isinstance(var.datatype, netCDF4.EnumType):
    encoding["dtype"] = np.dtype(
        data.dtype,
        metadata={
            "enum": var.datatype.enum_dict,
            "enum_name": var.datatype.name,
        },
    )
This follows what `h5py` is doing, besides adding the `enum_name`. Could this be used here?
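For illustration, here is the same idea outside any backend (values borrowed from the `mre.nc` example above): the enum mapping rides along as `dtype.metadata` and can be recovered later.

```python
import numpy as np

# Sketch of the approach in #8147: attach the enum mapping as dtype
# metadata, mirroring what h5py does (plus the extra "enum_name" key).
enum_dict = {"clear": 0, "cloudy": 1}
dt = np.dtype(np.int64, metadata={"enum": enum_dict, "enum_name": "cloud_type"})

# The metadata survives as part of the dtype object.
print(dt.metadata["enum"])       # {'clear': 0, 'cloudy': 1}
print(dt.metadata["enum_name"])  # cloud_type
```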
For background, these various arrays will all get lost on data load, which I think is what you're discovering.
I would instead work on a `CategoricalArray` that followed the array API. Once that works, we can discuss lazy loading and decoding to that array type. Potentially, this could be built as a numpy custom dtype, but I haven't followed the progress there.
@dcherian That sounds like a plan to me. I will check that out.
@kmuehlbauer Thanks for this, that is good to see.
So I have spent some time looking into this. @kmuehlbauer Thanks for directing me to that PR, not sure how I missed it. This PR is interesting and contains some good stuff. It's definitely a step in the right direction.
@dcherian I don't think an array-api compliant categorical container is possible, unless we implement array methods (e.g., `xp.exp`) directly on the codes, which seems not good. So if we're not going to do that, trying to create an array-api compliant container for categoricals reduces down to the problem of creating an array-api container for strings, since that is the array that `__getitem__` on some categorical container should return. Creating an array-api compliant container for strings, though, seems out of scope (the thread below this comment should explain more).
Separately, it still seems strange to me that one can write code that declares a `Variable` object during the i/o process and then lose that object, or any reference to it or its constituent parts, by the end of the process. That being said, as I noted, the following works, in that using `ds` doesn't seem to error out in any horrifying way with normal usage:
import numpy as np
import xarray as xr
codes = np.array([0, 1, 2, 1, 0])
categories = {0: 'foo', 1: 'jazz', 2: 'bar'}
cat_arr = xr.coding.variables.CategoricalArray(codes=codes, categories=categories)
v = xr.Variable(("time",), cat_arr)
ds = xr.Dataset({'cloud': v})
# Some simple relevant operations, there are probably others but these are what I came up with quickly to test
ds['cloud']._variable._data # This is a CategoricalArray
ds.where(ds.cloud == 1) # all nans as expected even though this is the correct code
ds.where(ds.cloud == "jazz") # correct masking
ds.where(ds.cloud.isin(["jazz", "bar"])) # more correct masking, but more complicated
ds.groupby("cloud") # three groups as expected
The current status of `CategoricalArray` is that it inherits from `ExplicitlyIndexedNDArrayMixin`, which means that everything is kosher staying within `xarray` and normal development patterns (from what I can tell).
Would a PR that tests this array and its behavior with `Variable`/`Dataset` be acceptable? We can punt on i/o. This could give people the opportunity to work with the object before integrating into officially supported i/o.
Concretely, I could see a PR containing `CategoricalArray` alone (and tests), and then a separate PR that adds some sort of `_writeable_data` attribute to `Variable` or `DataArray` for i/o, which would allow us to store a reference to the codes there. The encoding for the categories could then be stored as in #8147 or something similar.
Does this seem feasible?
P.S. Another step along this path could be adding #5287 for pandas extension array support, which would give us Categorical support, because pandas' categorical array is an extension array. But I'm not sure this really changes the roadmap I've laid out beyond changing the return type of our categorical array wrapper. It's tough to say without seeing #5287 implemented.
AFAICT you're subclassing from `ExplicitlyIndexedArrayMixin` to bypass our checks for what can go in an Xarray object. The purpose of these classes is to enable lazy indexing of data from disk, and to abstract away gory indexing-support details for internal use.
If you satisfy this duck array check, you won't need to do that: https://github.com/pydata/xarray/blob/33d51c8da3acbdbe550e496a006184bda3de3beb/xarray/core/utils.py#L260-L270
I agree that Array API compliance probably makes no sense. Would it make sense to support `__array_function__` and `__array_ufunc__`? IIUC you are allowed to return `NotImplemented` from these, so adding those two methods would let Xarray know that `CategoricalArray` is a real array.
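A minimal sketch of that suggestion (illustrative names only, not xarray API): a wrapper exposing `ndim`/`shape`/`dtype` plus the two protocol methods, which simply return the `NotImplemented` singleton to refuse numeric operations while still looking like a real array.

```python
import numpy as np


class CodesArray:
    """Hypothetical sketch of a minimal duck-array surface around the codes."""

    def __init__(self, codes):
        self._codes = np.asarray(codes)

    @property
    def ndim(self):
        return self._codes.ndim

    @property
    def shape(self):
        return self._codes.shape

    @property
    def dtype(self):
        return self._codes.dtype

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # e.g. np.exp on categorical codes is meaningless, so opt out
        return NotImplemented

    def __array_function__(self, func, types, args, kwargs):
        return NotImplemented


arr = CodesArray([0, 1, 2])
try:
    np.exp(arr)  # the ufunc is refused, so numpy raises TypeError
except TypeError as err:
    print("refused:", err)
```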
I agree that writing a NEP-18 wrapper for Pandas Extension arrays may be quite valuable. It would allow `IntervalArray` too, but perhaps only <=2D (?).
We discussed this at a meeting today. Among the options considered: raising `NotImplementedError` from a bunch of methods that don't make sense (this seems not ideal), and `__array_function__`/`__array_ufunc__` overrides, which are free to return `NotImplemented` and so might fit best.
Thanks @dcherian. I will look into these. I think 1+3 from your above points go together, and form one concrete path forward. Point 4 is a separate path forward from what I can tell. I think I will look into the fourth point closely then, because it seems to offer the best way forward IMO in terms of generalization, unless this issue (#5287) is not seen as important. If point 4 also goes belly up, we will need to fall back to a combination of 1+3.
And I am strongly opposed to point 2 for a variety of reasons, especially now that I have looked into it.
Is your feature request related to a problem?
We are looking to improve compatibility between AnnData and `xarray` (see https://github.com/scverse/anndata/issues/744), and so categoricals are naturally on our roadmap. Thus, I think some sort of standard-use categoricals array would be desirable. It seems something similar has come up with netCDF, although my knowledge is limited, so this issue may be more distinct than I am aware. So what comes of this issue may solve two birds with one stone, or it may work towards some common solution that can at least help both use-cases (AnnData and netCDF ENUM).
Describe the solution you'd like
The goal would be a standard-use categorical data type `xarray` container of some sort. I'm not sure what form this can take. We have something functional here that inherits from `ExplicitlyIndexedNDArrayMixin` and returns `pandas.CategoricalDtype`. So let's say this implementation would be at least a conceptual starting point to work from (it also seems not dissimilar to what is done here for new CF types). Some issues:
- I'm not sure what the return type of an `xarray` categorical array should be (i.e., `numpy` with the categories applied, `pandas`, something custom, etc.). So I'm not sure if using the `pandas.CategoricalDtype` type is acceptable, as I do in the linked implementation. Relatedly...
- I'm not sure `pandas.CategoricalDtype` really helps with the already existing CF Enum need if you want the return type to be some sort of `numpy` array (although again, not sure about the return type). As I understand it, though, the whole point of categoricals is to use integers as the base type and then only show "strings" outwardly, i.e., in printing, the API for equality operations, accessors etc., while the internals are based on integers. So I'm not really sure `numpy` is even an option here. Maybe we roll our own solution?
- Are there other issues, e.g. with `Variable`? (I don't think so, but I am just a beginner here 😄)
It seems you may want, in addition to the array container, some sort of i/o functionality for this feature (so maybe some on-disk specification?).
Describe alternatives you've considered
I think there is some route via `VariableCoder` as hinted here, i.e., using `encode`/`decode`. This would probably be more general purpose, as we could encode directly to other data types if using `pandas` is not desirable. Maybe this would be a way to support both netCDF and returning a `pandas.CategoricalDtype` (again, not sure what the netCDF return type should be for ENUM).
So just for reference, the current behavior of
to_xarray
withpandas.CategoricalDtype
isobject
dtype
fromnumpy
:And as stated in the
netCDF
issue, for that use-case, the information aboutENUM
is lost (from what I can read).Apologies if I'm missing something here! Feedback welcome! Sorry if this is a bit chaotic, just trying to cover my bases.
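For reference, the lossy conversion described above can be seen without xarray at all: coercing a `pandas.Categorical` to a plain numpy array (which is effectively what `to_xarray` did at the time of this issue) yields an object-dtype array of labels, dropping the codes/categories structure entirely.

```python
import numpy as np
import pandas as pd

# The categorical stores compact integer codes plus a categories index...
cat = pd.Categorical(["foo", "jazz", "bar", "jazz"])
print(cat.codes)       # small integer codes
print(cat.categories)  # the label index

# ...but coercion to a numpy array keeps only the expanded labels.
arr = np.asarray(cat)
print(arr.dtype)  # object
print(arr)
```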