Support for Awkward Arrays in AnnData's X

scverse / anndata

Annotated data.

http://anndata.readthedocs.io

BSD 3-Clause "New" or "Revised" License


Support for Awkward Arrays in AnnData's X #1235

Open xinyuejohn opened 1 year ago

xinyuejohn commented 1 year ago

Please describe your wishes and possible alternatives to achieve the desired result.

I'm thrilled to see that AnnData now supports awkward arrays. This feature has been incredibly useful. I'd like to inquire if there are plans to extend this support to the X of AnnData. Implementing this would significantly benefit our ongoing projects with ehrapy 2.0 (https://github.com/theislab/ehrapy) and EHRData.

To explain further: in our current use of AnnData with ehrapy, each patient is represented as a row with several variables. However, as shown in the figure below, some of the variables can't fit into the current X (a NumPy array) because they are lists of lists or lists of dicts. Users still expect to process these data, for example to compute statistics (min/max/avg) or perform imputation. So we don't want to store these variables in .layers, .obsm, or .varm, because that is not user-friendly and adds complexity when integrating the data into computational workflows.

Is there an estimated timeline for when we might expect this feature? Thanks for your continuous efforts in improving AnnData!

flying-sheep commented 12 months ago

Hi! Thanks for the feature request. I think that’s feasible, but I need to discuss this with @ivirshup and @ilan-gold. We need to formalize what the supported array types in all of anndata’s fields are.

grst commented 12 months ago

I had hoped that this would eventually get solved with #244.

Back in the PR that introduced awkward arrays, we decided against implementing it in X (for now) as it would have required duplication of a lot of custom code. Checking the constraints on X is already a huge mess and adding the checks for awkward arrays makes it worse.

Personally, I'd suggest you set adata.X = None and just put it in a layer.

Zethson commented 11 months ago

@grst

> Personally, I'd suggest you set adata.X = None and just put it in a layer.

This'd mean that people who load complex EHR data will have an "empty" object. Yeah, everything is in a layer, but one needs to either always pass the layer argument when working with it or copy it to X, which doesn't work. Just not the nicest experience.

It'd also deviate from the rest of the scverse workflows where the working data is usually in X.

I want everything in layers but scverse is not there yet.

grst commented 11 months ago

> It'd also deviate from the rest of the scverse workflows where the working data is usually in X.

In scirpy, X is empty by default (unless you store paired gene expression in the same AnnData object, which is not recommended in favor of MuData). The TCR data is in .obsm.

It of course depends on your interface, but at least in the scirpy case only very advanced users would want to interact with the awkward array directly. All others only access it through scirpy API calls (including a get function to retrieve some variables), and there you can just set an appropriate default to get it from a layer or obsm.
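A minimal sketch of that idea, using only the standard library. This is not scirpy's actual API: `get_values`, `DEFAULT_LAYER`, and the `"ehr"` layer name are hypothetical, and a `SimpleNamespace` stands in for an AnnData object with an empty X.

```python
from types import SimpleNamespace

DEFAULT_LAYER = "ehr"  # hypothetical layer name chosen by the package


def get_values(adata, layer=DEFAULT_LAYER):
    """Return the working data, reading from a layer unless layer=None."""
    if layer is None:
        return adata.X
    return adata.layers[layer]


# Stand-in for an AnnData-like object: X is empty, the data lives in a layer.
adata = SimpleNamespace(X=None, layers={"ehr": [[1.0, 2.0], [3.0]]})

values = get_values(adata)                # reads adata.layers["ehr"]
from_x = get_values(adata, layer=None)    # explicit opt-in to X
```

With a default like this, users who go through the package's functions never need to pass the layer argument themselves.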

grst commented 11 months ago

> I want everything in layers but scverse is not there yet.

and why repeat the old mistake for new packages

ivirshup commented 11 months ago

I would suggest you try working with it in layers for now too. Most scverse workflows assume the data is in X, but most scverse workflows also assume that X and layers contain matrix-like arrays with homogeneous dtypes.

I would be interested in hearing how this goes.

Zethson commented 11 months ago

> > I want everything in layers but scverse is not there yet.
>
> and why repeat the old mistake for new packages

Because it builds upon scanpy which has the assumption that it works with X by default. But yeah, I could probably pass a default layer everywhere and modify that behavior.