scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

adata.reindex() #531

Open gokceneraslan opened 3 years ago

gokceneraslan commented 3 years ago

Hey,

When I apply a classifier trained on dataset A to dataset B, I need to match the genes of the two datasets. I think this is equivalent to the reindex function in pandas.
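For reference, this is the pandas behavior being referred to, shown on a toy expression table. The gene and cell names are made up for illustration.

```python
import pandas as pd

# Expression table for dataset B: cells as rows, genes as columns.
expr_b = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=["cell_0", "cell_1"],
    columns=["GeneA", "GeneB", "GeneC"],
)

# Genes the classifier was trained on (dataset A); hypothetical names.
classifier_genes = ["GeneC", "GeneA", "GeneD"]

# Reorder to the classifier's gene order, drop GeneB,
# and add the missing GeneD filled with 0.
aligned = expr_b.reindex(columns=classifier_genes, fill_value=0.0)
```

The result has exactly the requested columns in the requested order, which is what a pre-trained classifier expects as input.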

Given that

  1. we already have the necessary internal functions (e.g. things used by outer join),
  2. classifiers and annotations are becoming increasingly important as more datasets are generated,
  3. pandas has something similar (so the behavior is well defined for var and obs),

I think it'd be useful for the community to have this in AnnData.

Of course, one can train the classifier only on the genes shared by A and B, or concatenate and then split the datasets. But imagine that you already have a classifier trained on a fixed set of genes and you want to make predictions for a new dataset: that clearly requires reindex().

Right now I am using the following idiotic function to do that:

import pandas as pd
import anndata as ad
import numpy as np

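def poor_mans_buggy_reindex(labels, adata, axis=1, fill_value=None):
    # Build an empty placeholder that carries only the target labels, outer-join
    # it with adata so missing labels appear, then subset to the requested order.
    if axis == 1:
        # Reindex variables: the placeholder has 0 obs and len(labels) vars.
        ph = ad.AnnData(np.ones([0, len(labels)]), var=pd.DataFrame(index=labels))
        ret = ad.concat([adata, ph], join='outer', axis=0, fill_value=fill_value)
        return ret[:, labels].copy()
    else:
        # Reindex observations: the placeholder has len(labels) obs and 0 vars.
        ph = ad.AnnData(np.ones([len(labels), 0]), obs=pd.DataFrame(index=labels))
        ret = ad.concat([adata, ph], join='outer', axis=1, fill_value=fill_value)
        return ret[labels].copy()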

Here is what it looks like:

image

However, this doesn't work when axis=0, e.g.

new_adata = poor_mans_buggy_reindex(['0', '1', 'A', 'B'], adata, axis=0)
new_adata.obs
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-a81e408604f3> in <module>
----> 1 new_adata = poor_mans_buggy_reindex(['0', '1', 'A', 'B'], adata, axis=0)
      2 new_adata.obs

<ipython-input-3-ca94971cbd15> in poor_mans_buggy_reindex(labels, adata, axis, fill_value)
      6     else:
      7         ph = sc.AnnData(np.ones([len(labels), 0]), obs=pd.DataFrame(index=labels))
----> 8         ret = ad.concat([adata, ph], join='outer', axis=1, fill_value=fill_value)
      9         return ret[labels].copy()

~/.miniconda3/lib/python3.8/site-packages/anndata/_core/merge.py in concat(adatas, axis, join, merge, uns_merge, label, keys, index_unique, fill_value, pairwise)
    816     )
    817 
--> 818     X = concat_arrays(
    819         [a.X for a in adatas], reindexers, axis=axis, fill_value=fill_value
    820     )

~/.miniconda3/lib/python3.8/site-packages/anndata/_core/merge.py in concat_arrays(arrays, reindexers, axis, index, fill_value)
    422         )
    423     else:
--> 424         return np.concatenate(
    425             [
    426                 f(x, fill_value=fill_value, axis=1 - axis)

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 2732 and the array at index 1 has size 4

which also links this issue to #526.

ivirshup commented 3 years ago

More thoughts on this: https://github.com/theislab/anndata/issues/441#issuecomment-730882826

I've been thinking this could be a good part of an anndata.align(adatas: list[AnnData], *, dim: int, kind: Literal) -> list[AnnData] function, for aligning a set of AnnDatas. Most of this could probably be stripped out of concat.
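To make the idea concrete, the first step of such an align function would presumably be computing the target labels along the chosen dimension. The sketch below shows only that step on plain pandas indices. The helper name shared_labels and the "inner"/"outer" values for kind are assumptions for illustration; the proposal above does not spell out what kind would accept.

```python
from functools import reduce

import pandas as pd


def shared_labels(indices, kind="inner"):
    """Compute the target labels for aligning several axes.

    Hypothetical helper: kind="inner" keeps labels present in every index,
    kind="outer" keeps labels present in any index.
    """
    indices = [pd.Index(ix) for ix in indices]
    if kind == "inner":
        return reduce(lambda a, b: a.intersection(b, sort=False), indices)
    if kind == "outer":
        return reduce(lambda a, b: a.union(b, sort=False), indices)
    raise ValueError(f"unknown kind: {kind!r}")


# Made-up gene names for two datasets.
genes_a = ["GeneA", "GeneB", "GeneC"]
genes_b = ["GeneC", "GeneA", "GeneD"]

inner = shared_labels([genes_a, genes_b], kind="inner")
outer = shared_labels([genes_a, genes_b], kind="outer")
```

Each AnnData would then be reindexed to these labels along dim, so that all returned objects share the same axis.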