adata.reindex() - GitHub issues

Hey,

When I apply a classifier trained on dataset A to dataset B, I need to match the genes of the two datasets. I guess this is equivalent to the reindex function in pandas.

Given that

we already have necessary internal functions (e.g. things used by outer join),
classifiers and annotations are becoming increasingly important as more datasets are generated,
pandas has something similar (e.g. behavior is well defined for var and obs),

I think it'd be useful for the community to have this in AnnData.
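
For reference, here is a minimal sketch of the pandas behavior I have in mind. The frame and the labels (cell0, geneA, ...) are made up for illustration; only DataFrame.reindex itself is real pandas API.

```python
import pandas as pd

# Toy expression table: rows are cells (obs), columns are genes (var).
df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    index=["cell0", "cell1"],
    columns=["geneA", "geneB"],
)

# Reindex the columns to the gene set a classifier expects.
# geneA is dropped, geneB is kept, geneC is new and gets fill_value.
reindexed = df.reindex(columns=["geneB", "geneC"], fill_value=0.0)
print(reindexed)
```

Something like adata.reindex() would do the same for X while keeping obs and var in sync.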

Of course, one can train the classifier only on the genes shared by A and B, or concatenate and then split the datasets. But imagine that you already have a classifier trained on a fixed set of genes and you want to make predictions for a new dataset: this clearly requires reindex().

Right now I am using the following idiotic function to do that:

import pandas as pd
import anndata as ad
import numpy as np

def poor_mans_buggy_reindex(labels, adata, axis=1, fill_value=None):
    if axis == 1:
        ph = ad.AnnData(np.ones([0, len(labels)]), var=pd.DataFrame(index=labels))
        ret = ad.concat([adata, ph], join='outer', axis=0, fill_value=fill_value)
        return ret[:, labels].copy()
    else:
        ph = ad.AnnData(np.ones([len(labels), 0]), obs=pd.DataFrame(index=labels))   
        ret = ad.concat([adata, ph], join='outer', axis=1, fill_value=fill_value)
        return ret[labels].copy()

Here is what it looks like:

However, this doesn't work when axis=0, e.g.

new_adata = poor_mans_buggy_reindex(['0', '1', 'A', 'B'], adata, axis=0)
new_adata.obs

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-a81e408604f3> in <module>
----> 1 new_adata = poor_mans_buggy_reindex(['0', '1', 'A', 'B'], adata, axis=0)
      2 new_adata.obs

<ipython-input-3-ca94971cbd15> in poor_mans_buggy_reindex(labels, adata, axis, fill_value)
      6     else:
      7         ph = sc.AnnData(np.ones([len(labels), 0]), obs=pd.DataFrame(index=labels))
----> 8         ret = ad.concat([adata, ph], join='outer', axis=1, fill_value=fill_value)
      9         return ret[labels].copy()

~/.miniconda3/lib/python3.8/site-packages/anndata/_core/merge.py in concat(adatas, axis, join, merge, uns_merge, label, keys, index_unique, fill_value, pairwise)
    816     )
    817 
--> 818     X = concat_arrays(
    819         [a.X for a in adatas], reindexers, axis=axis, fill_value=fill_value
    820     )

~/.miniconda3/lib/python3.8/site-packages/anndata/_core/merge.py in concat_arrays(arrays, reindexers, axis, index, fill_value)
    422         )
    423     else:
--> 424         return np.concatenate(
    425             [
    426                 f(x, fill_value=fill_value, axis=1 - axis)

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 2732 and the array at index 1 has size 4

which also links this issue to #526.
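
Until this is resolved, one possible way to sidestep concat is to reindex the matrix directly. This is only a sketch under assumptions: reindex_matrix is a hypothetical helper, not part of anndata; it handles a dense NumPy array only; and the old labels must be unique. Putting the result back into an AnnData with matching obs/var is left out.

```python
import numpy as np
import pandas as pd

def reindex_matrix(X, old_labels, new_labels, axis=1, fill_value=0.0):
    """Hypothetical helper: reorder a dense X along `axis` to follow new_labels."""
    old_index = pd.Index(old_labels)
    # Position of each new label in the old index; -1 marks a missing label.
    # get_indexer raises if old_index contains duplicates.
    positions = old_index.get_indexer(pd.Index(new_labels))
    present = positions >= 0

    shape = list(X.shape)
    shape[axis] = len(positions)
    out = np.full(shape, fill_value, dtype=float)

    # Copy over the rows/columns that exist; the rest keep fill_value.
    if axis == 1:
        out[:, present] = X[:, positions[present]]
    else:
        out[present, :] = X[positions[present], :]
    return out

X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# Reindex genes (axis=1): geneX is missing and gets filled with 0.
by_var = reindex_matrix(X, ["geneA", "geneB", "geneC"], ["geneC", "geneX", "geneA"], axis=1)

# Reindex observations (axis=0), the case that fails above.
by_obs = reindex_matrix(X, ["0", "1"], ["0", "1", "A", "B"], axis=0)
```

Both axes go through the same code path here, so the obs case does not depend on how concat handles zero-width arrays.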

scverse / anndata

adata.reindex() #531