scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Incorrect merging of `pd.Series` in `AnnCollection` #1352

Open ordabayevy opened 5 months ago

ordabayevy commented 5 months ago

Report

Problem:

I create an AnnCollection without harmonization and then access a categorical obs column, e.g. disease (example below) or cell_type. Indexing cells from a single AnnData object and then accessing the column works as expected (it returns a categorical pd.Series). However, if the indexed cells come from multiple AnnData objects, accessing the column returns a NumPy array with dtype=object. Looking at the source code, the problem seems to lie in the concat_arrays function, which has no logic for handling pd.Series inputs:

https://github.com/scverse/anndata/blob/c790113fbbc13db505a3bfc98576d8da0139d90b/anndata/_core/merge.py#L746
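
To illustrate the suspected mechanism without downloading any data, here is a minimal pandas/NumPy sketch. It is not the anndata code path itself. It assumes that the merge ends up treating each column as a plain array, which is my reading of the linked function and not something I have confirmed. The two toy Series stand in for obs["disease"] from two AnnData objects.

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals

# Toy stand-ins for obs["disease"] taken from two AnnData objects (hypothetical data).
categories = ["COVID-19", "Healthy"]
a = pd.Series(pd.Categorical(["COVID-19"], categories=categories), name="disease")
b = pd.Series(pd.Categorical(["COVID-19"], categories=categories), name="disease")

# Array-style concatenation converts each Series to a NumPy array first,
# so the categorical dtype is dropped and the values become Python objects.
as_array = np.concatenate([a, b])
print(type(as_array).__name__, as_array.dtype)

# pandas-aware concatenation keeps the categorical dtype when the categories match.
as_series = pd.concat([a, b])
print(as_series.dtype)

# union_categoricals also accepts inputs whose category sets differ.
merged = union_categoricals([a, b])
print(merged.categories.tolist())
```

If that reading is right, a fix would need a branch for pd.Series (or categorical) inputs before falling back to array concatenation. Something like pd.concat or union_categoricals could be used there, but which one fits the existing merge logic is a design question for the maintainers.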

Code:

import gdown
import scanpy as sc
from anndata.experimental.multi_files import AnnCollection

# the data is from this scvi reproducibility notebook
# https://yoseflab.github.io/scvi-tools-reproducibility/scarches_totalvi_seurat_data/
gdown.download(
    url="https://drive.google.com/uc?id=1JgaXNwNeoEqX7zJL-jJD3cfXDGurMrq9", output="covid_cite.h5ad", quiet=False
)

covid = sc.read("covid_cite.h5ad")

dataset = AnnCollection([covid, covid], join_obs=None, join_obsm=None, join_vars=None, harmonize_dtypes=False)

dataset[0].obs["disease"]
# expected result
# AAACCCACACCAGCGT-1    COVID-19
# Name: disease, dtype: category
# Categories (2, object): ['COVID-19', 'Healthy']

dataset[[0, 60000]].obs["disease"]
# unexpected result
# array(['COVID-19', 'COVID-19'], dtype=object)

Versions

-----
anndata             0.10.5.post1
session_info        1.0.0
-----
cython_runtime      NA
dateutil            2.8.2
exceptiongroup      1.1.3
google              NA
h5py                3.10.0
natsort             8.4.0
numpy               1.26.1
packaging           23.2
pandas              2.1.1
pyarrow             15.0.0
pynvml              NA
pytz                2023.3.post1
scipy               1.11.3
six                 1.16.0
sphinxcontrib       NA
torch               2.2.0+cu121
torchgen            NA
tqdm                4.66.1
typing_extensions   NA
zoneinfo            NA
-----
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
Linux-4.19.0-26-cloud-amd64-x86_64-with-glibc2.28
-----
Session information updated at 2024-02-01 16:10

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!