Cellxgene schema 3.0.0 breaks dataloader

emdann commented 2 years ago

Hi there, I'm trying to use the data loader to access cellxgene collections (following the tutorial). This ran without problems before, but now it throws a KeyError. I think this is because cellxgene renamed the ethnicity metadata column to self_reported_ethnicity.

To Reproduce

import anndata
import os
import sfaira

cache_path = os.path.join(".", "data")
dsg = sfaira.data.dataloaders.databases.DatasetSuperGroupDatabases(data_path=cache_path, cache_metadata=True)

Traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [5], in <cell line: 6>()
      3 import sfaira
      5 cache_path = os.path.join(".", "data2")
----> 6 dsg = sfaira.data.dataloaders.databases.DatasetSuperGroupDatabases(data_path=cache_path, cache_metadata=True)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/super_group.py:17, in DatasetSuperGroupDatabases.__init__(self, data_path, meta_path, cache_path, cache_metadata)
      9 def __init__(
     10         self,
     11         data_path: Union[str, None] = None,
   (...)
     14         cache_metadata: bool = False,
     15 ):
     16     dataset_super_groups = [
---> 17         DatasetSuperGroupCellxgene(
     18             data_path=data_path,
     19             meta_path=meta_path,
     20             cache_path=cache_path,
     21             cache_metadata=cache_metadata,
     22         ),
     23     ]
     24     super().__init__(dataset_groups=dataset_super_groups)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:83, in DatasetSuperGroupCellxgene.__init__(self, data_path, meta_path, cache_path, cache_metadata, verbose)
     81     print("WARNING: Zero cellxgene collections retrieved.")
     82 # Note that the collection itself is not passed to DatasetGroupCellxgene but only the ID string.
---> 83 dataset_groups = [
     84     DatasetGroupCellxgene(
     85         collection_id=x["id"],
     86         data_path=data_path,
     87         meta_path=meta_path,
     88         cache_path=cache_path,
     89         cache_metadata=cache_metadata,
     90         verbose=verbose,
     91     )
     92     for x in collections
     93 ]
     94 super().__init__(dataset_groups=dataset_groups)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:84, in <listcomp>(.0)
     81     print("WARNING: Zero cellxgene collections retrieved.")
     82 # Note that the collection itself is not passed to DatasetGroupCellxgene but only the ID string.
     83 dataset_groups = [
---> 84     DatasetGroupCellxgene(
     85         collection_id=x["id"],
     86         data_path=data_path,
     87         meta_path=meta_path,
     88         cache_path=cache_path,
     89         cache_metadata=cache_metadata,
     90         verbose=verbose,
     91     )
     92     for x in collections
     93 ]
     94 super().__init__(dataset_groups=dataset_groups)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:33, in DatasetGroupCellxgene.__init__(self, collection_id, data_path, meta_path, cache_path, cache_metadata, verbose)
     31 loader_pydoc_path_sfaira = "sfaira.data.dataloaders.databases.cellxgene.cellxgene_loader"
     32 load_func = pydoc.locate(loader_pydoc_path_sfaira + ".load")
---> 33 datasets = [
     34     Dataset(
     35         collection_id=collection_id,
     36         data_path=data_path,
     37         meta_path=meta_path,
     38         cache_path=cache_path,
     39         load_func=load_func,
     40         sample_fn=x,
     41         sample_fns=dataset_ids,
     42         cache_metadata=cache_metadata,
     43         verbose=verbose,
     44     )
     45     for x in dataset_ids
     46 ]
     47 keys = [x.id for x in datasets]
     48 super().__init__(datasets=dict(zip(keys, datasets)), collection_id=collection_id)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:34, in <listcomp>(.0)
     31 loader_pydoc_path_sfaira = "sfaira.data.dataloaders.databases.cellxgene.cellxgene_loader"
     32 load_func = pydoc.locate(loader_pydoc_path_sfaira + ".load")
     33 datasets = [
---> 34     Dataset(
     35         collection_id=collection_id,
     36         data_path=data_path,
     37         meta_path=meta_path,
     38         cache_path=cache_path,
     39         load_func=load_func,
     40         sample_fn=x,
     41         sample_fns=dataset_ids,
     42         cache_metadata=cache_metadata,
     43         verbose=verbose,
     44     )
     45     for x in dataset_ids
     46 ]
     47 keys = [x.id for x in datasets]
     48 super().__init__(datasets=dict(zip(keys, datasets)), collection_id=collection_id)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_loader.py:107, in Dataset.__init__(self, collection_id, data_path, meta_path, cache_path, load_func, dict_load_func_annotation, yaml_path, sample_fn, sample_fns, additional_annotation_key, cache_metadata, verbose, **kwargs)
    105 reordered_keys = ["organism"] + [x for x in self._adata_ids_cellxgene.dataset_keys if x != "organism"]
    106 for k in reordered_keys:
--> 107     val = self._collection_dataset[getattr(self._adata_ids_cellxgene, k)]
    108     # Unique label if list is length 1:
    109     # Otherwise do not set property and resort to cell-wise labels.
    110     v_clean = clean_cellxgene_meta_uns(k=k, val=val, adata_ids=self._adata_ids_cellxgene)

KeyError: 'ethnicity'

System:

- sfaira version: v0.3.12
- OS: Ubuntu 20.04.1 LTS
- Python 3.10.6
- Virtual environment: Conda

emdann commented 2 years ago

Update: I tried to fix the ethnicity annotation in the cellxgene loader, but there is more to it: other breakages also come from the recent update of the cellxgene schema.

One critical change seems to be that the collection metadata no longer records in x_normalization whether adata.X is preprocessed or not.
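
For illustration, here is a minimal sketch of a lookup that tolerates both kinds of change described above: a renamed key (ethnicity vs. self_reported_ethnicity) and a key that is no longer present (x_normalization). The helper `get_first_available` and the toy metadata records are hypothetical and are not part of sfaira or the cellxgene API; the actual fix in sfaira may look quite different.

```python
# Hypothetical helper, not part of sfaira: look up a metadata field under
# several candidate key names, trying the newest schema name first.
_MISSING = object()


def get_first_available(meta, candidates, default=_MISSING):
    """Return the value of the first candidate key present in `meta`.

    Raises KeyError if no candidate is present and no default was given.
    """
    for key in candidates:
        if key in meta:
            return meta[key]
    if default is _MISSING:
        raise KeyError(f"none of {list(candidates)} found in metadata")
    return default


# Toy records standing in for per-dataset collection metadata.
# The values are invented for the example.
meta_v2 = {"organism": ["Homo sapiens"], "ethnicity": ["unknown"], "x_normalization": "raw"}
meta_v3 = {"organism": ["Homo sapiens"], "self_reported_ethnicity": ["unknown"]}

# Renamed key: try the schema 3 name first, then fall back to the schema 2 name.
ETHNICITY_KEYS = ("self_reported_ethnicity", "ethnicity")
eth_v2 = get_first_available(meta_v2, ETHNICITY_KEYS)
eth_v3 = get_first_available(meta_v3, ETHNICITY_KEYS)

# Removed key: supply an explicit default instead of failing with KeyError.
norm_v3 = get_first_available(meta_v3, ("x_normalization",), default=None)
```

Trying the newer name first means that metadata served under schema 3 is matched directly, while records cached under schema 2 still resolve through the fallback.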

davidsebfischer commented 2 years ago

Thank you Emma, will implement the schema changes!

davidsebfischer commented 2 years ago

Update: I was able to reproduce this in the unit tests and will try to push a fix soon!

davidsebfischer commented 2 years ago

Update: I expect to merge the fix into dev this week.

davidsebfischer commented 2 years ago

The fix is now merged into dev and seems to be working. We will need to be careful when using it with existing schema version 2 data sets, but it should work well with data downloaded entirely under version 3. Let me know if any more issues come up, especially in continued work with version 2 data. I will hold off on releasing until it is clear that this is stable across all applications.
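
One way to spot a mix of version 2 and version 3 data in a local cache is to group data sets by their recorded schema version. The sketch below is only an illustration under stated assumptions: it assumes each data set records its version as a string such as "3.0.0" under a `schema_version` key (as in `adata.uns` of cellxgene-conformant files), and both helper functions are hypothetical rather than part of sfaira.

```python
from collections import defaultdict


def schema_major_version(uns, key="schema_version"):
    """Return the major schema version as an int, or None if it is not recorded.

    `uns` is any dict-like mapping, e.g. `adata.uns` of a downloaded data set
    (assumption: the version is stored there as a string like "3.0.0").
    """
    version = uns.get(key)
    if version is None:
        return None
    return int(str(version).split(".")[0])


def group_by_schema(uns_by_dataset):
    """Group data set IDs by major schema version (None = version unknown)."""
    groups = defaultdict(list)
    for dataset_id, uns in uns_by_dataset.items():
        groups[schema_major_version(uns)].append(dataset_id)
    return dict(groups)


# Toy cache contents; the IDs and values are invented for the example.
cached = {
    "dataset_a": {"schema_version": "2.0.0"},
    "dataset_b": {"schema_version": "3.0.0"},
    "dataset_c": {},  # no version recorded
}
groups = group_by_schema(cached)
```

Data sets that land in the version 2 or unknown groups are the ones worth re-downloading or testing separately before relying on the dev branch.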

theislab / sfaira

Cellxgene schema 3.0.0 breaks dataloader #694