snap-stanford / UCE

UCE is a zero-shot foundation model for single-cell gene expression data
MIT License

Ontology ID mappings #43

Closed bschilder closed 3 weeks ago

bschilder commented 3 weeks ago

One of the biggest advantages of CELLxGENE Census is that all of the cell types, tissues, species, etc. are mapped to common ontology terms (e.g. Cell Ontology, UBERON). This is super helpful for systematic evaluations, for example comparing ontology-based distances with embedding-based distances. It also makes it much easier to compare with new single-cell datasets.
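As a rough illustration of the kind of evaluation meant here, the sketch below compares a graph distance on an ontology with a cosine distance between embeddings. The term names, the hierarchy, and the 2-d vectors are all hypothetical placeholders, not real Cell Ontology structure or UCE output.

```python
from collections import deque
from math import sqrt

# Hypothetical toy hierarchy: child term -> list of parent terms (is_a edges).
PARENTS = {
    "T:root": [],
    "T:stromal": ["T:root"],
    "T:immune": ["T:root"],
    "T:fibroblast": ["T:stromal"],
    "T:adipocyte": ["T:stromal"],
    "T:t_cell": ["T:immune"],
}


def ontology_distance(a, b, parents=PARENTS):
    """Shortest path length between two terms, treating is_a edges as undirected."""
    neighbors = {term: set(ps) for term, ps in parents.items()}
    for child, ps in parents.items():
        for parent in ps:
            neighbors[parent].add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        term, dist = queue.popleft()
        if term == b:
            return dist
        for nxt in neighbors[term] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    raise ValueError(f"no path between {a} and {b}")


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return 1.0 - dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


# Hypothetical per-term mean embeddings (made-up numbers for illustration only).
EMBEDDINGS = {
    "T:fibroblast": [1.0, 0.1],
    "T:adipocyte": [0.9, 0.3],
    "T:t_cell": [0.1, 1.0],
}

for a, b in [("T:fibroblast", "T:adipocyte"), ("T:fibroblast", "T:t_cell")]:
    print(a, b, ontology_distance(a, b), cosine_distance(EMBEDDINGS[a], EMBEDDINGS[b]))
```

With real data, the ontology terms would come from the `*_ontology_term_id` columns, which is why having those IDs on every cell matters for this comparison.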

However, I've noticed that the subsampled IMA dataset doesn't seem to have these IDs.

Reprex

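import os
import anndata
import gdown
ref_path = 'data/IMA_sample.h5ad'
# Download the subsampled IMA file only if it is not already present
if not os.path.exists(ref_path):
    gdown.download(id="16UyzyZ7jK4y5Mj0PT75vqPO68i729soq", output="data/")
# Open in backed mode so the matrix is not loaded into memory
ref = anndata.read_h5ad(ref_path, backed='r')
ref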
AnnData object with n_obs × n_vars = 2969114 × 1280 backed at 'data/IMA_sample.h5ad'
    obs: 'cell_type', 'tissue', 'idx', 'dataset', 'species', 'coarse_cell_type'
    uns: 'cell_type_gpt_colors', 'coarse_cell_type_yanay_colors', 'dataset_colors', 'neighbors', 'pca', 'species_colors', 'tissue_colors', 'umap', 'dendrogram_coarse_cell_type', 'dendrogram_cell_type'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'
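A quick way to confirm the gap is to compare the `obs` column names against the ontology ID columns that CELLxGENE objects carry. The sketch below is only an assumption about which columns to expect; the three names are taken from the CELLxGENE example later in this issue, and `missing_ontology_columns` is a hypothetical helper.

```python
# Ontology ID columns as they appear in the CELLxGENE example in this issue.
# Which ones count as "essential" is an assumption made for illustration.
EXPECTED_ID_COLUMNS = (
    "cell_type_ontology_term_id",
    "tissue_ontology_term_id",
    "organism_ontology_term_id",
)


def missing_ontology_columns(obs_columns, expected=EXPECTED_ID_COLUMNS):
    """Return the expected ID columns that are absent, in the order given."""
    present = set(obs_columns)
    return [name for name in expected if name not in present]


# The obs fields printed for IMA_sample.h5ad above.
ima_obs_columns = ["cell_type", "tissue", "idx", "dataset", "species", "coarse_cell_type"]
print(missing_ontology_columns(ima_obs_columns))
```

With the object loaded as in the reprex, the same check would be `missing_ontology_columns(ref.obs.columns)`.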

Requests

The essentials

Would it be possible for someone to add these to the IMA object, or at least provide a mapping file for the following fields:

The nice-to-haves

CELLxGENE provides many more fields, but I think these are some of the more essential ones. Here are examples of other fields you may want to consider adding, since they could also be helpful.

(Randomly selected dataset: https://cellxgene.cziscience.com/collections/d2684035-a36e-458e-96af-8e37930bfdf6)

'data.frame':   10533 obs. of  46 variables:
 $ mapped_reference_assembly               : Factor w/ 1 level "GRCh38": 1 1 1 1 1 1 1 1 1 1 ...
 $ mapped_reference_annotation             : Factor w/ 1 level "GENCODE 33": 1 1 1 1 1 1 1 1 1 1 ...
 $ alignment_software                      : Factor w/ 1 level "kallisto bustools": 1 1 1 1 1 1 1 1 1 1 ...
 $ donor_id                                : Factor w/ 4 levels "MSK0782","MSK1139",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ self_reported_ethnicity_ontology_term_id: Factor w/ 2 levels "HANCESTRO:0462",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ donor_living_at_sample_collection       : Factor w/ 1 level "True": 1 1 1 1 1 1 1 1 1 1 ...
 $ organism_ontology_term_id               : Factor w/ 1 level "NCBITaxon:9606": 1 1 1 1 1 1 1 1 1 1 ...
 $ sample_uuid                             : Factor w/ 4 levels "2cd105b1-6c73-4f7a-a5aa-9058773657f0",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ sample_preservation_method              : Factor w/ 1 level "flash-freezing": 1 1 1 1 1 1 1 1 1 1 ...
 $ tissue_ontology_term_id                 : Factor w/ 1 level "UBERON:8480009": 1 1 1 1 1 1 1 1 1 1 ...
 $ development_stage_ontology_term_id      : Factor w/ 4 levels "HsapDv:0000112",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sample_derivation_process               : Factor w/ 1 level "resection": 1 1 1 1 1 1 1 1 1 1 ...
 $ sample_source                           : Factor w/ 1 level "Oxford": 1 1 1 1 1 1 1 1 1 1 ...
 $ donor_BMI_at_collection                 : num  29.7 29.7 29.7 29.7 29.7 ...
 $ tissue_type                             : Factor w/ 1 level "tissue": 1 1 1 1 1 1 1 1 1 1 ...
 $ suspension_derivation_process           : Factor w/ 1 level "mechanical dissociation,detergent solubilization": 1 1 1 1 1 1 1 1 1 1 ...
 $ suspension_dissociation_reagent         : Factor w/ 1 level "0.5% CHAPS": 1 1 1 1 1 1 1 1 1 1 ...
 $ suspension_dissociation_time            : Factor w/ 1 level "10 minute": 1 1 1 1 1 1 1 1 1 1 ...
 $ suspension_uuid                         : Factor w/ 4 levels "473e3da0-fcca-4808-bd47-614237d76293",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ suspension_type                         : Factor w/ 1 level "nucleus": 1 1 1 1 1 1 1 1 1 1 ...
 $ tissue_handling_interval                : Factor w/ 1 level "<2 hours": 1 1 1 1 1 1 1 1 1 1 ...
 $ library_uuid                            : Factor w/ 4 levels "7b00afb9-2727-44e0-97c5-918c20c03836",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ assay_ontology_term_id                  : Factor w/ 1 level "EFO:0009922": 1 1 1 1 1 1 1 1 1 1 ...
 $ library_starting_quantity               : Factor w/ 1 level "200-1000 nuclei": 1 1 1 1 1 1 1 1 1 1 ...
 $ sequencing_platform                     : Factor w/ 1 level "Illumina NovaSeq 6000": 1 1 1 1 1 1 1 1 1 1 ...
 $ is_primary_data                         : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ cell_type_ontology_term_id              : Factor w/ 11 levels "CL:0000057","CL:0000136",..: 4 4 4 3 3 3 10 4 3 3 ...
 $ author_cell_type                        : Factor w/ 12 levels "Adipocytes","Fast-twitch skeletal muscle cells",..: 2 2 2 10 10 10 11 2 10 10 ...
 $ disease_ontology_term_id                : Factor w/ 1 level "PATO:0000461": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex_ontology_term_id                    : Factor w/ 2 levels "PATO:0000383",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ nCount_RNA                              : num  3296 2368 1839 4179 2098 ...
 $ nFeature_RNA                            : int  1588 1128 948 1697 1235 1786 1886 2075 2001 2471 ...
 $ nUMI                                    : num  3296 2368 1839 4179 2098 ...
 $ nGene                                   : int  1588 1128 948 1697 1235 1786 1886 2075 2001 2471 ...
 $ mitoRatio                               : num  0.00698 0.01267 0.01468 0.00479 0.01192 ...
 $ scDblFinder.score                       : num  0.34001 0.00362 0.05401 0.00432 0.00243 ...
 $ decontX_contamination                   : num  0.0332 0.0259 0.0414 0.0208 0.1275 ...
 $ cell_type                               : Factor w/ 11 levels "fibroblast","adipocyte",..: 4 4 4 3 3 3 10 4 3 3 ...
 $ assay                                   : Factor w/ 1 level "10x 3' v3": 1 1 1 1 1 1 1 1 1 1 ...
 $ disease                                 : Factor w/ 1 level "normal": 1 1 1 1 1 1 1 1 1 1 ...
 $ organism                                : Factor w/ 1 level "Homo sapiens": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex                                     : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
 $ tissue                                  : Factor w/ 1 level "tendon of semitendinosus": 1 1 1 1 1 1 1 1 1 1 ...
 $ self_reported_ethnicity                 : Factor w/ 2 levels "British","unknown": 1 1 1 1 1 1 1 1 1 1 ...
 $ development_stage                       : Factor w/ 4 levels "18-year-old human stage",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ observation_joinid                      : chr  "HC?&7#nKqT" "A@!z4^48C4" "iPJXgW&?ZN" "r6h~KoRHg6" ...

Yanay1 commented 3 weeks ago

The IMA has data that is not from cell x gene so that wouldn't be possible.

For the cell x gene data, I would recommend using their models api: https://cellxgene.cziscience.com/census-models which lets you query UCE embeddings and cell x gene metadata for all cells on cell x gene (a superset of the UCE human and mouse training data)*

The api lets you make really nice and complex queries.

*there are a few additional human and mouse datasets in the training data that might not be from cell x gene.
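For readers who want to try the route described in this comment, here is a minimal sketch using the `cellxgene_census` Python package. The Census version string and the embedding name `"uce"` are assumptions that should be checked against the census-models page, and `build_obs_filter` and `fetch_cells_with_uce` are hypothetical helpers, not part of the package.

```python
def build_obs_filter(**equals):
    """Join `column == value` conditions with `and` into a Census value filter."""
    parts = []
    for column, value in equals.items():
        if isinstance(value, bool):
            parts.append(f"{column} == {value}")
        else:
            parts.append(f"{column} == '{value}'")
    return " and ".join(parts)


def fetch_cells_with_uce(obs_filter, census_version="2023-12-15"):
    """Query Census for matching human cells and attach hosted embeddings.

    Requires network access and the third-party cellxgene_census package.
    The version and the "uce" embedding name are assumptions to verify.
    """
    import cellxgene_census  # imported lazily so the helper above works without it

    with cellxgene_census.open_soma(census_version=census_version) as census:
        return cellxgene_census.get_anndata(
            census,
            organism="Homo sapiens",
            obs_value_filter=obs_filter,
            obs_embeddings=["uce"],
        )


# Ontology IDs taken from the CELLxGENE example earlier in this issue.
obs_filter = build_obs_filter(
    tissue_ontology_term_id="UBERON:8480009",
    is_primary_data=True,
)
print(obs_filter)
```

Calling `fetch_cells_with_uce(obs_filter)` should return an AnnData object whose `obs` carries the CELLxGENE metadata, including the ontology term ID columns, with the requested embedding stored in `obsm`.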

bschilder commented 3 weeks ago

> The IMA has data that is not from cell x gene so that wouldn't be possible.

I see. Even so, could you not just provide the metadata for the CELLxGENE subset and set NaN for the non-CELLxGENE datasets? My impression from the preprint was that the vast majority of the IMA was from CELLxGENE.

> For the cell x gene data, I would recommend using their models api: https://cellxgene.cziscience.com/census-models which lets you query UCE embeddings and cell x gene metadata for all cells on cell x gene (a superset of the UCE human and mouse training data)*
>
> The api lets you make really nice and complex queries.
>
> *there are a few additional human and mouse datasets in the training data that might not be from cell x gene.

This is great to know, thanks! I'll check this out in tandem.
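The suggestion earlier in this comment, keeping ontology IDs for the CELLxGENE subset and leaving NaN elsewhere, could look roughly like the sketch below. It assumes each cell has some key that can be matched back to Census (the IMA `idx` field is used here purely as a stand-in; whether it can serve that purpose is not known), and `census_ids` is a hypothetical lookup table.

```python
import math


def attach_ontology_ids(cells, census_ids, id_field="cell_type_ontology_term_id"):
    """Return copies of the cell records with id_field filled in.

    Cells whose key is absent from census_ids (e.g. non-CELLxGENE datasets)
    get NaN instead of an ontology ID. The input records are not modified.
    """
    annotated = []
    for cell in cells:
        row = dict(cell)
        row[id_field] = census_ids.get(cell["idx"], math.nan)
        annotated.append(row)
    return annotated


# Hypothetical records: one cell that maps to Census, one that does not.
cells = [
    {"idx": 0, "cell_type": "fibroblast", "dataset": "from_cellxgene"},
    {"idx": 1, "cell_type": "fibroblast", "dataset": "not_from_cellxgene"},
]
census_ids = {0: "CL:0000057"}  # hypothetical key -> ontology ID lookup

for row in attach_ontology_ids(cells, census_ids):
    print(row)
```

The same pattern applies to the tissue and organism ID columns; on an AnnData object it would typically be done as a left join onto `obs`.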

bschilder commented 3 weeks ago

@Yanay1 would you mind reopening this issue? I don't think the fact that IMA has a few non-CELLxGENE Census datasets is a reason to not include the ontology IDs for any of the cells.

Thanks!