Dumb Question - project new data onto existing data with scVI

davemcg commented 4 years ago

Is your feature request related to a problem? Please describe. If I've gone through the trouble of making a scVI --> UMAP representation I'd like if I could "keep" the exisiting UMAP projection if I add a (relatively) small number of cells. This is possible with UMAP (https://cran.r-project.org/web/packages/umap/vignettes/umap.html). But this would require the new data to have equivalent scVI latent dimensions.

Describe the solution you'd like Is there a path towards re-using an existing scVI model to incorporate new data?

adamgayoso commented 4 years ago

You should stay tuned for upcoming features in a few weeks. This is not possible right now.

Edit: well, do the cells come from one of the same batches as the cells used to train the model?

davemcg commented 4 years ago

No, they do not.

I will wait.

adamgayoso commented 4 years ago

I should have a beta release out soon you can try (with tutorial)

davemcg commented 4 years ago

Oooooooooo

adamgayoso commented 4 years ago

The tutorial will be on the latest version of the docs in the next 10 minutes (online updates in the title). You can use the new beta I just released on pypi (0.8.0 beta).

The downside is that we broke the ability to load models trained on older versions...

You can read the release notes on the latest docs versino too.

We are waiting for a few things before we make an official announcement.

davemcg commented 4 years ago

Cool - this? https://www.scvi-tools.org/en/latest/user_guide/notebooks/scarches_scvi_tools.html

davemcg commented 4 years ago

What's the interplay of scArches and scVI? Did they just yank the code over or is there a deeper collab? https://scarchest.readthedocs.io/en/latest/api/index.html

adamgayoso commented 4 years ago

They will mostly wrap our API and enable you to deposit models to their model repository. Functionality should be exactly the same.

We will also have this implemented for multiple covariates soon.

adamgayoso commented 4 years ago

Let me know how it works for you!

davemcg commented 4 years ago

Had a hiccup at this step. I believe I followed along properly up 'till now:

adata_full.obsm["X_scVI"] = vae_q.get_latent_representation(adata_full)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-42-7353683ad730> in <module>
      1 #adata_full.obsm["X_scVI"] =
----> 2 vae_q.get_latent_representation(adata_full)

/data/mcgaugheyd/conda/envs/scvitools/lib/python3.7/site-packages/torch/autograd/grad_mode.py in decorate_no_grad(*args, **kwargs)
     47         def decorate_no_grad(*args, **kwargs):
     48             with self:
---> 49                 return func(*args, **kwargs)
     50         return decorate_no_grad
     51 

/data/mcgaugheyd/conda/envs/scvitools/lib/python3.7/site-packages/scvi/core/models/vaemixin.py in get_latent_representation(self, adata, indices, give_mean, mc_samples, batch_size)
    205             raise RuntimeError("Please train the model first.")
    206 
--> 207         adata = self._validate_anndata(adata)
    208         scdl = self._make_scvi_dl(adata=adata, indices=indices, batch_size=batch_size)
    209         latent = []

/data/mcgaugheyd/conda/envs/scvitools/lib/python3.7/site-packages/scvi/core/models/base.py in _validate_anndata(self, adata, copy_if_view)
     90             transfer_anndata_setup(self.scvi_setup_dict_, adata)
     91         is_nonneg_int = _check_nonnegative_integers(
---> 92             get_from_registry(adata, _CONSTANTS.X_KEY)
     93         )
     94         if not is_nonneg_int:

/data/mcgaugheyd/conda/envs/scvitools/lib/python3.7/site-packages/scvi/data/_anndata.py in get_from_registry(adata, key)
     61            [0]])
     62     """
---> 63     data_loc = adata.uns["_scvi"]["data_registry"][key]
     64     attr_name, attr_key = data_loc["attr_name"], data_loc["attr_key"]
     65 

KeyError: 'data_registry'

Let me know if you want a fuller reproduction.

davemcg commented 4 years ago

raise RuntimeError("Please train the model first.")

Is the issue? But I did train it....(sorry for the pic but I'm rushing a bit now):

davemcg commented 4 years ago

Oh, didn't read all the way down this time:

ValueError: Making .obs["batch"] categorical failed. Expected categories: ['E-MTAB-7316_10xv2_Donor1' 'E-MTAB-7316_10xv2_Donor2'
 'E-MTAB-7316_10xv2_Donor3' 'OGVFB_Hufnagel_iPSC_RPE_10xv2_None'
 'SRP151023_10xv2_NA' 'SRP170761_10xv2_NA' 'SRP194595_10xv3_Donor1'
 'SRP194595_10xv3_Donor2' 'SRP194595_10xv3_Donor3'
 'SRP218652_10xv3_donor1' 'SRP218652_10xv3_donor2'
 'SRP218652_10xv3_donor3' 'SRP218652_10xv3_donor4'
 'SRP218652_10xv3_donor5' 'SRP218652_10xv3_donor6'
 'SRP218652_10xv3_donor7' 'SRP222001_10xv2_retina1'
 'SRP222001_10xv2_retina2' 'SRP222001_10xv2_retina3'
 'SRP222958_DropSeq_retina2' 'SRP222958_DropSeq_retina6'
 'SRP222958_DropSeq_retina8' 'SRP223254_10xv2_NA' 'SRP223254_10xv2_rep2'
 'SRP238587_10xv2_NA' 'SRP255195_10xv2_H1' 'SRP255195_10xv2_H2'
 'SRP255195_10xv2_H3' 'SRP255195_10xv2_H4' 'SRP255195_10xv2_H5'
 'SRP255195_10xv3_H1' 'SRP257883_10xv3_donor_22'
 'SRP257883_10xv3_donor_23' 'SRP257883_10xv3_donor_24'
 'SRP257883_10xv3_donor_25' 'SRP050054_DropSeq_retina1'
 'SRP050054_DropSeq_retina2' 'SRP050054_DropSeq_retina3'
 'SRP050054_DropSeq_retina4' 'SRP050054_DropSeq_retina5'
 'SRP050054_DropSeq_retina6' 'SRP050054_DropSeq_retina7'
 'SRP075719_DropSeq_Batch1' 'SRP075719_DropSeq_Batch2'
 'SRP131661_10xv2_3-F-56' 'SRP131661_10xv2_3-F-57'
 'SRP131661_10xv2_3-M-5/6' 'SRP131661_10xv2_3-M-7/8'
 'SRP131661_10xv2_3-M-8' 'SRP131661_10xv2_3-M-8/9' 'SRP131661_10xv2_3-M-9'
 'SRP157927_10xv2_Macaque1' 'SRP157927_10xv2_Macaque2'
 'SRP157927_10xv2_Macaque3' 'SRP157927_10xv2_Macaque4'
 'SRP158081_10xv2_Rep1' 'SRP158081_10xv2_Rep2' 'SRP158081_10xv2_Rep3'
 'SRP158528_10xv2_Macaque1' 'SRP158528_10xv2_Macaque2'
 'SRP158528_10xv2_Macaque3' 'SRP158528_10xv2_Macaque4'
 'SRP166660_10xv2_run1' 'SRP166660_10xv2_run2' 'SRP168426_10xv2_E2'
 'SRP168426_10xv2_F2' 'SRP186407_10xv2_NA' 'SRP200599_10xv2_NA'
 'SRP212151_10xv2_Batch1' 'SRP212151_10xv2_Batch2'
 'SRP212151_10xv2_Batch3']. Received categories: Index(['0', '1'], dtype='object').

davemcg commented 4 years ago

So I've borked the batch somehow...

Ah:


adata_full.obs['batch']
Out[47]:
AAACCTGAGAGCCCAA_SRS3044263-0    0
AAACCTGAGCTAGTCT_SRS3044263-0    0
AAACCTGAGGATGTAT_SRS3044263-0    0
AAACCTGAGGGATACC_SRS3044263-0    0
AAACCTGAGTTTAGGA_SRS3044263-0    0
                                ..
TGTTCGAAAGCG_SRS5421626-1        1
TTAGCAAACGCA_SRS5421626-1        1
TTAGTTCAGTTA_SRS5421626-1        1
TTGAATCCACAC_SRS5421626-1        1
TTTAGTGATACG_SRS5421626-1        1
Name: batch, Length: 1015617, dtype: category
Categories (2, object): ['0', '1']

davemcg commented 4 years ago

Sorry for the real time stupidity. When I get a minute I'll review how I borked the concatenation....

davemcg commented 4 years ago

Ah....I was using "batch" as the study/covariate value...which anndata overwrites when you concatenate....oi. All is fine (for now).

adamgayoso commented 4 years ago

Ah yes, that's an unfortunate side effect of that concatenate function. They have a new concat function that avoids this issue I believe, but you could also just set the batch_key of concatenate. We should probably make that more clear in the tutorial...

davemcg commented 4 years ago

First attempt looks ... terrible and unusable.

So bad in fact, I went through step by step to see if I messed up. And I did. I messed up a join and fed UMAP a 1e6 x 2000 matrix of NA. Which also explains why it took 16 hours to run.

So I fixed the join (I didn't realize the anndata/scanpy appended a -0 and a -1 to the reference / query cell labels), and ran again.

And wow....this looks VERY VERY amazing. I haven't run the query-only UMAP, but I I expect that it should look pretty similar based on what you were showing in the tutorial.

Briefly, the human data (~400,000 cells) was the ref and I projected the macaque and mouse data (~600,000) onto it (query). The big gap in the macaque data on the left edge / middle is 100% expected as those are the progenitor populations that don't exist in the macaque data. The colors look a touch noisy as they are cell type labels (NOT clusters).

I'm really excited as when I previously ran some prototype scVI runs on the species separately I was getting some wonky results with the human looking great but the mouse looking odd (not bad, just weird). So I decided to go crazy and try use the human as the good ref and project everything else onto it.

This is super duper exciting as this provides a path where other people can project their data onto mine (as long as they match the kallisto / reference transcriptome) without having to download this huge dataset.

adamgayoso commented 4 years ago

What you might try next is either 1) use scANVI, or 2) train a classifier (any really) on the reference latent space with your cell type labels, as the reference latent representation won't change (if you use tutorial params etc.).

Then, user can project into your dataset and get cell type labels for free without having to compute nearest neighbors.

davemcg commented 3 years ago

Ah, that's an excellent point. We do have an xgboost model for cell type prediction that is mostly based on the scVI latent dims (not throwing any shade on scANVI, the base code for this predates scANVI plus we had some odd stuff we were tossing in).

I've been running our benchmarking suite on mouse/macaque projected onto human and it performs better than scVI attempting to integrated all at once. Extra bonus!

I've been referring to this as scVI projection; is there a alternative term that I should be using?

davemcg commented 3 years ago

Apologies if this should be a feature request, but given I'm asking about a "beta" feature....

var_names for adata passed in does not match var_names of adata used to train the  
          model. For valid results, the vars need to be the same and in the same order as the
          adata used to train the model.

Is it possible for scvi.model.SCVI.load_query_data to subset and re-arrange the adata object to match the var_names in the scVI model?

This just tripped me up and I imagine other people will get confused.

davemcg commented 3 years ago

I'd also argue that the scvi.model.SCVI.load_query_data should error if the var_names are missing into the adata object provided.

adamgayoso commented 3 years ago

Is it possible for scvi.model.SCVI.load_query_data to subset and re-arrange the adata object to match the var_names in the scVI model?

It's possible but I was hesitant to go around manipulating anndata objects in memory. What do you think?

I'd also argue that the scvi.model.SCVI.load_query_data should error if the var_names are missing into the adata object provided.

This also makes sense to me.

adamgayoso commented 3 years ago

I added this as an option -- it will error out if gene names are missing in query.

scverse / scvi-tools

Dumb Question - project new data onto existing data with scVI #849