Closed davemcg closed 3 years ago
You should stay tuned for upcoming features in a few weeks. This is not possible right now.
Edit: well, do the cells come from one of the same batches as the cells used to train the model?
No, they do not.
I will wait.
I should have a beta release out soon you can try (with tutorial)
Oooooooooo
The tutorial will be on the latest version of the docs in the next 10 minutes (online updates in the title). You can use the new beta I just released on PyPI (0.8.0 beta).
The downside is that we broke the ability to load models trained on older versions...
You can read the release notes on the latest docs version too.
We are waiting for a few things before we make an official announcement.
What's the interplay of scArches and scVI? Did they just yank the code over or is there a deeper collab? https://scarchest.readthedocs.io/en/latest/api/index.html
They will mostly wrap our API and enable you to deposit models to their model repository. Functionality should be exactly the same.
We will also have this implemented for multiple covariates soon.
Let me know how it works for you!
Had a hiccup at this step. I believe I followed along properly until now:
adata_full.obsm["X_scVI"] = vae_q.get_latent_representation(adata_full)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-42-7353683ad730> in <module>
1 #adata_full.obsm["X_scVI"] =
----> 2 vae_q.get_latent_representation(adata_full)
/data/mcgaugheyd/conda/envs/scvitools/lib/python3.7/site-packages/torch/autograd/grad_mode.py in decorate_no_grad(*args, **kwargs)
47 def decorate_no_grad(*args, **kwargs):
48 with self:
---> 49 return func(*args, **kwargs)
50 return decorate_no_grad
51
/data/mcgaugheyd/conda/envs/scvitools/lib/python3.7/site-packages/scvi/core/models/vaemixin.py in get_latent_representation(self, adata, indices, give_mean, mc_samples, batch_size)
205 raise RuntimeError("Please train the model first.")
206
--> 207 adata = self._validate_anndata(adata)
208 scdl = self._make_scvi_dl(adata=adata, indices=indices, batch_size=batch_size)
209 latent = []
/data/mcgaugheyd/conda/envs/scvitools/lib/python3.7/site-packages/scvi/core/models/base.py in _validate_anndata(self, adata, copy_if_view)
90 transfer_anndata_setup(self.scvi_setup_dict_, adata)
91 is_nonneg_int = _check_nonnegative_integers(
---> 92 get_from_registry(adata, _CONSTANTS.X_KEY)
93 )
94 if not is_nonneg_int:
/data/mcgaugheyd/conda/envs/scvitools/lib/python3.7/site-packages/scvi/data/_anndata.py in get_from_registry(adata, key)
61 [0]])
62 """
---> 63 data_loc = adata.uns["_scvi"]["data_registry"][key]
64 attr_name, attr_key = data_loc["attr_name"], data_loc["attr_key"]
65
KeyError: 'data_registry'
Let me know if you want a fuller reproduction.
raise RuntimeError("Please train the model first.")
Is this the issue? But I did train it... (sorry for the pic, but I'm rushing a bit now):
Oh, didn't read all the way down this time:
ValueError: Making .obs["batch"] categorical failed. Expected categories: ['E-MTAB-7316_10xv2_Donor1' 'E-MTAB-7316_10xv2_Donor2'
'E-MTAB-7316_10xv2_Donor3' 'OGVFB_Hufnagel_iPSC_RPE_10xv2_None'
'SRP151023_10xv2_NA' 'SRP170761_10xv2_NA' 'SRP194595_10xv3_Donor1'
'SRP194595_10xv3_Donor2' 'SRP194595_10xv3_Donor3'
'SRP218652_10xv3_donor1' 'SRP218652_10xv3_donor2'
'SRP218652_10xv3_donor3' 'SRP218652_10xv3_donor4'
'SRP218652_10xv3_donor5' 'SRP218652_10xv3_donor6'
'SRP218652_10xv3_donor7' 'SRP222001_10xv2_retina1'
'SRP222001_10xv2_retina2' 'SRP222001_10xv2_retina3'
'SRP222958_DropSeq_retina2' 'SRP222958_DropSeq_retina6'
'SRP222958_DropSeq_retina8' 'SRP223254_10xv2_NA' 'SRP223254_10xv2_rep2'
'SRP238587_10xv2_NA' 'SRP255195_10xv2_H1' 'SRP255195_10xv2_H2'
'SRP255195_10xv2_H3' 'SRP255195_10xv2_H4' 'SRP255195_10xv2_H5'
'SRP255195_10xv3_H1' 'SRP257883_10xv3_donor_22'
'SRP257883_10xv3_donor_23' 'SRP257883_10xv3_donor_24'
'SRP257883_10xv3_donor_25' 'SRP050054_DropSeq_retina1'
'SRP050054_DropSeq_retina2' 'SRP050054_DropSeq_retina3'
'SRP050054_DropSeq_retina4' 'SRP050054_DropSeq_retina5'
'SRP050054_DropSeq_retina6' 'SRP050054_DropSeq_retina7'
'SRP075719_DropSeq_Batch1' 'SRP075719_DropSeq_Batch2'
'SRP131661_10xv2_3-F-56' 'SRP131661_10xv2_3-F-57'
'SRP131661_10xv2_3-M-5/6' 'SRP131661_10xv2_3-M-7/8'
'SRP131661_10xv2_3-M-8' 'SRP131661_10xv2_3-M-8/9' 'SRP131661_10xv2_3-M-9'
'SRP157927_10xv2_Macaque1' 'SRP157927_10xv2_Macaque2'
'SRP157927_10xv2_Macaque3' 'SRP157927_10xv2_Macaque4'
'SRP158081_10xv2_Rep1' 'SRP158081_10xv2_Rep2' 'SRP158081_10xv2_Rep3'
'SRP158528_10xv2_Macaque1' 'SRP158528_10xv2_Macaque2'
'SRP158528_10xv2_Macaque3' 'SRP158528_10xv2_Macaque4'
'SRP166660_10xv2_run1' 'SRP166660_10xv2_run2' 'SRP168426_10xv2_E2'
'SRP168426_10xv2_F2' 'SRP186407_10xv2_NA' 'SRP200599_10xv2_NA'
'SRP212151_10xv2_Batch1' 'SRP212151_10xv2_Batch2'
'SRP212151_10xv2_Batch3']. Received categories: Index(['0', '1'], dtype='object').
So I've borked the batch somehow...
Ah:
adata_full.obs['batch']
Out[47]:
AAACCTGAGAGCCCAA_SRS3044263-0 0
AAACCTGAGCTAGTCT_SRS3044263-0 0
AAACCTGAGGATGTAT_SRS3044263-0 0
AAACCTGAGGGATACC_SRS3044263-0 0
AAACCTGAGTTTAGGA_SRS3044263-0 0
..
TGTTCGAAAGCG_SRS5421626-1 1
TTAGCAAACGCA_SRS5421626-1 1
TTAGTTCAGTTA_SRS5421626-1 1
TTGAATCCACAC_SRS5421626-1 1
TTTAGTGATACG_SRS5421626-1 1
Name: batch, Length: 1015617, dtype: category
Categories (2, object): ['0', '1']
Sorry for the real-time stupidity. When I get a minute I'll review how I borked the concatenation...
Ah....I was using "batch" as the study/covariate value...which anndata overwrites when you concatenate....oi. All is fine (for now).
Ah yes, that's an unfortunate side effect of that concatenate function. They have a new concat function that avoids this issue, I believe, but you could also just set the batch_key of concatenate. We should probably make that more clear in the tutorial...
First attempt looks ... terrible and unusable.
So bad, in fact, that I went through step by step to see if I messed up. And I did: I messed up a join and fed UMAP a 1e6 x 2000 matrix of NAs, which also explains why it took 16 hours to run.
So I fixed the join (I didn't realize that anndata/scanpy appended a -0 and a -1 to the reference/query cell labels), and ran again.
And wow... this looks VERY VERY amazing. I haven't run the query-only UMAP, but I expect it should look pretty similar based on what you were showing in the tutorial.
Briefly, the human data (~400,000 cells) was the ref, and I projected the macaque and mouse data (~600,000 cells) onto it as the query. The big gap in the macaque data on the left edge / middle is 100% expected, as those are the progenitor populations that don't exist in the macaque data. The colors look a touch noisy as they are cell type labels (NOT clusters).
I'm really excited, as when I previously ran some prototype scVI runs on the species separately I was getting some wonky results, with the human looking great but the mouse looking odd (not bad, just weird). So I decided to go crazy and try using the human as the good ref and projecting everything else onto it.
This is super duper exciting as this provides a path where other people can project their data onto mine (as long as they match the kallisto / reference transcriptome) without having to download this huge dataset.
What you might try next is either 1) use scANVI, or 2) train a classifier (any, really) on the reference latent space with your cell type labels, as the reference latent representation won't change (if you use tutorial params, etc.).
Then users can project into your dataset and get cell type labels for free, without having to compute nearest neighbors.
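A minimal sketch of option 2, using a random forest as a stand-in classifier. The latent matrices below are random placeholders; in practice they would come from get_latent_representation() on the reference model and the projected query:

```python
# Sketch: train any classifier on the reference latent space, then label
# projected query cells without nearest-neighbor lookups.
# The latent matrices and cell type names here are invented placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
ref_latent = rng.normal(size=(200, 10))    # n_ref_cells x n_latent (placeholder)
ref_labels = rng.choice(["rod", "cone", "RGC"], size=200)
query_latent = rng.normal(size=(50, 10))   # projected query cells (placeholder)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(ref_latent, ref_labels)
query_labels = clf.predict(query_latent)   # one predicted cell type per query cell
```

Because the reference latent space is frozen during query projection, this classifier only needs to be trained once and can then be shipped alongside the model.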
Ah, that's an excellent point. We do have an xgboost model for cell type prediction that is mostly based on the scVI latent dims (not throwing any shade on scANVI; the base code for this predates scANVI, plus we had some odd stuff we were tossing in).
I've been running our benchmarking suite on mouse/macaque projected onto human, and it performs better than scVI attempting to integrate everything at once. Extra bonus!
I've been referring to this as scVI projection; is there an alternative term I should be using?
Apologies if this should be a feature request, but given I'm asking about a "beta" feature....
var_names for adata passed in does not match var_names of adata used to train the
model. For valid results, the vars need to be the same and in the same order as the
adata used to train the model.
Is it possible for scvi.model.SCVI.load_query_data to subset and re-arrange the adata object to match the var_names in the scVI model?
This just tripped me up and I imagine other people will get confused.
I'd also argue that scvi.model.SCVI.load_query_data should error if the var_names are missing from the adata object provided.
Is it possible for scvi.model.SCVI.load_query_data to subset and re-arrange the adata object to match the var_names in the scVI model?
It's possible but I was hesitant to go around manipulating anndata objects in memory. What do you think?
I'd also argue that the scvi.model.SCVI.load_query_data should error if the var_names are missing from the adata object provided.
This also makes sense to me.
I added this as an option -- it will error out if gene names are missing in query.
Is your feature request related to a problem? Please describe.
If I've gone through the trouble of making a scVI --> UMAP representation, I'd like to be able to keep the existing UMAP projection when I add a (relatively) small number of cells. This is possible with UMAP (https://cran.r-project.org/web/packages/umap/vignettes/umap.html), but it would require the new data to have equivalent scVI latent dimensions.
Describe the solution you'd like
Is there a path towards re-using an existing scVI model to incorporate new data?