Closed davemcg closed 3 years ago
Oh fudge. The "reference" data is a mix of data that was used to train the scVI model and a mix of data it hasn't seen before. I was getting all fixated on the "Index" thing that I'm guessing the actual issue is that scVI is angry that there are batch
types that are not in the existing model.....
OK, so you run scVI once:
You get more data
My question is: Can you then save vae_query and continue the process?
e.g. Get even more data in the future
My question is: Can you then save vae_query and continue the process?
Yes you can (though we haven't explicitly tested it, but it should work).
But recall that the core part of the model didn't change. So the outcome of vae_query2 update would be the same if you just used the new query data the first time. In other words, the model is like 99% (don't quote me here) frozen and you're projecting into the reference latent space. And in other other words, the first query update didn't "improve" the model.
https://colab.research.google.com/github/davemcg/scEiaD/blob/colab/Query_scEiaD_with_scVI.ipynb
OK, I've updated the document. I'm still having a problem where scVI is complaining about the categorical labels being in the wrong order. I "worked around" that by just running get_latent_dimensions
for the "ref" and the "query" separately. Then I glue the anndata objects together.
Is this "kosher"?
adata_scEiaD_ref_HVG.obsm['X_scvi'] = vae_query.get_latent_representation(adata_scEiaD_ref_HVG)
adata_query_HVG.obsm['X_scvi'] = vae_query.get_latent_representation(adata_query_HVG)
adata_full_HVG = adata_query_HVG.concatenate(adata_scEiaD_ref_HVG, batch_key = 'bkey')
I know I'm supposed to something like this:
adata_full_HVG = adata_query_HVG.concatenate(adata_scEiaD_ref_HVG, batch_key = 'bkey')
scvi.data.setup_anndata(adata_full_HVG, batch_key="batch")
adata_full_HVG['X_scvi'] = vae_query.get_latent_representation(adata_full_HVG)
But I'm getting this error:
INFO Using batches from adata.obs["batch"]
INFO No label_key inputted, assuming all cells have same label
INFO Using data from adata.X
INFO Computing library size prior per batch
INFO Successfully registered anndata object containing 50674 cells, 5000
vars, 81 batches, 1 labels, and 0 proteins. Also registered 0 extra
categorical covariates and 0 extra continuous covariates.
INFO Please do not further modify adata until model is trained.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-a9d4466488fd> in <module>()
1 adata_full_HVG = adata_query_HVG.concatenate(adata_scEiaD_ref_HVG, batch_key = 'bkey')
2 scvi.data.setup_anndata(adata_full_HVG, batch_key="batch")
----> 3 adata_full_HVG['X_scvi'] = vae_query.get_latent_representation(adata_full_HVG)
4 frames
/usr/local/lib/python3.6/dist-packages/scvi/data/_anndata.py in _needs_transfer(mapping1, mapping2, category)
1194 logger.warning(warning_msg.format(category, mapping1, mapping2))
1195 else:
-> 1196 raise ValueError(error_msg.format(category, mapping1, mapping2))
1197 return needs_transfer
1198
ValueError: Categorial encoding for batch is not the same between the anndata used to train and the anndata passed in. Categorical encoding needs to be same elements, same order, and same datatype.
Expected categories: ['E-MTAB-7316_10xv2_Donor1' 'E-MTAB-7316_10xv2_Donor2'
'E-MTAB-7316_10xv2_Donor3' 'EGAD00001006350_10xv2_D1_Ch'
'EGAD00001006350_10xv2_D1_Re' 'EGAD00001006350_10xv2_D3_Ch'
'EGAD00001006350_10xv2_D3_Re' 'EGAD00001006350_10xv2_D4_Ch'
'EGAD00001006350_10xv2_D4_Re' 'SRP151023_10xv2_NA' 'SRP170761_10xv2_NA'
'SRP222001_10xv2_retina1' 'SRP222001_10xv2_retina2'
'SRP222001_10xv2_retina3' 'SRP222958_DropSeq_retina2'
'SRP222958_DropSeq_retina6' 'SRP222958_DropSeq_retina8'
'SRP223254_10xv2_NA' 'SRP223254_10xv2_rep2' 'SRP238587_10xv2_NA'
'SRP255195_10xv2_H1' 'SRP255195_10xv2_H2' 'SRP255195_10xv2_H3'
'SRP255195_10xv2_H4' 'SRP255195_10xv2_H5'
'OGVFB_Hufnagel_iPSC_RPE_10xv2_None' 'SRP050054_DropSeq_retina1'
'SRP050054_DropSeq_retina2' 'SRP050054_DropSeq_retina3'
'SRP050054_DropSeq_retina4' 'SRP050054_DropSeq_retina5'
'SRP050054_DropSeq_retina6' 'SRP050054_DropSeq_retina7'
'SRP073242_SMARTSeq_v2_NA' 'SRP075719_DropSeq_Batch1'
'SRP075719_DropSeq_Batch2' 'SRP075720_SMARTSeq_v2_Batch1'
'SRP075720_SMARTSeq_v2_Batch2' 'SRP106476_SMARTerSeq_v3_NA'
'SRP131661_10xv2_3-F-56' 'SRP131661_10xv2_3-F-57'
'SRP131661_10xv2_3-M-5/6' 'SRP131661_10xv2_3-M-7/8'
'SRP131661_10xv2_3-M-8' 'SRP131661_10xv2_3-M-8/9' 'SRP131661_10xv2_3-M-9'
'SRP136739_SMARTSeq_v4_NA' 'SRP157927_10xv2_Macaque1'
'SRP157927_10xv2_Macaque2' 'SRP157927_10xv2_Macaque3'
'SRP157927_10xv2_Macaque4' 'SRP158081_10xv2_Rep1' 'SRP158081_10xv2_Rep2'
'SRP158081_10xv2_Rep3' 'SRP158081_SMARTSeq_v2_Rep1'
'SRP158528_10xv2_Macaque1' 'SRP158528_10xv2_Macaque2'
'SRP158528_10xv2_Macaque3' 'SRP158528_10xv2_Macaque4'
'SRP159286_SCRBSeq_NA' 'SRP161678_SMARTSeq_v4_NA' 'SRP166660_10xv2_run1'
'SRP166660_10xv2_run2' 'SRP168426_10xv2_E2' 'SRP168426_10xv2_F2'
'SRP170038_SMARTSeq_v2_NA' 'SRP186396_SMARTSeq_v2_NA'
'SRP186407_10xv2_NA' 'SRP194595_10xv3_Donor1' 'SRP194595_10xv3_Donor2'
'SRP194595_10xv3_Donor3' 'SRP200599_10xv2_NA' 'SRP212151_10xv2_Batch1'
'SRP212151_10xv2_Batch2' 'SRP212151_10xv2_Batch3'
'SRP218652_10xv3_donor1' 'SRP218652_10xv3_donor2'
'SRP218652_10xv3_donor3' 'SRP218652_10xv3_donor4'
'SRP218652_10xv3_donor5' 'SRP218652_10xv3_donor6'
'SRP218652_10xv3_donor7' 'SRP255195_10xv3_H1' 'SRP257883_10xv3_donor_22'
'SRP257883_10xv3_donor_23' 'SRP257883_10xv3_donor_24'
'SRP257883_10xv3_donor_25' 'SRR12130660']. Received categories: ['E-MTAB-7316_10xv2_Donor1' 'E-MTAB-7316_10xv2_Donor2'
'E-MTAB-7316_10xv2_Donor3' 'EGAD00001006350_10xv2_D1_Ch'
'EGAD00001006350_10xv2_D1_Re' 'EGAD00001006350_10xv2_D3_Ch'
'EGAD00001006350_10xv2_D3_Re' 'EGAD00001006350_10xv2_D4_Ch'
'EGAD00001006350_10xv2_D4_Re' 'OGVFB_Hufnagel_iPSC_RPE_10xv2_None'
'SRP050054_DropSeq_retina1' 'SRP050054_DropSeq_retina2'
'SRP050054_DropSeq_retina3' 'SRP050054_DropSeq_retina4'
'SRP050054_DropSeq_retina5' 'SRP050054_DropSeq_retina6'
'SRP050054_DropSeq_retina7' 'SRP073242_SMARTSeq_v2_NA'
'SRP075719_DropSeq_Batch1' 'SRP075719_DropSeq_Batch2'
'SRP075720_SMARTSeq_v2_Batch1' 'SRP075720_SMARTSeq_v2_Batch2'
'SRP106476_SMARTerSeq_v3_NA' 'SRP136739_SMARTSeq_v4_NA'
'SRP151023_10xv2_NA' 'SRP157927_10xv2_Macaque1'
'SRP157927_10xv2_Macaque2' 'SRP157927_10xv2_Macaque3'
'SRP157927_10xv2_Macaque4' 'SRP158081_10xv2_Rep1' 'SRP158081_10xv2_Rep2'
'SRP158081_10xv2_Rep3' 'SRP158081_SMARTSeq_v2_Rep1'
'SRP158528_10xv2_Macaque1' 'SRP158528_10xv2_Macaque2'
'SRP158528_10xv2_Macaque3' 'SRP158528_10xv2_Macaque4'
'SRP159286_SCRBSeq_NA' 'SRP161678_SMARTSeq_v4_NA' 'SRP166660_10xv2_run1'
'SRP166660_10xv2_run2' 'SRP168426_10xv2_E2' 'SRP168426_10xv2_F2'
'SRP170038_SMARTSeq_v2_NA' 'SRP170761_10xv2_NA'
'SRP186396_SMARTSeq_v2_NA' 'SRP186407_10xv2_NA' 'SRP194595_10xv3_Donor1'
'SRP194595_10xv3_Donor2' 'SRP194595_10xv3_Donor3' 'SRP200599_10xv2_NA'
'SRP212151_10xv2_Batch1' 'SRP212151_10xv2_Batch2'
'SRP212151_10xv2_Batch3' 'SRP218652_10xv3_donor1'
'SRP218652_10xv3_donor2' 'SRP218652_10xv3_donor3'
'SRP218652_10xv3_donor4' 'SRP218652_10xv3_donor5'
'SRP218652_10xv3_donor6' 'SRP218652_10xv3_donor7'
'SRP222001_10xv2_retina1' 'SRP222001_10xv2_retina2'
'SRP222001_10xv2_retina3' 'SRP222958_DropSeq_retina2'
'SRP222958_DropSeq_retina6' 'SRP222958_DropSeq_retina8'
'SRP223254_10xv2_NA' 'SRP223254_10xv2_rep2' 'SRP238587_10xv2_NA'
'SRP255195_10xv2_H1' 'SRP255195_10xv2_H2' 'SRP255195_10xv2_H3'
'SRP255195_10xv2_H4' 'SRP255195_10xv2_H5' 'SRP255195_10xv3_H1'
'SRP257883_10xv3_donor_22' 'SRP257883_10xv3_donor_23'
'SRP257883_10xv3_donor_24' 'SRP257883_10xv3_donor_25' 'SRR12130660'].
My question is: Can you then save vae_query and continue the process?
Yes you can (though we haven't explicitly tested it, but it should work).
But recall that the core part of the model didn't change. So the outcome of vae_query2 update would be the same if you just used the new query data the first time. In other words, the model is like 99% (don't quote me here) frozen and you're projecting into the reference latent space. And in other other words, the first query update didn't "improve" the model.
Oops, thought I responded to this! Yes, thank you. I obviously don't fully follow what is happening with vae_query
run...
Please try this:
adata_full_HVG = adata_query_HVG.concatenate(adata_scEiaD_ref_HVG, batch_key = 'bkey')
del adata_full_HVG.uns["_scvi"]
adata_full_HVG['X_scvi'] = vae_query.get_latent_representation(adata_full_HVG)
adata_scEiaD_ref_HVG.obsm['X_scvi'] = vae_query.get_latent_representation(adata_scEiaD_ref_HVG)
adata_query_HVG.obsm['X_scvi'] = vae_query.get_latent_representation(adata_query_HVG)
adata_full_HVG = adata_query_HVG.concatenate(adata_scEiaD_ref_HVG, batch_key = 'bkey'
This also works out.
The issue overall is partially due to anndata concatenation and how our setup dictionary is stored in a concatenated object.
We should add an option to our get...
functions that allows you to force the anndata to be setup appropriately.
Great thanks. It seems that del adata_full_HVG.uns["_scvi"]
is the key line. So this is an index? And should it be missing then it just gets regenerated?
It rebuilds our setup dictionary in the anndata, but I just realized the actual issue is that version 0.8.0 is pinned, where this issue was fixed in 0.8.1. Therefore the original code should work
A few other things:
Going to close this. @davemcg please make sure you use 0.8.1 to avoid this issue!
The issue
vae_query
)vae_query
to extract the latent dims for the full dataThe colab notebook for anyone to run
Error in question
I'm somewhat convinced I've just done some stupid anndata thing.
Versions:
0.8.1