scverse / scvi-tools

Deep probabilistic analysis of single-cell and spatial omics data
http://scvi-tools.org/
BSD 3-Clause "New" or "Revised" License
1.25k stars 352 forks source link

Issue re-running the scVI query model with the concatenated dataset #880

Closed davemcg closed 3 years ago

davemcg commented 3 years ago

The issue

  1. I build an scVI model
  2. I save it
  3. Later I load it the scVI and pull in new and old data
  4. I (succesfully) run the old data with the scVI model (vae_query)
  5. I concatenate new and old data and try to run the vae_query to extract the latent dims for the full data
  6. I get an error about the categories being wrong

The colab notebook for anyone to run

https://github.com/davemcg/scEiaD/blob/colab/colab/Query_scEiaD_with_scVI.ipynb

Error in question

I'm somewhat convinced I've just done some stupid anndata thing.

INFO     Input adata not setup with scvi. attempting to transfer anndata setup  
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-5e8ea921bcb3> in <module>()
      2 adata_full_HVG = adata_full[:, var_names[0]].copy()
      3 adata_full_HVG.obs['batch'] = adata_full_HVG.obs['scEiaD_batch']
----> 4 vae_query.get_latent_representation(adata_full_HVG)
      5 
      6 #adata_full.obsm['X_scvi'] = vae_query.get_latent_representation(adata_scEiaD_HVG)

5 frames
/usr/local/lib/python3.6/dist-packages/scvi/data/_anndata.py in _make_obs_column_categorical(adata, column_key, alternate_column_key, categorical_dtype)
    690         raise ValueError(
    691             'Making .obs["{}"] categorical failed. Expected categories: {}. '
--> 692             "Received categories: {}. ".format(column_key, mapping, received_categories)
    693         )
    694     adata.obs[alternate_column_key] = codes

ValueError: Making .obs["batch"] categorical failed. Expected categories: ['E-MTAB-7316_10xv2_Donor1' 'E-MTAB-7316_10xv2_Donor2'
 'E-MTAB-7316_10xv2_Donor3' 'EGAD00001006350_10xv2_D1_Ch'
 'EGAD00001006350_10xv2_D1_Re' 'EGAD00001006350_10xv2_D3_Ch'
 'EGAD00001006350_10xv2_D3_Re' 'EGAD00001006350_10xv2_D4_Ch'
 'EGAD00001006350_10xv2_D4_Re' 'SRP151023_10xv2_NA' 'SRP170761_10xv2_NA'
 'SRP222001_10xv2_retina1' 'SRP222001_10xv2_retina2'
 'SRP222001_10xv2_retina3' 'SRP222958_DropSeq_retina2'
 'SRP222958_DropSeq_retina6' 'SRP222958_DropSeq_retina8'
 'SRP223254_10xv2_NA' 'SRP223254_10xv2_rep2' 'SRP238587_10xv2_NA'
 'SRP255195_10xv2_H1' 'SRP255195_10xv2_H2' 'SRP255195_10xv2_H3'
 'SRP255195_10xv2_H4' 'SRP255195_10xv2_H5' 'SRR12130660']. Received categories: Index(['E-MTAB-7316_10xv2_Donor1', 'E-MTAB-7316_10xv2_Donor2',
       'E-MTAB-7316_10xv2_Donor3', 'EGAD00001006350_10xv2_D1_Ch',
       'EGAD00001006350_10xv2_D1_Re', 'EGAD00001006350_10xv2_D3_Ch',
       'EGAD00001006350_10xv2_D3_Re', 'EGAD00001006350_10xv2_D4_Ch',
       'EGAD00001006350_10xv2_D4_Re', 'OGVFB_Hufnagel_iPSC_RPE_10xv2_None',
       'SRP050054_DropSeq_retina1', 'SRP050054_DropSeq_retina2',
       'SRP050054_DropSeq_retina3', 'SRP050054_DropSeq_retina4',
       'SRP050054_DropSeq_retina5', 'SRP050054_DropSeq_retina6',
       'SRP050054_DropSeq_retina7', 'SRP073242_SMARTSeq_v2_NA',
       'SRP075719_DropSeq_Batch1', 'SRP075719_DropSeq_Batch2',
       'SRP075720_SMARTSeq_v2_Batch1', 'SRP075720_SMARTSeq_v2_Batch2',
       'SRP106476_SMARTerSeq_v3_NA', 'SRP131661_10xv2_3-F-56',
       'SRP131661_10xv2_3-F-57', 'SRP131661_10xv2_3-M-5/6',
       'SRP131661_10xv2_3-M-7/8', 'SRP131661_10xv2_3-M-8',
       'SRP131661_10xv2_3-M-8/9', 'SRP131661_10xv2_3-M-9',
       'SRP136739_SMARTSeq_v4_NA', 'SRP151023_10xv2_NA',
       'SRP157927_10xv2_Macaque1', 'SRP157927_10xv2_Macaque2',
       'SRP157927_10xv2_Macaque3', 'SRP157927_10xv2_Macaque4',
       'SRP158081_10xv2_Rep1', 'SRP158081_10xv2_Rep2', 'SRP158081_10xv2_Rep3',
       'SRP158081_SMARTSeq_v2_Rep1', 'SRP158528_10xv2_Macaque1',
       'SRP158528_10xv2_Macaque2', 'SRP158528_10xv2_Macaque3',
       'SRP158528_10xv2_Macaque4', 'SRP159286_SCRBSeq_NA',
       'SRP161678_SMARTSeq_v4_NA', 'SRP166660_10xv2_run1',
       'SRP166660_10xv2_run2', 'SRP168426_10xv2_E2', 'SRP168426_10xv2_F2',
       'SRP170038_SMARTSeq_v2_NA', 'SRP170761_10xv2_NA',
       'SRP186396_SMARTSeq_v2_NA', 'SRP186407_10xv2_NA',
       'SRP194595_10xv3_Donor1', 'SRP194595_10xv3_Donor2',
       'SRP194595_10xv3_Donor3', 'SRP200599_10xv2_NA',
       'SRP212151_10xv2_Batch1', 'SRP212151_10xv2_Batch2',
       'SRP212151_10xv2_Batch3', 'SRP218652_10xv3_donor1',
       'SRP218652_10xv3_donor2', 'SRP218652_10xv3_donor3',
       'SRP218652_10xv3_donor4', 'SRP218652_10xv3_donor5',
       'SRP218652_10xv3_donor6', 'SRP218652_10xv3_donor7',
       'SRP222001_10xv2_retina1', 'SRP222001_10xv2_retina2',
       'SRP222001_10xv2_retina3', 'SRP222958_DropSeq_retina2',
       'SRP222958_DropSeq_retina6', 'SRP222958_DropSeq_retina8',
       'SRP223254_10xv2_NA', 'SRP223254_10xv2_rep2', 'SRP238587_10xv2_NA',
       'SRP255195_10xv2_H1', 'SRP255195_10xv2_H2', 'SRP255195_10xv2_H3',
       'SRP255195_10xv2_H4', 'SRP255195_10xv2_H5', 'SRP255195_10xv3_H1',
       'SRP257883_10xv3_donor_22', 'SRP257883_10xv3_donor_23',
       'SRP257883_10xv3_donor_24', 'SRP257883_10xv3_donor_25', 'SRR12130660'],
      dtype='object'). 

Versions:

0.8.1

VERSION

davemcg commented 3 years ago

Oh fudge. The "reference" data is a mix of data that was used to train the scVI model and a mix of data it hasn't seen before. I was getting all fixated on the "Index" thing that I'm guessing the actual issue is that scVI is angry that there are batch types that are not in the existing model.....

davemcg commented 3 years ago

OK, so you run scVI once:

  1. vae_ref
  2. Save vae_ref

You get more data

  1. vae_query by loading vae_ref with new query data

My question is: Can you then save vae_query and continue the process?

e.g. Get even more data in the future

  1. vae_query2 by loading vae_query with new(er) query data
adamgayoso commented 3 years ago

My question is: Can you then save vae_query and continue the process?

Yes you can (though we haven't explicitly tested it, but it should work).

But recall that the core part of the model didn't change. So the outcome of vae_query2 update would be the same if you just used the new query data the first time. In other words, the model is like 99% (don't quote me here) frozen and you're projecting into the reference latent space. And in other other words, the first query update didn't "improve" the model.

davemcg commented 3 years ago

https://colab.research.google.com/github/davemcg/scEiaD/blob/colab/Query_scEiaD_with_scVI.ipynb

OK, I've updated the document. I'm still having a problem where scVI is complaining about the categorical labels being in the wrong order. I "worked around" that by just running get_latent_dimensions for the "ref" and the "query" separately. Then I glue the anndata objects together.

Is this "kosher"?

adata_scEiaD_ref_HVG.obsm['X_scvi'] = vae_query.get_latent_representation(adata_scEiaD_ref_HVG)
adata_query_HVG.obsm['X_scvi'] = vae_query.get_latent_representation(adata_query_HVG)
adata_full_HVG = adata_query_HVG.concatenate(adata_scEiaD_ref_HVG, batch_key = 'bkey')

I know I'm supposed to something like this:

adata_full_HVG = adata_query_HVG.concatenate(adata_scEiaD_ref_HVG, batch_key = 'bkey')
scvi.data.setup_anndata(adata_full_HVG, batch_key="batch")
adata_full_HVG['X_scvi'] = vae_query.get_latent_representation(adata_full_HVG)

But I'm getting this error:

INFO     Using batches from adata.obs["batch"]                                  
INFO     No label_key inputted, assuming all cells have same label              
INFO     Using data from adata.X                                                
INFO     Computing library size prior per batch                                 
INFO     Successfully registered anndata object containing 50674 cells, 5000    
         vars, 81 batches, 1 labels, and 0 proteins. Also registered 0 extra    
         categorical covariates and 0 extra continuous covariates.              
INFO     Please do not further modify adata until model is trained.             
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-a9d4466488fd> in <module>()
      1 adata_full_HVG = adata_query_HVG.concatenate(adata_scEiaD_ref_HVG, batch_key = 'bkey')
      2 scvi.data.setup_anndata(adata_full_HVG, batch_key="batch")
----> 3 adata_full_HVG['X_scvi'] = vae_query.get_latent_representation(adata_full_HVG)

4 frames
/usr/local/lib/python3.6/dist-packages/scvi/data/_anndata.py in _needs_transfer(mapping1, mapping2, category)
   1194         logger.warning(warning_msg.format(category, mapping1, mapping2))
   1195     else:
-> 1196         raise ValueError(error_msg.format(category, mapping1, mapping2))
   1197     return needs_transfer
   1198 

ValueError: Categorial encoding for batch is not the same between the anndata used to train and the anndata passed in. Categorical encoding needs to be same elements, same order, and same datatype.
Expected categories: ['E-MTAB-7316_10xv2_Donor1' 'E-MTAB-7316_10xv2_Donor2'
 'E-MTAB-7316_10xv2_Donor3' 'EGAD00001006350_10xv2_D1_Ch'
 'EGAD00001006350_10xv2_D1_Re' 'EGAD00001006350_10xv2_D3_Ch'
 'EGAD00001006350_10xv2_D3_Re' 'EGAD00001006350_10xv2_D4_Ch'
 'EGAD00001006350_10xv2_D4_Re' 'SRP151023_10xv2_NA' 'SRP170761_10xv2_NA'
 'SRP222001_10xv2_retina1' 'SRP222001_10xv2_retina2'
 'SRP222001_10xv2_retina3' 'SRP222958_DropSeq_retina2'
 'SRP222958_DropSeq_retina6' 'SRP222958_DropSeq_retina8'
 'SRP223254_10xv2_NA' 'SRP223254_10xv2_rep2' 'SRP238587_10xv2_NA'
 'SRP255195_10xv2_H1' 'SRP255195_10xv2_H2' 'SRP255195_10xv2_H3'
 'SRP255195_10xv2_H4' 'SRP255195_10xv2_H5'
 'OGVFB_Hufnagel_iPSC_RPE_10xv2_None' 'SRP050054_DropSeq_retina1'
 'SRP050054_DropSeq_retina2' 'SRP050054_DropSeq_retina3'
 'SRP050054_DropSeq_retina4' 'SRP050054_DropSeq_retina5'
 'SRP050054_DropSeq_retina6' 'SRP050054_DropSeq_retina7'
 'SRP073242_SMARTSeq_v2_NA' 'SRP075719_DropSeq_Batch1'
 'SRP075719_DropSeq_Batch2' 'SRP075720_SMARTSeq_v2_Batch1'
 'SRP075720_SMARTSeq_v2_Batch2' 'SRP106476_SMARTerSeq_v3_NA'
 'SRP131661_10xv2_3-F-56' 'SRP131661_10xv2_3-F-57'
 'SRP131661_10xv2_3-M-5/6' 'SRP131661_10xv2_3-M-7/8'
 'SRP131661_10xv2_3-M-8' 'SRP131661_10xv2_3-M-8/9' 'SRP131661_10xv2_3-M-9'
 'SRP136739_SMARTSeq_v4_NA' 'SRP157927_10xv2_Macaque1'
 'SRP157927_10xv2_Macaque2' 'SRP157927_10xv2_Macaque3'
 'SRP157927_10xv2_Macaque4' 'SRP158081_10xv2_Rep1' 'SRP158081_10xv2_Rep2'
 'SRP158081_10xv2_Rep3' 'SRP158081_SMARTSeq_v2_Rep1'
 'SRP158528_10xv2_Macaque1' 'SRP158528_10xv2_Macaque2'
 'SRP158528_10xv2_Macaque3' 'SRP158528_10xv2_Macaque4'
 'SRP159286_SCRBSeq_NA' 'SRP161678_SMARTSeq_v4_NA' 'SRP166660_10xv2_run1'
 'SRP166660_10xv2_run2' 'SRP168426_10xv2_E2' 'SRP168426_10xv2_F2'
 'SRP170038_SMARTSeq_v2_NA' 'SRP186396_SMARTSeq_v2_NA'
 'SRP186407_10xv2_NA' 'SRP194595_10xv3_Donor1' 'SRP194595_10xv3_Donor2'
 'SRP194595_10xv3_Donor3' 'SRP200599_10xv2_NA' 'SRP212151_10xv2_Batch1'
 'SRP212151_10xv2_Batch2' 'SRP212151_10xv2_Batch3'
 'SRP218652_10xv3_donor1' 'SRP218652_10xv3_donor2'
 'SRP218652_10xv3_donor3' 'SRP218652_10xv3_donor4'
 'SRP218652_10xv3_donor5' 'SRP218652_10xv3_donor6'
 'SRP218652_10xv3_donor7' 'SRP255195_10xv3_H1' 'SRP257883_10xv3_donor_22'
 'SRP257883_10xv3_donor_23' 'SRP257883_10xv3_donor_24'
 'SRP257883_10xv3_donor_25' 'SRR12130660']. Received categories: ['E-MTAB-7316_10xv2_Donor1' 'E-MTAB-7316_10xv2_Donor2'
 'E-MTAB-7316_10xv2_Donor3' 'EGAD00001006350_10xv2_D1_Ch'
 'EGAD00001006350_10xv2_D1_Re' 'EGAD00001006350_10xv2_D3_Ch'
 'EGAD00001006350_10xv2_D3_Re' 'EGAD00001006350_10xv2_D4_Ch'
 'EGAD00001006350_10xv2_D4_Re' 'OGVFB_Hufnagel_iPSC_RPE_10xv2_None'
 'SRP050054_DropSeq_retina1' 'SRP050054_DropSeq_retina2'
 'SRP050054_DropSeq_retina3' 'SRP050054_DropSeq_retina4'
 'SRP050054_DropSeq_retina5' 'SRP050054_DropSeq_retina6'
 'SRP050054_DropSeq_retina7' 'SRP073242_SMARTSeq_v2_NA'
 'SRP075719_DropSeq_Batch1' 'SRP075719_DropSeq_Batch2'
 'SRP075720_SMARTSeq_v2_Batch1' 'SRP075720_SMARTSeq_v2_Batch2'
 'SRP106476_SMARTerSeq_v3_NA' 'SRP136739_SMARTSeq_v4_NA'
 'SRP151023_10xv2_NA' 'SRP157927_10xv2_Macaque1'
 'SRP157927_10xv2_Macaque2' 'SRP157927_10xv2_Macaque3'
 'SRP157927_10xv2_Macaque4' 'SRP158081_10xv2_Rep1' 'SRP158081_10xv2_Rep2'
 'SRP158081_10xv2_Rep3' 'SRP158081_SMARTSeq_v2_Rep1'
 'SRP158528_10xv2_Macaque1' 'SRP158528_10xv2_Macaque2'
 'SRP158528_10xv2_Macaque3' 'SRP158528_10xv2_Macaque4'
 'SRP159286_SCRBSeq_NA' 'SRP161678_SMARTSeq_v4_NA' 'SRP166660_10xv2_run1'
 'SRP166660_10xv2_run2' 'SRP168426_10xv2_E2' 'SRP168426_10xv2_F2'
 'SRP170038_SMARTSeq_v2_NA' 'SRP170761_10xv2_NA'
 'SRP186396_SMARTSeq_v2_NA' 'SRP186407_10xv2_NA' 'SRP194595_10xv3_Donor1'
 'SRP194595_10xv3_Donor2' 'SRP194595_10xv3_Donor3' 'SRP200599_10xv2_NA'
 'SRP212151_10xv2_Batch1' 'SRP212151_10xv2_Batch2'
 'SRP212151_10xv2_Batch3' 'SRP218652_10xv3_donor1'
 'SRP218652_10xv3_donor2' 'SRP218652_10xv3_donor3'
 'SRP218652_10xv3_donor4' 'SRP218652_10xv3_donor5'
 'SRP218652_10xv3_donor6' 'SRP218652_10xv3_donor7'
 'SRP222001_10xv2_retina1' 'SRP222001_10xv2_retina2'
 'SRP222001_10xv2_retina3' 'SRP222958_DropSeq_retina2'
 'SRP222958_DropSeq_retina6' 'SRP222958_DropSeq_retina8'
 'SRP223254_10xv2_NA' 'SRP223254_10xv2_rep2' 'SRP238587_10xv2_NA'
 'SRP255195_10xv2_H1' 'SRP255195_10xv2_H2' 'SRP255195_10xv2_H3'
 'SRP255195_10xv2_H4' 'SRP255195_10xv2_H5' 'SRP255195_10xv3_H1'
 'SRP257883_10xv3_donor_22' 'SRP257883_10xv3_donor_23'
 'SRP257883_10xv3_donor_24' 'SRP257883_10xv3_donor_25' 'SRR12130660'].
davemcg commented 3 years ago

My question is: Can you then save vae_query and continue the process?

Yes you can (though we haven't explicitly tested it, but it should work).

But recall that the core part of the model didn't change. So the outcome of vae_query2 update would be the same if you just used the new query data the first time. In other words, the model is like 99% (don't quote me here) frozen and you're projecting into the reference latent space. And in other other words, the first query update didn't "improve" the model.

Oops, thought I responded to this! Yes, thank you. I obviously don't fully follow what is happening with vae_query run...

adamgayoso commented 3 years ago

Please try this:

adata_full_HVG = adata_query_HVG.concatenate(adata_scEiaD_ref_HVG, batch_key = 'bkey')
del adata_full_HVG.uns["_scvi"]
adata_full_HVG['X_scvi'] = vae_query.get_latent_representation(adata_full_HVG)
adamgayoso commented 3 years ago
adata_scEiaD_ref_HVG.obsm['X_scvi'] = vae_query.get_latent_representation(adata_scEiaD_ref_HVG)
adata_query_HVG.obsm['X_scvi'] = vae_query.get_latent_representation(adata_query_HVG)
adata_full_HVG = adata_query_HVG.concatenate(adata_scEiaD_ref_HVG, batch_key = 'bkey'

This also works out.

The issue overall is partially due to anndata concatenation and how our setup dictionary is stored in a concatenated object.

We should add an option to our get... functions that allows you to force the anndata to be setup appropriately.

davemcg commented 3 years ago

Great thanks. It seems that del adata_full_HVG.uns["_scvi"] is the key line. So this is an index? And should it be missing then it just gets regenerated?

adamgayoso commented 3 years ago

It rebuilds our setup dictionary in the anndata, but I just realized the actual issue is that version 0.8.0 is pinned, where this issue was fixed in 0.8.1. Therefore the original code should work

adamgayoso commented 3 years ago

A few other things:

  1. You don't need to setup the query data explicitly before load query data
  2. You should train query for many more epochs
adamgayoso commented 3 years ago

Going to close this. @davemcg please make sure you use 0.8.1 to avoid this issue!