scverse / scvi-tools

Deep probabilistic analysis of single-cell and spatial omics data
http://scvi-tools.org/
BSD 3-Clause "New" or "Revised" License

Semi-supervised integration with scANVI #698

Closed: krejciadam closed this issue 4 years ago

krejciadam commented 4 years ago

Dear scVI authors, I was trying to use scANVI for semi-supervised data integration, as it seemed to be the first tool able to do something like that (as far as I'm aware). Unfortunately, I failed miserably, so I'm wondering whether I'm doing something wrong or whether the concept of semi-supervised integration is not supported after all.

The concept: What I mean by semi-supervised integration is this: I have 2 datasets.

Dataset A contains: 1) control cells 2) cells exposed to various conditions, possibly very diverse populations

Dataset B contains: 1) control cells 2) cells exposed to various conditions, different from dataset A, possibly very diverse populations.

I know which cells are control cells and which are other, because I've used cell hashing to mark them in the experiment. So each of the datasets contains a subset of specified control cells, that are supposed to have very similar transcriptomes across the two experiments, plus a very diverse set of cells that might be very different from anything in the other dataset. In the case of the following example, especially batch1 contains several very distinct populations of cells.

I would like to use the information about which cells are control cells to help the integration: the control cells alone should drive the integration process, i.e. serve as "anchors", so to speak.

This is what I do:

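# imports assume the scvi 0.x module layout in use when this issue was filed
from scvi.models import SCANVI
from scvi.inference import SemiSupervisedTrainer

# there are 2 batches in the dataset
scanvi = SCANVI(dataset.nb_genes, dataset.n_batches, dataset.n_labels)
trainer_scanvi = SemiSupervisedTrainer(scanvi, dataset, frequency=5)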

# label 0 = control cells
# label 1 = any other cells, possibly very diverse
# only control cells are considered labeled:

trainer_scanvi.labelled_set = trainer_scanvi.create_posterior(indices=(dataset.labels == 0).ravel())
trainer_scanvi.labelled_set.to_monitor = ['reconstruction_error', 'accuracy']
trainer_scanvi.unlabelled_set = trainer_scanvi.create_posterior(indices=(dataset.labels == 1).ravel())
trainer_scanvi.unlabelled_set.to_monitor = ['reconstruction_error', 'accuracy']

trainer_scanvi.train(n_epochs=600)

I'm not interested in labels (obviously, as label transfer is not possible at all with my settings). I'd only like to get a well-integrated latent space. So I take the latent space and plot a UMAP of it:

full = trainer_scanvi.full_dataset
latent, batch_indices, labels = full.sequential().get_latent()
latent_u = UMAP(spread=2).fit_transform(latent)

[Figure_2: UMAP of the scANVI latent space]

I would expect batch0_ctrl and batch1_ctrl cells to be aligned well, which they are not.

In fact, running an unsupervised VAE without any external information does a somewhat better job:

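# imports assume the scvi 0.x module layout in use when this issue was filed
import numpy as np
from scvi.models import VAE
from scvi.inference import UnsupervisedTrainer

vae = VAE(dataset.nb_genes, n_batch=dataset.n_batches, n_labels=2,
          n_hidden=128, n_latent=30, n_layers=2, dispersion='gene')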
trainer = UnsupervisedTrainer(vae, dataset, train_size=1.0, n_epochs_kl_warmup=300)
n_epochs = 500
trainer.train(n_epochs=n_epochs)

full = trainer.create_posterior(trainer.model, merged, indices=np.arange(len(merged)))
latent, batch_indices, labels = full.sequential().get_latent()
latent_u = UMAP(spread=2).fit_transform(latent)

[Figure_1: UMAP of the unsupervised VAE latent space]

This result is qualitatively similar to the results of other common unsupervised integration methods on this dataset, i.e. sort of OK, but not really perfect (Not just based on the UMAP - I actually measured this, I'll skip the details in this post). You can see that the control cells blend together better while distinct batch1 populations stay distinct. I was hoping though that the scANVI approach could beat this result.

I played with the model parameters a lot, but nothing seems to really improve the results. So I was wondering - am I doing this incorrectly? And is such supervised integration even possible with scVI/scANVI?

Thanks a lot! Adam

galenxing commented 4 years ago

Hi @krejciadam! Thank you for your interest in scVI!

I think scANVI might not be the tool for what you're looking to do. scANVI is really for labeling unannotated cells, e.g. if you had cell type labels for 10% of your cells and wanted to label the remaining 90%.

What I can suggest, which isn't very statistically sound and definitely not best practice but might work depending on your use case, is to train scVI on your combined dataset (Dataset A + Dataset B) without any batch correction (n_batch = 0). Get the latent representations of the Dataset A control cells and of the Dataset B control cells, then find the transformation between the centroids of the two groups. Then apply that transformation to your case cells.

The following resources might be useful:
https://www.nature.com/articles/s41592-019-0494-8 (specifically the part on δ vector estimation in the Appendix)
https://www.nature.com/articles/s41592-019-0494-8/figures/15
http://proceedings.mlr.press/v108/martens20a/martens20a.pdf
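For illustration, the centroid-based correction described in this comment could be sketched as follows. This is a minimal sketch on synthetic data, not scVI code: the function name `centroid_shift`, the arrays `latent`, `batch` and `is_control`, and the choice of a pure translation as the "transformation" are all assumptions made for the example.

```python
import numpy as np

def centroid_shift(latent, batch, is_control, source=1, target=0):
    """Translate all cells of the `source` batch by the offset between control centroids.

    Hypothetical helper: the offset (delta) is the target-batch control centroid
    minus the source-batch control centroid, both computed in latent space.
    """
    latent = np.asarray(latent, dtype=float)
    batch = np.asarray(batch)
    is_control = np.asarray(is_control, dtype=bool)

    src_ctrl = latent[(batch == source) & is_control]
    tgt_ctrl = latent[(batch == target) & is_control]
    if len(src_ctrl) == 0 or len(tgt_ctrl) == 0:
        raise ValueError("both batches need at least one control cell")

    delta = tgt_ctrl.mean(axis=0) - src_ctrl.mean(axis=0)
    corrected = latent.copy()
    corrected[batch == source] += delta  # controls and case cells of the source batch move together
    return corrected, delta

# Synthetic stand-in for a latent space learned without batch correction:
# batch 1 is batch 0 plus a constant offset in every latent dimension.
rng = np.random.default_rng(0)
n_cells, n_latent = 200, 10
offset = np.full(n_latent, 3.0)
latent = rng.normal(size=(2 * n_cells, n_latent))
batch = np.repeat([0, 1], n_cells)
is_control = np.tile(np.arange(n_cells) < n_cells // 2, 2)  # first half of each batch is control
latent[batch == 1] += offset

corrected, delta = centroid_shift(latent, batch, is_control)
```

After the shift, the control centroids of the two batches coincide by construction. Whether the case cells end up somewhere meaningful depends on the batch effect really being close to a constant offset in latent space, which is an assumption rather than something this sketch checks.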

Good luck! Galen

romain-lopez commented 4 years ago

Hi @krejciadam, thanks for your issue. Let me add a bit more insight to what Galen proposed.

First, there has been a bug in scANVI since May that we found and fixed only last week [1]. @galenxing what is the estimated timeline for the release of the new version?

> I would expect batch0_ctrl and batch1_ctrl cells to be aligned well, which they are not.

As Galen emphasized, the purpose of scANVI is mostly annotation. However, we did investigate scANVI's latent space in our applications. Still, there is a specific reason why scANVI, as you applied it here, will not work well. If you have a closer look at the manuscript, scANVI's generative model uses the labels to construct a Gaussian mixture model prior for the latent space. In your case, if you put the cells into only two Gaussian blobs, it makes sense that you don't get as good a result as with scVI or other methods, whose prior is more agnostic. To solve this problem, you might consider increasing "n_labels" (i.e. adding clusters to your GMM prior). It is not obvious that scANVI will make good use of labels that are never observed (this is a hard problem), but it's worth trying.
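To make the "two Gaussian blobs" argument concrete, here is a toy sketch. It is not scANVI's actual prior, which is learned by neural networks; it only compares how well a fixed, equal-weight mixture of unit-variance Gaussians describes points from four well-separated populations when the mixture has two components versus four. The function name, the cluster centers and all other values are assumptions made for the example.

```python
import numpy as np

def mixture_log_density(z, means):
    """Mean log-density of points `z` under an equal-weight mixture of unit-variance Gaussians."""
    z = np.asarray(z, dtype=float)
    means = np.asarray(means, dtype=float)
    n_dims = z.shape[1]
    # squared distance from every point to every component mean: (n_points, n_components)
    sq_dist = ((z[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_comp = -0.5 * sq_dist - 0.5 * n_dims * np.log(2 * np.pi) - np.log(len(means))
    # numerically stable log-sum-exp over the components
    top = log_comp.max(axis=1, keepdims=True)
    log_p = top[:, 0] + np.log(np.exp(log_comp - top).sum(axis=1))
    return log_p.mean()

rng = np.random.default_rng(0)
# One "control" population at the origin plus three distinct "other" populations.
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0], [6.0, 6.0]])
z = np.concatenate([c + rng.normal(size=(100, 2)) for c in centers])

# Two labels: "control" gets its own component, everything else shares one component.
two_blobs = np.stack([centers[0], centers[1:].mean(axis=0)])
# Four labels: one component per population.
four_blobs = centers

score_two = mixture_log_density(z, two_blobs)
score_four = mixture_log_density(z, four_blobs)
print(f"two components: {score_two:.2f}, four components: {score_four:.2f}")
```

In this toy setting the two-component mixture scores lower, because the single "other" component has to cover three separate populations. That is the intuition behind trying a larger number of labels.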

> And is such supervised integration even possible with scVI/scANVI?

We did try this type of scenario on a purified PBMC dataset from 10x in the scANVI manuscript [2]. Of course, this is a much easier setting because cell types are much easier to capture in PBMCs. However, in these semi-simulations we knew exactly the number of cell types in each dataset and we knew we were adding xx more cell types. Can you somehow label one of the two datasets? Or at least label all the overlapping cells by cell type? That would be much more useful information for scANVI and it would alter the latent space significantly less.

Since you already have all this code ready, I suggest you try these options. I'm curious, so please let us know what you find!

[1] https://github.com/YosefLab/scVI/issues/688
[2] Section "Harmonizing datasets with a different composition of cell types" of the scANVI manuscript

galenxing commented 4 years ago

Thanks @romain-lopez! The new release should go live tomorrow, but the scANVI fix is already in master, so @krejciadam you can also install from source and you'll get the changes.

krejciadam commented 4 years ago

Hi all! Thanks for your detailed replies and suggestions!

@galenxing Sure! I know scGen of course, and a while ago I tried a slightly modified version of what you suggested on a homebrew vanilla VAE (working on scaled data). The results were not bad, both in latent space and in the reconstructed original space. Definitely better than anything I've managed with any unsupervised tool. I have yet to try this with your implementation of the VAE. The authors of scGen actually have an interesting follow-up paper, in which they use a VAE with an MMD layer: https://arxiv.org/pdf/1910.01791.pdf I haven't tried this on my data yet, but my feeling is that the dataset as I described it here does not fit this concept either, because the label "other" in my case does not really correspond to a single population of cells.

@romain-lopez I understand the point now. I wanted to leverage the possibility to leave my diverse sets of "other" cells unlabeled, effectively only adding the information about which cells are control to an otherwise unsupervised scenario, but I see now this is not how it works. About adding more labeled populations: In this case, the control cells are a cell line, so most likely the only reasonable splitting/clustering is by cell cycle phase. I have datasets though where I could add more labels to the non-control cells, like:

dataset1 with control cells + cells treated with X different drugs
dataset2 with control cells + cells treated with yet another set of drugs

where the cells from each drug treatment are marked separately. So the only overlap between the sets of labels in dataset1 and dataset2 would be "control", while the effects of the different drug treatments might be either similar to or very different from one another, and possibly also similar to control (the drug does nothing). I'll experiment with this scenario a bit!
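As a rough sketch of the labeling scheme described here, the per-cell labels of the two datasets could be combined into one integer-coded vector in which only "control" is shared. The drug names and array names below are placeholders invented for the example, not real data.

```python
import numpy as np

# Hypothetical per-cell annotations (e.g. from cell hashing); names are placeholders.
dataset1_labels = np.array(["control", "drug_A", "drug_B", "control", "drug_A"])
dataset2_labels = np.array(["control", "drug_C", "drug_D", "drug_C", "control"])

all_labels = np.concatenate([dataset1_labels, dataset2_labels])
batch = np.repeat([0, 1], [len(dataset1_labels), len(dataset2_labels)])

# np.unique gives the sorted label vocabulary and an integer code for every cell.
vocabulary, codes = np.unique(all_labels, return_inverse=True)
n_labels = len(vocabulary)

# Labels present in both datasets: these are the only ones that can act as anchors.
shared = np.intersect1d(dataset1_labels, dataset2_labels)

print("vocabulary:", vocabulary.tolist())
print("shared labels:", shared.tolist())
```

Here `n_labels` is what would be passed as the number of labels to the model, and `shared` confirms that "control" is the only category observed in both batches.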

adamgayoso commented 4 years ago

Please feel free to reopen if you have any further questions or results!