pmelsted / AM_2024

Analysis for the batch correction paper
BSD 2-Clause "Simplified" License

scVI change corrected counts for tasks #1

Closed: canergen closed this issue 6 months ago

canergen commented 7 months ago

Hi, thanks for putting the benchmark together. I'm happy to have a look at why the results for integration are so different from e.g. OpenProblems (https://openproblems.bio/results/batch_integration_embed/). It might help to use a more recent version of scvi-tools (the one used in the environment is very outdated). I share your concerns with the presented results, but I'm not certain why the performance is this poor; the code looks correct.

However, I'm mainly posting because the use of corrected counts is misplaced for scVI. We would highly recommend using a posterior predictive sample for the DE tests performed here. Those are still imputed, but they are samples from the generative distribution rather than the mean of the generative distribution. If we recommend `get_normalized_expression` for this anywhere in our tutorials or documentation, it would be helpful to get that as feedback.

```python
pp_counts = model.posterior_predictive_sample(
    model.adata,
    n_samples=1,  # makes it most comparable with the sparsity of the input data
)
```
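For illustration, a minimal sketch of how such a sample might be fed into a downstream DE test. Assumptions not in the original post: the return value is coerced to a dense array (the container type varies across scvi-tools versions), the layer name `pp_counts` and the `cell_type` grouping are hypothetical, and scanpy's Wilcoxon test stands in for whatever DE test the benchmark actually runs.

```python
import numpy as np
import scanpy as sc

# Coerce the sample to a dense array; posterior_predictive_sample may
# return a dense or a sparse container depending on the scvi-tools version.
pp = pp_counts.todense() if hasattr(pp_counts, "todense") else pp_counts

adata = model.adata.copy()
adata.layers["pp_counts"] = np.asarray(pp)

# Run the DE test on the sampled counts instead of corrected counts.
adata.X = adata.layers["pp_counts"]
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.tl.rank_genes_groups(adata, groupby="cell_type", method="wilcoxon")
```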

Nusob888 commented 6 months ago

@canergen I am keen to see what the outcome is. Could you keep this thread updated?

Am I right that the design is quite different between the OpenProblems benchmark and this one?

The OpenProblems benchmark seems to use actual dataset batches, whereas this one uses a single dataset split into artificial batches.

I agree with the authors that any integration should not introduce changes in cluster assignments if all “synthetic” batches are devoid of any true batch effects. The only caveat I see with using the datasets here is that the cell numbers per batch would be quite small, so could scVI be underperforming due to tiny train/validation splits?
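One way to probe that question would be to vary the `train_size` argument of `SCVI.train` and compare the resulting embeddings. A rough sketch, assuming `adata.obs["batch"]` holds the artificial batch label; the split sizes, epoch count, and comparison step are placeholders:

```python
import scvi

# Sketch: check sensitivity of the latent space to the train/validation split.
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")

latents = {}
for train_size in (0.5, 0.75, 0.9):
    model = scvi.model.SCVI(adata)
    model.train(train_size=train_size, max_epochs=200)
    latents[train_size] = model.get_latent_representation()

# Compare latents[...] across split sizes, e.g. via neighbourhood overlap.
```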

Also, I'm curious to know whether the authors considered using two identical cellxgene matrices as artificial batches. The random sampling process would still generate differences in cell composition that can be considered a true sampling “batch” effect. Overly harsh integration methods such as Seurat and Harmony may therefore perform better purely in this scenario, but may do less well conceptually when over-integration isn't desired, e.g. batches representing different tissue sources.
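For concreteness, a minimal sketch of that construction: duplicate one matrix, subsample cells independently in each copy so that only sampling-driven composition differences remain, and concatenate with an artificial batch label. The subsampling fraction and batch names are arbitrary choices.

```python
import numpy as np
import anndata as ad

rng = np.random.default_rng(0)

def subsample(a, frac=0.8):
    # Randomly drop cells so each copy differs only by sampling.
    idx = rng.choice(a.n_obs, size=int(frac * a.n_obs), replace=False)
    return a[idx].copy()

# Two copies of the same cell-by-gene matrix, independently subsampled,
# concatenated with an artificial batch label.
adata_sim = ad.concat(
    [subsample(adata), subsample(adata)],
    label="batch",
    keys=["batch1", "batch2"],
    index_unique="-",
)
```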

baroona commented 6 months ago

@canergen Thank you for your comments. Is there any published material on the updates made in newer versions of scVI compared to the version we use?

On your points about the posterior predictive samples and fixing the seed: those are very important, and we will update our workflow accordingly and update the results to reflect any changes we find.

@Nusob888 Thank you for your comments. Our mouse brain data, for example, has 20k cells. We think the data we chose is representative of the types of data that researchers apply these methods to. At one point in our process I did use two identical matrices. It introduced a lot of strange behaviour and could be criticised as a simulation strategy in various ways, so we didn't pursue the idea further.

canergen commented 6 months ago

Hi @baroona. Our release notes are at https://docs.scvi-tools.org/en/stable/release_notes/index.html. One of the changes is that in scVI<1.0.0 the seed is fixed (so you did everything correctly). I wouldn't expect strong variation in results when updating scvi-tools, but it generally makes your life and ours easier to stick to a recent version.

We made the point about varying seeds in our Colab notebook (https://colab.research.google.com/drive/15k5kahlT5qSdXNUdkqeEj6je8-nbR_Vh?usp=sharing) to put conservation of nearest neighbours (NN) into perspective. Adding a random batch_key might have an impact (especially when using encode_covariates=True, as in some of our notebooks, e.g. https://docs.scvi-tools.org/en/stable/tutorials/notebooks/scrna/scarches_scvi_tools.html). However, the effect of retraining with another random seed is larger. I tried encoding covariates in our Colab notebook and found no strong effects (results not included in the notebook to keep it simple). It is generally helpful to train scVI with large datasets, and this increases conservation of NN across different random seeds (I tried out scvi.dataset.retina()). Still, I think it's fair to use 5k cells in the benchmarking, and I don't think the process of adding a random batch key produces biased results.

The major bias comes from the fact that the same PCA is used for e.g. Harmony and the raw data. This favours Harmony (when it does no correction, you get perfect conservation). We are more in favour of validating conservation of biological entities like cell types (with the caveat that ground truth for biological data is difficult). However, to stay close to the current benchmarking, you could compare all methods against a projection onto PCs computed on a reference dataset, or use a factor model such as NMF for the reference embedding (see the sketch below). Otherwise, the message reduces to: if you want a result close to PCA, use a method that relies on the same PCA. For differential expression, it's more surprising that it works that well.

Benchmarks influence how people perform their analyses, and we try to keep method use as sound as possible. You have our strong support that count data shouldn't be corrected for downstream analysis. @Nusob888: We summarized our ideas in a Twitter thread (https://twitter.com/_canergen/status/1772190381871907122) as well as in the accompanying notebook.

Maybe we keep this issue open until get_normalized_expression is fixed? You can also go ahead and close it.
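A rough sketch of the reference-projection idea mentioned above: fit PCA on a reference dataset only, then project the benchmark data onto those components so the reference embedding does not depend on the method being evaluated. The preprocessing steps, the dense conversion, and the number of components are assumptions, not the benchmark's actual pipeline.

```python
import numpy as np
import scanpy as sc
from sklearn.decomposition import PCA

def _dense(X):
    # AnnData .X may be sparse or dense depending on how the data were loaded.
    return X.toarray() if hasattr(X, "toarray") else np.asarray(X)

def reference_pca_embedding(adata_ref, adata_query, n_comps=50):
    # Fit PCA on the reference only, then project the benchmark (query) data
    # onto those components over the shared gene set.
    ref, query = adata_ref.copy(), adata_query.copy()
    for a in (ref, query):
        sc.pp.normalize_total(a, target_sum=1e4)
        sc.pp.log1p(a)
    shared = ref.var_names.intersection(query.var_names)
    pca = PCA(n_components=n_comps).fit(_dense(ref[:, shared].X))
    return pca.transform(_dense(query[:, shared].X))
```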