scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io

add multibatchPCA approach? #1289

Open bjstewart1 opened 4 years ago

bjstewart1 commented 4 years ago

It may be useful to adopt a PCA option similar to multiBatchPCA in the R batchelor package. This approach is useful when batch sizes are imbalanced and PCA is conducted across a merged experiment. It is pretty slow in R.

From their documentation:

"Our approach is to effectively weight the cells in each batch to mimic the situation where all batches have the same number of cells. This ensures that the low-dimensional space can distinguish subpopulations in smaller batches. Otherwise, batches with a large number of cells would dominate the PCA, i.e., the definition of the mean vector and covariance matrix. This may reduce resolution of unique subpopulations in smaller batches that differ in a different dimension to the subspace of the larger batches."

giovp commented 4 years ago

Hi, thanks for the suggestion! Are you referring to this function? It sounds a bit like ingest but with multiple datasets. Pinging @Koncopd to see what his take is on this.

bjstewart1 commented 4 years ago

> Hi, thanks for the suggestion! Are you referring to this function? It sounds a bit like ingest but with multiple datasets. Pinging @Koncopd to see what his take is on this.

Yes, that's the function. I think it is doing something similar to ingest.

I think this sort of batch-balanced PCA could be a useful addition where batches are very uneven in terms of numbers of cells.

Koncopd commented 4 years ago

ingest uses the PCA only from a reference batch, so it is a bit different.
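For reference, the usual ingest pattern makes that asymmetry explicit; a minimal sketch (adata_ref / adata and the 'cell_type' column are placeholders):

```python
import scanpy as sc

# PCA, neighbors and UMAP are computed on the reference batch only
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)

# the query batch is then projected into the reference's existing embedding
sc.tl.ingest(adata, adata_ref, obs='cell_type')
```

So the query batch never influences the PCA itself.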

Does this multiBatchPCA work well?

bjstewart1 commented 4 years ago

Like you say, the difference between this and ingest is joint PCA calculation vs asymmetric batch integration.

multiBatchPCA is the first step of the fastMNN function, which I have found in some cases yields very sensible batch correction results. It would be awesome to see multiBatchPCA +/- fastMNN available in scanpy. I am aware of the Python implementation of mnncorrect, but I think this still operates on expression values rather than a PCA representation (correct me if I am wrong...).

Without going all the way to batch correction, multiBatchPCA is useful where different experiments have very different numbers of cells.
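For what it's worth, as far as I understand the scanpy wrapper around mnnpy, it is called on the expression matrices directly, roughly like this (a sketch; the per-batch AnnData names are placeholders):

```python
import scanpy.external as sce

# mnnpy corrects the expression values themselves, one AnnData per batch,
# rather than correcting a PCA representation as fastMNN does.
mnn_out = sce.pp.mnn_correct(adata_batch1, adata_batch2)

# the corrected expression data in mnn_out would then go through the usual
# PCA / neighbors / UMAP steps downstream.
```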

r-reeves commented 4 years ago

Hi all,

I am trying to use Scanpy to integrate multiple scRNA-seq samples (~20) so that I can look at RNA velocity with scVelo, and I want to use MNN because I previously got good batch effect removal in Monocle using MNN.

Is it true, as stated above, that the current implementation of mnncorrect in Scanpy only operates on expression values? I have run through a Scanpy MNN tutorial provided by NBI Sweden. The results are improved, but it doesn't appear to work as well as in Monocle: some separation by batch remains.

I'm wondering what the difference might be: whether it could be due to the difference in PCA (multi-batch) or to the actual MNN / batch effect removal step. Alternatively, I could use the corrected expression matrix and add the UMAP coordinates/clusters from Monocle, although I wonder if this is advisable.

If you have any info, please let me know, or tell me if I should raise a separate issue.

giovp commented 4 years ago

What's the status of this, @Koncopd @Mirkazemi?

Koncopd commented 3 years ago

Soon (I hope).

LuckyMD commented 3 years ago

Hi @r-reeves, maybe this is indeed a separate issue. mnnpy does indeed work on the gene expression matrix, not on a low-dimensional embedding like FastMNN (which is what I assume you might have been using?). You could try Scanorama, a method similar to FastMNN that uses a sped-up algorithm and, instead of iterative merging of batches, an approach they call "panoramic stitching". It has performed quite well in our benchmark of data integration methods, and it is in the scanpy ecosystem, so it should work seamlessly in a Scanpy workflow.
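For example, a Scanorama run in scanpy could look roughly like this (a sketch assuming the scanorama package is installed and a 'batch' column in .obs):

```python
import scanpy as sc
import scanpy.external as sce

sc.pp.pca(adata)                                              # Scanorama integrates in PCA space
adata = adata[adata.obs.sort_values('batch').index].copy()   # cells of a batch should be contiguous
sce.pp.scanorama_integrate(adata, key='batch')                # integrated embedding goes to .obsm['X_scanorama']

# build the graph / UMAP on the integrated embedding
sc.pp.neighbors(adata, use_rep='X_scanorama')
sc.tl.umap(adata)
```

The integration happens on the PCA representation (like FastMNN), not on the expression values.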

All of this being said, you will only get an integrated graph structure from this for scVelo, which may help a little, but it won't remove the batch effect from the RNA velocity calculation itself. scVelo doesn't currently have any batch removal in its pipeline; it is quite difficult to add because scVelo works directly from the normalized count data and fits a model to these. @VolkerBergen has been thinking a bit about how to perform batch correction in an scVelo model; maybe he could chime in, or you could post an issue in the scvelo repo.

r-reeves commented 3 years ago

Hi @LuckyMD, thank you for the fast reply. Yes to FastMNN: as I understand from using align_cds, when you specify explicitly what you want to remove (e.g. sample-sample variation), it calls FastMNN from batchelor. Thanks for the recommendation; I will check out Scanorama, I have been meaning to read the review on integration techniques.

> you will only get an integrated graph structure from this for scVelo, which may help a little, but it won't remove the batch effect from the RNA velocity calculation itself. scVelo doesn't currently have any batch removal in its pipeline; it is quite difficult to add because scVelo works directly from the normalized count data and fits a model to these.

Ah okay, I misunderstood the process then; my understanding was that some of the MNN correction would be carried over when performing the velocity analysis. I will check out the scvelo forum for info on comparing samples.

Thank you.