scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

diffusion map and batch effect correction #168

Closed wangjiawen2013 closed 6 years ago

wangjiawen2013 commented 6 years ago

Can you extend the scanpy functions so that I can show gene expression levels on the plot generated by sc.pl.diffmap, as monocle2 does?

Also, at which step should I run MNN batch effect correction? Is it still necessary to regress out some variables (n_counts, percent_mito, cell cycle, etc.) when I run MNN?

falexwolf commented 6 years ago

You can already do the first by passing color=genename to pl.diffmap.

I don't have much experience with mnn_correct but, if cell cycle is a problem, you can definitely still regress this out; for instance, on a per-batch level.
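
As an illustration of what "regress out on a per-batch level" can mean, here is a minimal NumPy sketch. It is not scanpy's regress_out implementation; the function name regress_out_per_batch, the synthetic data, and the single cell-cycle score are assumptions made for the example.

```python
import numpy as np

def regress_out_per_batch(X, covariate, batch):
    """Return residuals of X after a separate linear fit on `covariate` in each batch."""
    X = np.asarray(X, dtype=float)
    residuals = np.empty_like(X)
    for b in np.unique(batch):
        rows = batch == b
        # Design matrix: intercept plus the covariate (e.g. a cell-cycle score).
        # The intercept also removes each batch's per-gene mean.
        design = np.column_stack([np.ones(rows.sum()), covariate[rows]])
        coef, *_ = np.linalg.lstsq(design, X[rows], rcond=None)
        residuals[rows] = X[rows] - design @ coef
    return residuals

# Synthetic data (hypothetical): 2 batches, 5 genes, all genes driven by the score.
rng = np.random.default_rng(0)
n_cells, n_genes = 120, 5
batch = np.repeat(np.array(["a", "b"]), n_cells // 2)
cc_score = rng.normal(size=n_cells)
X = rng.normal(size=(n_cells, n_genes)) + np.outer(cc_score, np.ones(n_genes))

X_corrected = regress_out_per_batch(X, cc_score, batch)
```

With real data, a comparable effect could presumably be obtained by splitting the AnnData object by batch and calling sc.pp.regress_out on each part.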

falexwolf commented 6 years ago

Ah, sorry, maybe this wasn't clear. For that to work, you need to set the .raw attribute of AnnData at some point.

adata.raw = adata  # at the point during preprocessing at which you wish to store a copy for visualization and differential testing

You can then set use_raw=False in several functions if you want to access .X instead.

wangjiawen2013 commented 6 years ago

The scanpy documentation for MNN correction (http://scanpy.readthedocs.io/en/latest/api/scanpy.api.pp.mnn_correct.html) says: "Be reminded that it is not advised to use the corrected data matrices for differential expression testing." However, Haghverdi Laleh (who proposed the MNN correction strategy, https://www.nature.com/articles/nbt.4091) says "MNN correction improves differential expression analyses, After batch correction is performed, the corrected expression values can be used in routine downstream analyses such as clustering prior to differential gene expression identification" in the Nature Biotech paper. So I am a little confused. We have compared several correction methods, namely regress_out, combat, MNN and MultiCCA (used by seurat), and the results show that MNN and CCA perform better than regress_out and combat.

wangjiawen2013 commented 6 years ago

MNN and CCA are very useful when analyzing multiple single-cell libraries that have been merged together, because each library may be affected by batch effects.

LuckyMD commented 6 years ago

Hi. Maybe I can help a little as well.

Typically, batch correction or data integration methods are used to obtain a good clustering of the data. When it comes to differential testing, however, it is still unclear whether the corrected data can or should be used (no batch correction method is perfect, and any of them may overcorrect).

The standard strategy is to correct for batch, and for any other covariates you are not interested in, before clustering. Once you have the clusters, it is standard practice to go back to the raw data and use a differential testing algorithm that lets you account for batch and other technical covariates in the model (e.g. MAST).
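
To make "account for batch in the model" concrete, here is a toy sketch for a single gene. It uses plain ordinary least squares, not MAST (which fits a hurdle model), and the function name cluster_effect and the simulated data are assumptions for illustration. The cluster of interest is unevenly distributed across two batches, so a naive difference in means mixes the batch shift into the cluster effect.

```python
import numpy as np

def cluster_effect(y, in_cluster, batch):
    """OLS fit of y ~ 1 + in_cluster + batch; returns the cluster coefficient."""
    levels = np.unique(batch)
    # One indicator column per batch, dropping the first level as reference.
    batch_cols = [(batch == b).astype(float) for b in levels[1:]]
    design = np.column_stack([np.ones(len(y)), in_cluster.astype(float), *batch_cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

# Simulated uncorrected log-expression of one gene (hypothetical numbers):
# true cluster effect 1.0, batch "b" shifted by 2.0.
rng = np.random.default_rng(0)
batch = np.repeat(np.array(["a", "b"]), 1000)
# 80% of batch "a" but only 20% of batch "b" belongs to the cluster.
in_cluster = np.concatenate([np.arange(1000) < 800, np.arange(1000) < 200])
y = 1.0 * in_cluster + 2.0 * (batch == "b") + rng.normal(scale=0.5, size=2000)

naive = y[in_cluster].mean() - y[~in_cluster].mean()  # confounded by batch
adjusted = cluster_effect(y, in_cluster, batch)        # batch handled in the model
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```

The adjusted estimate is computed on the uncorrected values. The batch term in the design matrix does the correcting, which is the idea behind going back to the raw data for testing.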

wangjiawen2013 commented 6 years ago

@falexwolf In your Bioinformatics paper "destiny: diffusion maps for large-scale single-cell data in R", you show how to determine the optimal Gaussian kernel width and how to plot the eigenvalues of the first 100 diffusion components. Could you tell us how to do this with scanpy?

flying-sheep commented 6 years ago

@wangjiawen2013 that would be my paper, and I don’t think scanpy stores the eigenvalues after computing the diffusion map.

gokceneraslan commented 6 years ago

They should be stored in adata.uns['diffmap_evals'] according to https://github.com/theislab/scanpy/blob/master/scanpy/tools/dpt.py#L17

falexwolf commented 6 years ago

Yes, the eigenvalues are stored.

There is no need to choose a kernel width in Scanpy. Everything is done automatically. The only parameters are the number of neighbors and the kernel type (method in pp.neighbors).