Can we use batch as block when finding markers for each cluster?

genecell commented 5 years ago

Hi,

In scran's findMarkers(), users can set batch as the 'block' in the model, so marker identification will not be influenced by batch effect (https://rdrr.io/bioc/scran/man/findMarkers.html, I found this really useful). Can I do similar things in scanpy? Thanks!

I tried using anndata2ri to convert anndata to SingleCellExperiment, so as to still use scran's findMarker(), but always got this error:

ValueError: Converting pandas "Category" series to R factor is only possible when categories are strings.

I have checked several columns in my adata.obs, but cannot find the cause.

Thanks in advance, BP

LuckyMD commented 5 years ago

Hi,

Currently using covariates in sc.tl.rank_genes_groups() is not implemented. In the Wilcoxon and t-test versions this is also not possible. However, in logistic regression this could be added. As a coarse approximation you could correct for batch using sc.pp.combat() and then use the corrected data instead of adata.raw (which is the default) to calculate marker genes. However, generally I would not recommend performing statistical analysis on batch-corrected data for other tests.

Regarding your anndata2ri error... you could also check adata.var columns to see if any are categorical, but numeric.

genecell commented 5 years ago

Hi @LuckyMD,

Thank you for your rapid reply! In mnnCorrect's paper, the authors claimed that ComBat cannot be applied to some single-cell RNA sequencing data, since there are always multiple different cell types in each dataset, how do you think about that? Maybe, ComBat cannot handle well with the cases where different cell types are influenced by the batch effect in different ways or levels. I am afraid that batch effects are not accurately corrected, and I am still puzzled about which method may give better results, i.e., calculating marker genes basing on batch-corrected data or including batch as a covariate in the raw data. (Is there any available paper discussing this problem?)

In addition, I will check adata.var soon.

Thanks, BP

LuckyMD commented 5 years ago

Hi @genecell,

We have a review paper on current best-practices in scRNA-seq analysis which is coming out soon in Molecular Systems Biology that discusses this a bit. The issue with batch correction in scRNA-seq data isn't that batch affects different cell types differently, but rather that if cell type compositions change between batches, then transcriptional differences between the cell types that differ between the batches confound the technical batch effect estimation. So you end up correcting for more than just the technical effect. This means that you can use Combat if the cell type compositions are expected to be similar between batches. Indeed, ComBat is shown to outperform MNN for simple batch correction scenarios (kBet paper).

Inspite of the above argument, the better way to do things is definitely to include batch as a covariate. That way you don't underestimate your background variance. In the case of marker gene detection, this is not quite so problematic as:

It is an easy problem, as cell-type differences tend to be very pronounced so you should always detect a signal even with non-optimal methods.
The p-values you calculate from marker gene detection are inflated anyway and therefore not meaningful.

We discuss the above points in our manuscript. I'm not aware whether using corrected data for differential expression testing is discussed anywhere else though. If you email me, I could forward you a copy of the manuscript, but it should be available in MSB in the next weeks. The issue with inflated p-values is also discussed is a few other places like here.

genecell commented 5 years ago

Hi @LuckyMD,

Thank you for your detailed reply! Your explanation of the issue with batch correction in scRNA-seq data is very straightforward. In kBET paper, the performance of ComBat in simple batch correction scenarios is impressive. (I have once used kBET, but found it very slow.)

In addition, with your help, I have solved the problem of using scran's findMarkers() function:

tmp_cluster=adata.obs['leiden'].astype(int)

%%R -i tmp_cluster -i adata -o tmp_allMarkers
tmp_allMarkers<-scran::findMarkers(adata,clusters=tmp_cluster,block=adata$batch,direction="up")
tmp_allMarkers<-as.list(tmp_allMarkers)

LuckyMD commented 5 years ago

You could even make that easier, by not passing tmp_cluster at all, but using pData(adata)$leiden directly in the function. But great that it works for you!

scverse / scanpy

Can we use batch as block when finding markers for each cluster? #691