kharchenkolab / dropEst

Pipeline for initial analysis of droplet-based single-cell RNA-seq data
GNU General Public License v3.0

Should UMI correction be done before or after QC? #16

Closed ahy1221 closed 6 years ago

ahy1221 commented 6 years ago

After going through the vignette, I wonder whether the UMI correction phase should be done before or after QC. The reads_per_umi_per_cell from dropEst is already filtered. If I apply a manual filter again and rescue some cells that are not in reads_per_umi_per_cell, how can I get reads_per_umi_per_cell for these rescued cells?

Since the UMI count is the corrected count, and the paper mentions that

the proposed corrections can result in significantly higher expression profile correlation of cells belonging to the same cluster, compared to uncorrected molecular counts

do you have any specific gene examples in a known cell population (CD4 in T cells, for example) showing that such correction is important for accurate gene abundance quantification?

VPetukhov commented 6 years ago

Sorry, I missed the filtration part in the documentation. I'll describe it as soon as I get a chance. About your question: normally, the min_genes_after_merge threshold should be low enough to include all cells that you want to rescue. Thus ncol(holder$cm) is assumed to be larger than EstimateCellsNumber(Matrix::colSums(holder$cm_raw))$expected. Unfortunately, correcting UMIs in all cells from cm_raw is not efficient, as reads_per_umi_per_cell is the largest part of holder, and UMI correction takes a really long time for large datasets. So, since some features for the quality score are estimated from cm_raw, you need to run quality estimation prior to UMI correction and then correct UMIs only in the cells you keep.
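The ordering described above (score and filter cells on uncorrected counts first, then run the expensive UMI correction only on the retained cells) can be sketched with toy data. This is a minimal illustration, not dropEst code: the nested-dictionary layout, the `min_umis` threshold, and the Hamming-distance-1 merge are all assumptions made for the example, and the merge is a simple stand-in rather than the correction algorithm dropEst actually uses.

```python
from typing import Dict, Set

# Hypothetical layout: cell -> gene -> UMI sequence -> number of reads.
# It only mimics the idea of reads_per_umi_per_cell; it is not the real structure.
ReadsPerUmi = Dict[str, Dict[str, Dict[str, int]]]


def umis_per_cell(rpu: ReadsPerUmi) -> Dict[str, int]:
    """Uncorrected UMI totals per cell (analogous to column sums of a raw matrix)."""
    return {cell: sum(len(umis) for umis in genes.values())
            for cell, genes in rpu.items()}


def select_cells(totals: Dict[str, int], min_umis: int) -> Set[str]:
    """Assumed QC step: keep cells with at least min_umis uncorrected UMIs."""
    return {cell for cell, n in totals.items() if n >= min_umis}


def hamming1(a: str, b: str) -> bool:
    """True if two equal-length sequences differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def correct_umis(umis: Dict[str, int]) -> Dict[str, int]:
    """Placeholder correction: visit UMIs from most to least reads and fold each
    one into an already accepted UMI at Hamming distance 1, if there is one."""
    merged: Dict[str, int] = {}
    for umi, reads in sorted(umis.items(), key=lambda kv: (-kv[1], kv[0])):
        parent = next((m for m in merged if hamming1(m, umi)), None)
        if parent is None:
            merged[umi] = reads
        else:
            merged[parent] += reads
    return merged


def qc_then_correct(rpu: ReadsPerUmi, min_umis: int) -> ReadsPerUmi:
    """Run QC on uncorrected data, then correct UMIs only in the kept cells."""
    kept = select_cells(umis_per_cell(rpu), min_umis)
    return {cell: {gene: correct_umis(umis) for gene, umis in rpu[cell].items()}
            for cell in sorted(kept)}


toy = {
    "cellA": {"CD4": {"AAAA": 10, "AAAT": 1, "CCCC": 4}, "CD3E": {"GGGG": 5}},
    "cellB": {"CD4": {"TTTT": 1}},
}
corrected = qc_then_correct(toy, min_umis=2)
print(corrected)
```

In this toy run cellB is dropped before any correction work is spent on it, which is the efficiency argument made above. The filter has to be loose enough that every cell you might want to rescue survives it.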

do you have any specific gene examples in a known cell population (CD4 in T cells, for example) showing that such correction is important for accurate gene abundance quantification?

We tried a lot of different ways to compare the correction methods from the biological point of view. The problem is that we don't know the real answer, and it's hard to tell which approach is right. Can you please describe your idea in more detail? We're preparing answers to the reviewers, and we would be happy to add another demonstration if it is able to distinguish between correction algorithms. The question is: what do we expect to observe for a single gene after the correction? Even without it, we expect this gene to be overexpressed across the corresponding cell population. But its expression level would be different for different correction algorithms. How do we determine which answer is better?

ahy1221 commented 6 years ago

Thank you very much! In the setting of my own experiment, I am wondering whether the "dropout" of some genes is just due to wrong estimation of the UMI count matrix. I would see such cases as an extreme form of underestimation. Do you have some examples of these cases? I think that would convince people that dropEst is really necessary and is better than, not just as good as, other tools.
In my example, it is not easy to see CD4 expression in the T cell population from 10x data. We know from Smart-seq2 data that CD4 has a mean expression level of TPM > 200 in CD4-positive T cells. I might be wrong, because this is the first time I have looked at 10x data, but to me it is hard to imagine that this gene would drop out in droplet-based sequencing.

Yao He
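As a rough back-of-the-envelope check on the dropout question above, one can ask how often a gene at about 200 TPM would show zero UMIs purely from sampling depth, before any UMI correction issue comes into play. The sketch below rests on strong assumptions that are not from the thread: that the Smart-seq2 TPM value can be read as the gene's fraction of captured molecules in a droplet experiment (it cannot exactly, since TPM is length-normalized and UMI counts are not), that counts are Poisson distributed, and that the per-cell UMI totals listed are plausible depths.

```python
import math


def expected_umis(tpm: float, total_umis: int) -> float:
    """Expected UMI count for a gene, assuming TPM/1e6 is its molecule fraction."""
    return tpm / 1_000_000 * total_umis


def zero_probability(tpm: float, total_umis: int) -> float:
    """Probability of observing zero UMIs under a Poisson sampling assumption."""
    return math.exp(-expected_umis(tpm, total_umis))


# Assumed per-cell sequencing depths (total UMIs per cell), chosen for illustration.
for total in (1000, 2000, 5000, 10000):
    lam = expected_umis(200, total)
    p0 = zero_probability(200, total)
    print(f"total UMIs={total:>6}  expected CD4 UMIs={lam:.2f}  P(zero)={p0:.3f}")
```

Under these assumptions, a cell with 2,000 UMIs has an expected count of 0.4 for such a gene, so roughly two thirds of cells would show zero counts from sampling alone. This suggests that low depth, rather than UMI miscounting, could explain much of the apparent dropout, although the sketch cannot separate the two effects.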