sequence depth issues in algorithms of LSI, LSA, LDA, PCA for dimension reduction

wangmeijiao commented 4 years ago

Hi all,

The sequencing depth of single cells can be an important factor that hinders true discoveries such as cell type identification and pseudo-time path calculation. As far as I know, many scATAC tools (cistopic with LDA, signac with LSI, cellranger-atac with LSA, episcanpy with PCA) have difficulty deducing a true dimension-reduced clustering space without pre-filtering low-depth cells (correct me if I am missing something). However, in some cases cells may genuinely have fewer ATAC fragments (or lower UMI counts) for biological reasons, so precisely distinguishing those cells from broken cells is a real challenge. There is very little information about this issue (one mention is here: https://github.com/timoast/signac/issues/106), and I think it is an important question that many researchers will be interested in. In my case, I compared the UMAP plots before and after removing the first four dimensions (the first dimension is indeed correlated with sequencing depth; I excluded the first four dims to be safe). The shape of the scatterplot looks similar, and the positions of the low-depth cell clusters (not too low, at least 3k fragments per cell after prefiltering) do not change much. To summarize my question: how should cells with low depth be handled to avoid false-positive results while keeping real cells? Any suggestions are welcome.

wangmeijiao commented 4 years ago

Add: I can upload result figures if needed.

timoast commented 4 years ago

Hi, you are correct that sequencing depth is a common source of significant technical variation between cells in scATAC-seq experiments. A few points to note:

You removed the first four components. This is definitely not recommended. Often the first component captures sequencing depth. Remember that these components are orthogonal to each other and ordered by the variance they explain. If the first component is highly correlated with sequencing depth, it's unlikely then that the second component will be too. By removing the first four components you're throwing away a lot of important information.
You raise a good point that there may be cells with low depth that are otherwise high-quality, and there may be biological reasons for less open chromatin in those cells. This is one reason we recommend looking at other QC metrics like the TSS enrichment score, nucleosome banding score, and the fraction of reads in genomic blacklist regions when filtering cells, rather than just looking at sequencing depth.
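
The first point above can be checked numerically before deciding which components to drop. The following is a minimal, library-agnostic sketch in plain Python, not the Signac API: it correlates each component of a reduced embedding with per-cell depth and excludes only the components whose absolute correlation exceeds a cutoff. The toy `embedding`, the `depth` values, and the `THRESHOLD` of 0.75 are all assumptions made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def depth_correlation(embedding, depth):
    """Correlate every component (column) of a cells x components matrix with depth."""
    n_components = len(embedding[0])
    return [pearson([cell[k] for cell in embedding], depth)
            for k in range(n_components)]


# Toy data (assumption): six cells, three components, fragments per cell.
depth = [3000, 5000, 8000, 12000, 20000, 30000]
embedding = [
    [0.31, 0.5, 0.3],
    [0.49, -0.5, 0.1],
    [0.82, 0.5, -0.4],
    [1.18, -0.5, 0.4],
    [2.03, 0.5, -0.2],
    [2.97, -0.5, -0.2],
]

THRESHOLD = 0.75  # hypothetical cutoff on |correlation|, not a recommended value

cors = depth_correlation(embedding, depth)
drop = [k for k, r in enumerate(cors) if abs(r) > THRESHOLD]
keep = [k for k in range(len(cors)) if k not in drop]

print("depth correlation per component:", [round(r, 2) for r in cors])
print("components to exclude:", drop)
print("components to keep:", keep)
```

In this toy case only the first component tracks depth, so it is the only one excluded; the remaining components are kept even though their correlations are not exactly zero.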

timoast commented 4 years ago

Linking here to your other issue in episcanpy: https://github.com/colomemaria/epiScanpy/issues/60

wangmeijiao commented 4 years ago

Hi Tim @timoast,

Thanks for pointing out that removing the first four components is not necessary; I was perhaps too cautious in trying to avoid false-positive discoveries. The correlation coefficients of the four components with sequencing depth are 0.89, 0.17, 0.16 and -0.39. As you recommended, I now filter only the first component and have rerun the analysis. 1) Would it be safe to conclude that the final dimension reduction result and clustering UMAP result are of little relationship (if any) to cell sequence depth? 2) As for the cells with low depth but are otherwise high-quality, what would you suggest these cells could be? (for example, aging cells). Can you share some experience with me (and other researchers will certainly be interested) how to start to prove it (for example, to prove the aging status with some marker genes)?

I'm excited to discuss this with you and look forward to your response.

timoast commented 4 years ago

> Would it be safe to conclude that the final dimension reduction result and clustering UMAP result are of little relationship (if any) to cell sequence depth?

You can assess this by plotting the total counts per cell on the UMAP, for example: FeaturePlot(object, 'nCount_peaks').

> as for the cells with low depth but are otherwise high-quality, what would you suggest these cells could be? (for example, aging cells).

I think this will depend on what tissue/cell types you're looking at; it's possible that certain cell types will have less overall open chromatin. I'm not aware of any correlation between cell age and overall chromatin accessibility.

> Can you share some experience with me (and other researchers will certainly be interested) how to start to prove it (for example, to prove the aging status with some marker genes)?

One thing you can look at is the QC metrics I mentioned above (TSS enrichment score, nucleosome banding pattern, etc.). If these metrics indicate the cells are high-quality, and if the cells contain cell-type-specific patterns of open chromatin, then that would provide evidence that they are real cells rather than an artefact. You could also try to see if those cells have a match in a corresponding scRNA-seq dataset using other methods we have developed: https://satijalab.org/signac/articles/pbmc_vignette.html#integrating-with-scrna-seq-data Proper experimental validation would be more difficult, perhaps involving cloning enhancers specific to the cell type and using them to drive activation of a marker gene in vivo (as one example).
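
The idea of judging cells on several QC metrics rather than depth alone can be sketched as follows. This is an illustrative plain-Python example, not Signac code: the `CellQC` record, the `label` helper, the example barcodes, and every threshold value are hypothetical and would need to be chosen from the distributions in a real dataset.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    barcode: str
    n_fragments: int
    tss_enrichment: float
    nucleosome_signal: float
    blacklist_fraction: float


# Hypothetical thresholds, for illustration only.
MIN_TSS_ENRICHMENT = 2.0
MAX_NUCLEOSOME_SIGNAL = 4.0
MAX_BLACKLIST_FRACTION = 0.05
LOW_DEPTH_FRAGMENTS = 5000


def passes_quality(cell):
    """Quality check that deliberately ignores sequencing depth."""
    return (cell.tss_enrichment >= MIN_TSS_ENRICHMENT
            and cell.nucleosome_signal <= MAX_NUCLEOSOME_SIGNAL
            and cell.blacklist_fraction <= MAX_BLACKLIST_FRACTION)


def label(cell):
    """Separate low-depth cells that look healthy from cells that fail QC."""
    if not passes_quality(cell):
        return "fails QC"
    if cell.n_fragments < LOW_DEPTH_FRAGMENTS:
        return "low-depth, otherwise high-quality"
    return "high-quality"


# Made-up cells for demonstration.
cells = [
    CellQC("cell_a", 3200, 5.1, 0.8, 0.01),   # low depth, good metrics
    CellQC("cell_b", 3100, 0.9, 6.5, 0.12),   # low depth, poor metrics
    CellQC("cell_c", 18000, 6.3, 0.6, 0.01),  # high depth, good metrics
]

labels = {cell.barcode: label(cell) for cell in cells}
for barcode, status in labels.items():
    print(barcode, "->", status)
```

Cells that land in the "low-depth, otherwise high-quality" group are the ones worth following up with cell-type-specific peaks or a matched scRNA-seq dataset, as described above.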

wangmeijiao commented 4 years ago

Hi Tim @timoast ,

I have finished calculating the nucleosome signal (and plotting the banding pattern) and the TSS enrichment score.

For the nucleosome signal, which I understand to be the ratio of mononucleosomal to nucleosome-free fragments, I took the threshold from the upper whisker of the distribution (1.6). For the TSS enrichment score, I used a threshold of 1. After marking the cells that do not pass these thresholds, I found that they are not particularly enriched in the clusters with low sequencing depth (runUMAP with dims 2-30, excluding the first component). However, some clusters do show low sequencing depth. Here is my question: can I conclude that these low-depth cells are real cells and may be biologically meaningful?
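
One way to make the "upper whisker" threshold and the per-cluster enrichment check concrete is sketched below in plain Python. It assumes the usual Tukey boxplot convention (the whisker is the largest value not exceeding Q3 + 1.5 × IQR); quartile conventions differ slightly between tools, so the number may not match a given plotting function exactly. The `cells` list, cluster names, and signal values are invented for illustration.

```python
import statistics
from collections import defaultdict


def upper_whisker(values):
    """Largest observation that does not exceed Q3 + 1.5 * IQR (Tukey convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    fence = q3 + 1.5 * (q3 - q1)
    return max(v for v in values if v <= fence)


def flagged_fraction_by_cluster(cells, threshold):
    """Fraction of cells per cluster whose score exceeds the threshold."""
    total = defaultdict(int)
    flagged = defaultdict(int)
    for cluster, score in cells:
        total[cluster] += 1
        if score > threshold:
            flagged[cluster] += 1
    return {cluster: flagged[cluster] / total[cluster] for cluster in total}


# Toy data (assumption): (cluster, nucleosome signal) per cell.
cells = [
    ("c1", 0.4), ("c1", 0.5), ("c1", 0.6),
    ("c2", 0.7), ("c2", 0.8), ("c2", 0.9),
    ("c3", 1.0), ("c3", 1.1), ("c3", 3.0),
]

threshold = upper_whisker([score for _, score in cells])
fractions = flagged_fraction_by_cluster(cells, threshold)

print("upper whisker threshold:", threshold)
for cluster, fraction in sorted(fractions.items()):
    print(cluster, "fraction flagged:", round(fraction, 2))
```

If the flagged fraction in the low-depth clusters is no higher than elsewhere, that supports (but does not prove) the view that those clusters are not simply low-quality cells.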

Thanks again.

timoast commented 4 years ago

You could look at the TSS enrichment score and nucleosome signal score for those cells by plotting the scores on the UMAP as you've done for the total counts. You might find they are at the lower end of the cutoff value you chose. You can also check if they contain cluster-specific accessible peaks.

wangmeijiao commented 4 years ago

Thanks!

stuart-lab / signac

sequence depth issues in algorithms of LSI, LSA, LDA, PCA for dimension reduction #122