kharchenkolab / dropEst

Pipeline for initial analysis of droplet-based single-cell RNA-seq data
GNU General Public License v3.0

Feasibility of including deduplicated alignments #109

Open ijhoskins opened 4 years ago

ijhoskins commented 4 years ago

I see that the dropEst program reports a matrix of counts for genes in the input GTF. Would it be feasible to support output of a deduplicated BAM as well? I am not interested in scRNA-seq counts but rather the ability of your pipeline to identify and deduplicate erroneous UMIs for other applications. I realize this may be out-of-scope but your pipeline appears to be the superior solution for determining UMI duplicate networks!

VPetukhov commented 4 years ago

> Would it be feasible to support output of a deduplicated BAM as well?

Unfortunately, it doesn't fit the workflow. Merging duplicated UMIs requires a lot of R calls, but all BAM-related functionality is in C++. So the simplest solution would be to run the UMI correction in R, save the list of CB+Gene+UMI+CorrectedUMI to a file, and then have a C++ script that parses this file and writes out a corrected BAM. In my experience, writing such a C++ script is generally faster than waiting for Python to do the same :) You basically need to take the BamTools library, iterate over the BAM, update the tags and save the result to another BAM. Something like ~50 lines of code. Here is an example of iteration over a BAM, and here is another one for editing tags.
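To make the suggestion above a bit more concrete, here is a minimal sketch of the lookup half of such a script, using only the standard library. The file layout (tab-separated CB, Gene, UMI, CorrectedUMI, one record per line, no header) and the names `CorrectionTable`, `make_key`, `load_corrections` and `corrected_umi` are assumptions made for illustration, not something dropEst defines. The BamTools part is only outlined in the trailing comment, and the tag names to read and edit depend on how the BAM was produced.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Maps "CB<TAB>Gene<TAB>UMI" to the corrected UMI (hypothetical layout).
using CorrectionTable = std::unordered_map<std::string, std::string>;

// Builds the lookup key from cell barcode, gene and raw UMI.
inline std::string make_key(const std::string& cb, const std::string& gene,
                            const std::string& umi) {
    return cb + '\t' + gene + '\t' + umi;
}

// Reads tab-separated lines "CB Gene UMI CorrectedUMI" (assumed format).
// Lines with fewer than four fields are skipped.
CorrectionTable load_corrections(std::istream& in) {
    CorrectionTable table;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string cb, gene, umi, corrected;
        if (std::getline(fields, cb, '\t') && std::getline(fields, gene, '\t') &&
            std::getline(fields, umi, '\t') && std::getline(fields, corrected)) {
            table[make_key(cb, gene, umi)] = corrected;
        }
    }
    return table;
}

// Returns the corrected UMI, or the original UMI if the table has no entry.
std::string corrected_umi(const CorrectionTable& table, const std::string& cb,
                          const std::string& gene, const std::string& umi) {
    auto it = table.find(make_key(cb, gene, umi));
    return it == table.end() ? umi : it->second;
}

// Outline of the BAM pass with BamTools (not compiled here; tag names are
// placeholders that depend on the pipeline configuration):
//   while (reader.GetNextAlignment(al)) {
//       // read the CB, gene and UMI tags with al.GetTag(tag, value)
//       al.EditTag(umi_tag, "Z", corrected_umi(table, cb, gene, umi));
//       writer.SaveAlignment(al);
//   }
```

Once the UMI tag holds the corrected sequence, reads that share CB, gene and UMI can be collapsed by whatever deduplication step the downstream application needs.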

> I am not interested in scRNA-seq counts but rather the ability of your pipeline to identify and deduplicate erroneous UMIs for other applications.

Do you mean "deduplicate erroneous scRNA-seq UMIs", or is it about some completely different kind of data? The approach should work whenever you have cells, genes and UMIs. But maybe it can also be adapted to other cases.