Feasibility of including deduplicated alignments

Would it be feasible to support output of a deduplicated BAM as well?

Unfortunately, it doesn't fit the workflow. Merging duplicated UMIs requires a lot of R calls, but all BAM-related functionality is in C++. So the simplest solution would be to run the UMI correction in R, save the list of CB+Gene+UMI+CorrectedUMI to a file, and then have a C++ script that parses this file and writes out the corrected BAM. In my experience, writing such a C++ script is generally faster than waiting for Python to do the same :) You basically need to take the BamTools library, iterate over the BAM, update the tags, and save the result to another BAM. Something like ~50 lines of code. Here is an example of iteration over bam, and here is another one for editing tags.
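As a rough illustration of the parsing half of that script, here is a minimal sketch of loading the CB+Gene+UMI+CorrectedUMI list into a lookup table. The file layout (one tab-separated record per line, four columns in that order) and the names `read_corrections`, `corrected_umi` and `make_key` are assumptions made for this example, not something dropEst produces or defines. The BAM iteration itself is left out.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Maps "CB<tab>Gene<tab>UMI" to the corrected UMI.
// The key layout is an illustrative choice, not a dropEst format.
using CorrectionMap = std::unordered_map<std::string, std::string>;

std::string make_key(const std::string& cb, const std::string& gene,
                     const std::string& umi) {
    return cb + '\t' + gene + '\t' + umi;
}

// Reads records of the assumed form: CB <tab> Gene <tab> UMI <tab> CorrectedUMI.
// Lines with fewer than four non-empty fields are skipped.
CorrectionMap read_corrections(std::istream& in) {
    CorrectionMap corrections;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string cb, gene, umi, corrected;
        if (std::getline(fields, cb, '\t') && std::getline(fields, gene, '\t') &&
            std::getline(fields, umi, '\t') && std::getline(fields, corrected, '\t')) {
            corrections[make_key(cb, gene, umi)] = corrected;
        }
    }
    return corrections;
}

// Returns the corrected UMI if one is recorded, otherwise the original UMI.
std::string corrected_umi(const CorrectionMap& corrections, const std::string& cb,
                          const std::string& gene, const std::string& umi) {
    auto it = corrections.find(make_key(cb, gene, umi));
    return it == corrections.end() ? umi : it->second;
}
```

In the BAM loop, each alignment's cell barcode, gene and UMI tags would be read, passed to `corrected_umi`, and the result written back to the UMI tag before saving the alignment to the output file. Which tag names hold those values depends on how the BAM was produced.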

I am not interested in scRNA-seq counts but rather the ability of your pipeline to identify and deduplicate erroneous UMIs for other applications.

Do you mean "deduplicate erroneous scRNA-seq UMIs", or is it about some completely different kind of data? The approach should work whenever you have cells, genes and UMIs. But maybe it can also be adapted to other cases.

kharchenkolab / dropEst

Feasibility of including deduplicated alignments #109