Closed flowers9 closed 3 years ago
Nevermind. Found a slight coding error in the test code (and tons of read alignments that were actually screened out).
Sorry for any confusion!
That said, it still seems unnecessary to add the qs tag, and it's quite slow. A simple c++ program, compiled and linked against the pacbio Bam library runs nearly 6x as fast on my test case of a 1.2 TB aligned subreads bam (and using the default 4 threads for compression, as per the library).
@flowers9 thank you for your suggestions. We are actively exploring ways to speed up processing, including through the use of the pbbam and pbcore.io libraries. The qs tag appears to be redundant as you suggest, but my concern there is that other tools and scripts may use it. I would have to investigate further whether it makes sense to remove it.
We are working on an alternative solutions that only aligns subreads to their respective CCS read, so hopefully we can remove this filtering step.
Yeah, that would be the obvious win.
I meant that it doesn't seem necessary to add the qs tag not because it's redundant but because it's already present - even your example data set has it in the aligned.subreads.bam file, before the filter step.
In the quick start guide, you've got this code:
It seems to be checking to make sure subreads are aligned against the corresponding ccs read and that it has a quality start tag that matches the start position from the name. However, at least for my version of pbmm2 (1.5.0), all that seems to already be true, meaning this code doesn't seem to do anything. And it's really slow, too - for my test case, with maybe 32gb of ccs sequence, pbmm2 ran in ~20h on a 64-core machine. But this will take maybe 3x as long, and it's not like I can throw cores at it to speed it up (when I tried adding -@ # to samtools, it'd just core out).
Is this code actually necessary for anything, or can I just remove this step?
(I'll note it also seems to not be doing anything for your test data, although I didn't throroughly check that.)