google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License
222 stars 37 forks source link

Quick start example code question #9

Closed flowers9 closed 3 years ago

flowers9 commented 3 years ago

In the quick start guide, you've got this code:

# Only subread alignments to the correct molecule were retained.
# We used samtools and awk to filter incorrect alignments using the below
# command:
samtools view -h "aligned.subreads.bam" | \
  awk '{ if($1 ~ /^@/) { print; } else { split($1,A,"/"); \
  split($3,B,"/"); if(A[2]==B[2]) { split(A[3],C,"_"); \
  print $0 "\tqs:i:" C[1]; } } }' | samtools view -b > "subreads_to_ccs.bam"

It seems to be checking to make sure subreads are aligned against the corresponding ccs read and that it has a quality start tag that matches the start position from the name. However, at least for my version of pbmm2 (1.5.0), all that seems to already be true, meaning this code doesn't seem to do anything. And it's really slow, too - for my test case, with maybe 32gb of ccs sequence, pbmm2 ran in ~20h on a 64-core machine. But this will take maybe 3x as long, and it's not like I can throw cores at it to speed it up (when I tried adding -@ # to samtools, it'd just core out).

Is this code actually necessary for anything, or can I just remove this step?

(I'll note it also seems to not be doing anything for your test data, although I didn't throroughly check that.)

flowers9 commented 3 years ago

Nevermind. Found a slight coding error in the test code (and tons of read alignments that were actually screened out).

Sorry for any confusion!

flowers9 commented 3 years ago

That said, it still seems unnecessary to add the qs tag, and it's quite slow. A simple c++ program, compiled and linked against the pacbio Bam library runs nearly 6x as fast on my test case of a 1.2 TB aligned subreads bam (and using the default 4 threads for compression, as per the library).

deepconsensus_helper.cc.txt

danielecook commented 3 years ago

@flowers9 thank you for your suggestions. We are actively exploring ways to speed up processing, including through the use of the pbbam and pbcore.io libraries. The qs tag appears to be redundant as you suggest, but my concern there is that other tools and scripts may use it. I would have to investigate further whether it makes sense to remove it.

We are working on an alternative solutions that only aligns subreads to their respective CCS read, so hopefully we can remove this filtering step.

flowers9 commented 3 years ago

Yeah, that would be the obvious win.

I meant that it doesn't seem necessary to add the qs tag not because it's redundant but because it's already present - even your example data set has it in the aligned.subreads.bam file, before the filter step.