google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

Missing about 30% of ZMWs in output #42

Closed (gevro closed 2 years ago)

gevro commented 2 years ago

Hi, I'm running the command below and found that ~30% of ZMWs are missing from the DeepConsensus FASTQ output, even though they are present in both the input CCS BAM and the input subreads BAM:

deepconsensus_0.3.1.sif deepconsensus run --batch_size=1024 --batch_zmws=100 --cpus 4 --max_passes 20 --subreads_to_ccs=subreads.bam --ccs_bam=ccs.bam --checkpoint=/model/checkpoint

Is this expected behavior? Is there any way to see in the logs why many ZMWs are not in the output?

PS: I don't think this is due to DeepConsensus output reads falling below the Q20 quality threshold, because my input CCS BAM was generated with --min-rq=0.99. I know you recommend a lower value than that, but if anything, a CCS BAM restricted to reads with rq > 0.99 should not have 30% of its reads fail to get a consensus from DeepConsensus.

PPS: I manually ran DeepConsensus on the CCS and subreads_to_ccs reads of one ZMW that was missing from the output, and I got this: failed_quality_filter=1. In CCS, this ZMW had rq:f:0.994125. Does DeepConsensus use a more stringent definition of read quality, such that it outputs fewer ZMWs than CCS?
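One way to pin down exactly which ZMWs are missing is to compare read names between the CCS input and the FASTQ output. The sketch below is only an illustration: it assumes read names follow the usual PacBio `movie/zmw/ccs` pattern and that the CCS read names have already been extracted into a list (for example from the first column of the BAM records). The helper names `zmw_of`, `fastq_read_names`, and `missing_zmws` are hypothetical and not part of DeepConsensus.

```python
def zmw_of(read_name):
    """Return (movie, zmw) from a read name assumed to look like 'movie/1234/ccs'."""
    movie, zmw = read_name.split("/")[:2]
    return movie, int(zmw)


def fastq_read_names(lines):
    """Yield read names from FASTQ lines, assuming exactly 4 lines per record."""
    for i, line in enumerate(lines):
        if i % 4 == 0 and line.strip():
            # Drop the leading '@' and anything after the first whitespace.
            yield line.strip()[1:].split()[0]


def missing_zmws(ccs_names, fastq_lines):
    """Return the sorted (movie, zmw) pairs present in the CCS input but absent from the FASTQ."""
    expected = {zmw_of(name) for name in ccs_names}
    observed = {zmw_of(name) for name in fastq_read_names(fastq_lines)}
    return sorted(expected - observed)


if __name__ == "__main__":
    # Toy data standing in for real files.
    ccs = ["m1/10/ccs", "m1/11/ccs", "m1/12/ccs"]
    fastq = ["@m1/10/ccs\n", "ACGT\n", "+\n", "IIII\n",
             "@m1/12/ccs\n", "AC\n", "+\n", "II\n"]
    print(missing_zmws(ccs, fastq))
```

With real data, `fastq_lines` could simply be an open file handle on the output FASTQ; the resulting list can then be checked against the per-ZMW outcome, such as the failed_quality_filter counter seen above.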

danielecook commented 2 years ago

By default, DeepConsensus only outputs reads with a predicted quality of >=Q20; reads below that threshold are filtered out.

Try running using --min_quality=0.
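For context, the rq values reported by CCS and the Q20 threshold can be put on the same scale with the standard Phred formula, Q = -10 * log10(1 - rq). The sketch below is only a unit conversion; it does not reproduce how DeepConsensus derives its own read quality, which, judging from the failed_quality_filter=1 result above, appears to be predicted independently of the CCS rq tag. The function names are hypothetical.

```python
import math


def rq_to_phred(rq):
    """Convert a predicted read accuracy (0 <= rq < 1) to a Phred-scaled quality."""
    if not 0.0 <= rq < 1.0:
        raise ValueError("rq must be in [0, 1)")
    return -10.0 * math.log10(1.0 - rq)


def phred_to_rq(q):
    """Convert a Phred-scaled quality back to a predicted accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # CCS --min-rq=0.99 corresponds to Q20; the ZMW from the issue is about Q22.3.
    for rq in (0.99, 0.994125):
        print(f"rq={rq} -> Q{rq_to_phred(rq):.2f}")
```

By this conversion, the ZMW with rq:f:0.994125 is roughly Q22.3 according to CCS, so it clears Q20 on the CCS side; a read like this being dropped suggests the DeepConsensus quality estimate for it came out lower than the CCS one.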

pichuan commented 2 years ago

Hi @gevro, hopefully @danielecook's answer resolved your issue. I'm closing this. If you have more questions, please feel free to open another issue or reopen this one. Thank you!