google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

normal pass / fail rate? #72

Closed: AlcaArctica closed this issue 1 year ago

AlcaArctica commented 1 year ago

I am new to PacBio sequencing and have just tried out the DeepConsensus pipeline on a fairly large dataset (a 12 Gbp genome). I obtained six subread BAM files, each of which I split into 500 chunks. My resulting assembly does not look good, but of course there are many possible reasons for that. In any case, I wanted to go back and review my use of the deepconsensus tool. For example, I am running the following commands (shown for chunk 100 of 500; a sketch of the full per-chunk loop follows them):

ccs /projects/dazzlerAssembly/asm_vpTaxBacc_BK34-6/ccs/linked_subreads/subread1.bam \
    /projects/dazzlerAssembly/asm_vpTaxBacc_BK34-6/ccs/ccs_subreads1/ccs_chunk100.bam \
    --min-rq=0.88 --chunk 100/500 -j 24 --log-level INFO \
    --log-file /projects/dazzlerAssembly/asm_vpTaxBacc_BK34-6/ccs/ccs_subreads1/ccs_chunk100.log

actc -j 24 /projects/dazzlerAssembly/asm_vpTaxBacc_BK34-6/ccs/linked_subreads/subread1.bam \
    /projects/dazzlerAssembly/asm_vpTaxBacc_BK34-6/ccs/ccs_subreads1/ccs_chunk100.bam \
    /projects/dazzlerAssembly/asm_vpTaxBacc_BK34-6/ccs/ccs_subreads1/ccs_chunk100.subreads_to_ccs.bam

deepconsensus run \
    --subreads_to_ccs /projects/dazzlerAssembly/asm_vpTaxBacc_BK34-6/ccs/ccs_subreads1/ccs_chunk100.subreads_to_ccs.bam \
    --ccs_bam /projects/dazzlerAssembly/asm_vpTaxBacc_BK34-6/ccs/ccs_subreads1/ccs_chunk100.bam \
    --checkpoint /lustre/projects/dazzler/uelze/sw/deepconsensus_model/checkpoint \
    --output /projects/dazzlerAssembly/asm_vpTaxBacc_BK34-6/ccs/ccs_subreads1/deepcons_chunk100.fastq
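
The per-chunk loop, as a minimal sketch: the chunk count, flags, and paths come from the commands above; the loop itself and the serial ordering are assumptions (in practice each chunk would more likely be submitted as its own cluster job).

#!/usr/bin/env bash
set -euo pipefail

SUBREADS=/projects/dazzlerAssembly/asm_vpTaxBacc_BK34-6/ccs/linked_subreads/subread1.bam
OUTDIR=/projects/dazzlerAssembly/asm_vpTaxBacc_BK34-6/ccs/ccs_subreads1
CHECKPOINT=/lustre/projects/dazzler/uelze/sw/deepconsensus_model/checkpoint
N_CHUNKS=500

for i in $(seq 1 "$N_CHUNKS"); do
  # 1) draft consensus with ccs, keeping lower-quality reads for DeepConsensus
  ccs "$SUBREADS" "$OUTDIR/ccs_chunk${i}.bam" \
      --min-rq=0.88 --chunk "${i}/${N_CHUNKS}" -j 24 \
      --log-level INFO --log-file "$OUTDIR/ccs_chunk${i}.log"

  # 2) align the subreads back to their draft CCS reads
  actc -j 24 "$SUBREADS" "$OUTDIR/ccs_chunk${i}.bam" \
      "$OUTDIR/ccs_chunk${i}.subreads_to_ccs.bam"

  # 3) polish the drafts with DeepConsensus
  deepconsensus run \
      --subreads_to_ccs "$OUTDIR/ccs_chunk${i}.subreads_to_ccs.bam" \
      --ccs_bam "$OUTDIR/ccs_chunk${i}.bam" \
      --checkpoint "$CHECKPOINT" \
      --output "$OUTDIR/deepcons_chunk${i}.fastq"
done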

This is my result:

ZMWs input               : 8480         

ZMWs pass filters        : 4317 (50.91%)
ZMWs fail filters        : 4163 (49.09%)
ZMWs shortcut filters    : 0 (0.000%)

ZMWs with tandem repeats : 41 (0.985%)

Exclusive failed counts
Below SNR threshold      : 163 (3.915%)
Median length filter     : 0 (0.000%)
Lacking full passes      : 3734 (89.70%)
Heteroduplex insertions  : 98 (2.354%)
Coverage drops           : 15 (0.360%)
Insufficient draft cov   : 37 (0.889%)
Draft too different      : 0 (0.000%)
Draft generation error   : 112 (2.690%)
Draft above --max-length : 0 (0.000%)
Draft below --min-length : 0 (0.000%)
Reads failed polishing   : 0 (0.000%)
Empty coverage windows   : 0 (0.000%)
CCS did not converge     : 2 (0.048%)
CCS below minimum RQ     : 2 (0.048%)
Unknown error            : 0 (0.000%)

Additional passing metrics
ZMWs missing adapters    : 79 (1.830%)

The end of the log file reads:

Processed a batch of 17 ZMWs in 8.062 seconds
Processed 4317 ZMWs in 2039.006 seconds
Outcome counts: OutcomeCounter(empty_sequence=0, only_gaps=1, failed_quality_filter=69, failed_length_filter=0, success=4247)

Now my question is: is there something wrong with how I applied deepconsensus that could explain my bad assembly? Is it normal to have such a high fail rate (about 50% across all of my chunks)? I have attached the full log for further information: log.txt
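
One way to double-check that rate pooled over every chunk, as a sketch: it assumes each ccs run also wrote its default ccs_chunk*.ccs_report.txt summary next to the output BAM (adjust the glob if your layout differs, and repeat or widen it for the other five subread files).

awk -F'[:(]' '
  /ZMWs input/        { total += $2 }   # per-chunk ZMW input count
  /ZMWs pass filters/ { pass  += $2 }   # per-chunk passing count
  END { printf "pooled: %d of %d ZMWs passed (%.2f%%)\n",
        pass, total, 100 * pass / total }
' /projects/dazzlerAssembly/asm_vpTaxBacc_BK34-6/ccs/ccs_subreads1/ccs_chunk*.ccs_report.txt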

AndrewCarroll commented 1 year ago

Hi @AlcaArctica

A ~50% rate of CCS molecules failing the Q20 filter is not far from expectations for Sequel II SMRT cells. It might be slightly on the high side of normal, which could be explained by read lengths that are a bit longer than typical (closer to 20 kb versus 15 kb), or by a pass count that is lower for some other reason.
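
Consistent with that, "Lacking full passes" dominates the exclusive failure counts above (3734 of the 4163 failures). As a rough back-of-envelope: the number of full passes is roughly the polymerase read length divided by the insert length, and ccs requires at least 3 full passes by default, so longer inserts leave less headroom. The 100 kb polymerase read length below is an assumed round number, not taken from this run.

awk 'BEGIN {
  poly = 100000                          # assumed average polymerase read length (bp)
  n = split("15000 20000 25000", ins, " ")
  for (i = 1; i <= n; i++)
    printf "%2d kb insert -> ~%.1f full passes\n", ins[i] / 1000, poly / ins[i]
}'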

I suspect the poorer assembly isn't due to anything in the DeepConsensus run, but instead reflects either a genome that is complex to assemble or coverage that is lower than the assembly needs.

AlcaArctica commented 1 year ago

Thank you, that is reassuring! I think this issue can be closed then. I will continue to explore my assembly workflow.