google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

DeepConsensus 0.3 --min_quality has no default? #31

Closed: MartinPippel closed this issue 2 years ago

MartinPippel commented 2 years ago

Dear DeepConsensus developers,

Thanks for the update. I really like the performance improvements in release 0.3.

Quick question: why is the --min_quality flag not set to 20 by default? Because of the high yield, it was hard to notice that I got several reads with an error rate far above 2%.

I usually run DC (v0.3) only on reads that have 97%-99.9% consensus accuracy (to save time). Without setting --min_quality to 20, I get the following results (reads filtered afterwards: >=Q20 is pass, otherwise fail):

file            num_seqs         sum_len  min_len   avg_len  max_len      Q1      Q2      Q3  sum_gap     N50  Q20(%)  Q30(%)
DC_OUT_fail.fq    33,456     511,999,446      577  15,303.7   49,575  12,407  14,738  17,771        0  15,831   81.25   54.75
DC_OUT_pass.fq   796,187  12,815,329,195      510  16,095.9   56,419  13,044  15,533  18,748        0  16,679   98.36   94.94
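
For readers who want to reproduce this kind of post-hoc pass/fail split, here is a minimal Python sketch. It assumes Phred+33 encoded FASTQ with four lines per record, and it assumes the read-level quality is derived from the mean per-base error probability; DeepConsensus may compute its read quality differently. The function names are hypothetical and not part of DeepConsensus.

```python
import math

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger) quality encoding


def read_quality(qual: str) -> float:
    """Read-level Phred score from the mean per-base error probability."""
    if not qual:
        return 0.0
    probs = [10 ** (-(ord(c) - PHRED_OFFSET) / 10) for c in qual]
    return -10 * math.log10(sum(probs) / len(probs))


def parse_fastq(lines):
    """Yield (header, seq, qual) tuples from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it, "")
        next(it, "")  # skip the '+' separator line
        qual = next(it, "")
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def split_by_quality(records, min_quality=20.0):
    """Split records into (passed, failed) lists by read-level quality."""
    passed, failed = [], []
    for rec in records:
        target = passed if read_quality(rec[2]) >= min_quality else failed
        target.append(rec)
    return passed, failed
```

With a FASTQ file opened as `fh`, `split_by_quality(parse_fastq(fh))` would return the two groups that correspond to the pass and fail files above, under the stated assumptions.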

When I add --min_quality 20 to the deepconsensus run, I get the expected results:

./m54345U_220708_140613_v03/DCIN_GE0.97_LE0.999/DC_OUT.deepconsensus.fq.gz  FASTQ   DNA     796,188  12,815,347,763      510  16,095.9   56,419    13,044  15,533    18,748        0  16,679   98.36   94.94

Just for completeness: a very small fraction of the DC-improved reads got a lower QV than the input reads. When I try to rescue those that initially had >=Q20, I get the following "rescued" read stats:

./m54345U_220708_140613_v03/DCIN_GE0.97_LE0.999/DC_OUT.ccsRescued.fq.gz     FASTQ   DNA       2,136      31,701,185      573  14,841.4   38,323  11,889.5  14,319  17,213.5        0  15,400   95.24   90.55
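
The rescue step described above can be sketched as a simple selection rule. This is only an illustration: it assumes that read names are identical in the CCS input and the DeepConsensus output, and that a read-level quality has already been computed for each read. The function name and the dictionary inputs are hypothetical.

```python
def reads_to_rescue(ccs_quality, dc_quality, min_quality=20.0):
    """Names of reads whose DC quality is below the threshold
    while their original CCS quality was at or above it."""
    return sorted(
        name
        for name, dc_q in dc_quality.items()
        if dc_q < min_quality
        and ccs_quality.get(name, float("-inf")) >= min_quality
    )


# Toy values: read "a" dropped below Q20 after DC, so it is rescued.
ccs = {"a": 25.0, "b": 18.0, "c": 30.0}
dc = {"a": 15.0, "b": 12.0, "c": 35.0}
rescued = reads_to_rescue(ccs, dc)
```

The original CCS records for the returned names would then be written to the rescued file in place of the lower-quality DC output.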

Once again, thanks for the great tool. I posted this message so that others are aware of this behaviour. The attached figure compares unfiltered QV values between pbccs v6.4, deepconsensus v0.2, and deepconsensus v0.3 (from left to right). The red dots highlight reads <Q20.

Cheers, Martin

danielecook commented 2 years ago

@MartinPippel thank you for sharing your results. The plot is nice.

We decided to change the --min_quality default so that all reads are output. Users should set this value as needed for their use case.

I have updated the release notes to add an item detailing this change.

danielecook commented 2 years ago

@MartinPippel after discussing this issue, we decided to revert the --min_quality flag to its original default value of 20. This update was just released (v0.3.1), and it is the only significant change in that release.