google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

Lower quality and fewer reads in DeepConsensus 1.0 output compared to ccs #57

Closed: daaaaande closed this issue 1 year ago

daaaaande commented 1 year ago

possibly related to #54

To test DeepConsensus 1.0, I ran the example data through ccs and DeepConsensus according to the quick start guide. The attached MultiQC report shows lower Q scores in both DeepConsensus outputs than in the ccs equivalents.

DeepConsensus output .fastq with the lower quality: (screenshot from 2023-01-24 13-10-46)

And the corresponding ccs file, with much higher quality: (screenshot from 2023-01-24 13-10-56)

Both DeepConsensus files show a similar drop in quality.

Very similar results are also coming from two human DNA SMRT Cells from Sequel II systems, where the output was already good. The Q scores were much lower in the DeepConsensus file than in the ccs file, and even a few reads were missing!

To check the results orthogonally, I mapped both files (the human samples) to hs1 and got a marginally better mapping percentage with the DeepConsensus output than with the ccs output.

Another point that I do not understand is the base-dependent Q score that disappears after DeepConsensus: see the fastp reports below.

ccs: (screenshot from 2023-01-24 13-18-10)

and the deepconsensus file:

(screenshot from 2023-01-24 13-19-00)

For all these files I ran DeepConsensus 1.0.0 (CPU) with no chunking in any step of the process.

sysinfo:

deepconsensus --version
2023-01-24 13:12:33.416614: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
1.0.0

uname -a
Linux 5.15.0-57-generic #63-Ubuntu SMP Thu Nov 24 13:43:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

ccs --version
ccs 6.4.0 (commit v6.4.0)

Using:
  unanimity : 6.4.0 (commit v6.4.0)
  pbbam     : 2.1.0 (commit v2.0.0-26-g05a8314)
  pbcopper  : 2.0.0 (commit v2.0.0-52-ga0c9454)
  boost     : 1.76
  htslib    : 1.15
  zlib      : 1.2.11

multiqc --version
multiqc, version 1.13.dev0

fastqc --version
FastQC v0.11.9

Is there maybe a non-default option I missed?

danielecook commented 1 year ago

@daaaaande thank you for your thorough investigation here. To address some of your comments:

Very similar results are also coming from two human DNA SMRT Cells from Sequel II systems, where the output was already good. The Q scores were much lower in the DeepConsensus file than in the ccs file

... and even a few reads were missing!

Another point that I do not understand is the base-dependent Q score that disappears after DeepConsensus: see the fastp reports below.

This is an interesting observation. Base probabilities are generated using the outputs of the DeepConsensus model, which appears to remove the base-dependent effect. Further investigation here would be helpful to determine whether the quality predictions accurately reflect the base error rates when stratified by base with both CCS and DeepConsensus.
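As a possible starting point for that investigation, here is a minimal Python sketch that computes the mean predicted quality per base from a FASTQ stream. It only summarizes the predicted qualities; checking them against real error rates would additionally need alignments to a trusted reference, which this sketch does not attempt. The function name `mean_quality_by_base` is made up for illustration, and the code assumes plain 4-line FASTQ records with the standard Phred+33 encoding.

```python
from collections import defaultdict
import io


def mean_quality_by_base(fastq_handle, phred_offset=33):
    """Return {base: mean Phred score} for a 4-line-per-record FASTQ stream.

    Assumes Phred+33 quality encoding and unwrapped sequence/quality lines.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        seq = fastq_handle.readline().strip()
        fastq_handle.readline()  # '+' separator line, ignored
        qual = fastq_handle.readline().strip()
        for base, q_char in zip(seq, qual):
            totals[base] += ord(q_char) - phred_offset
            counts[base] += 1
    return {base: totals[base] / counts[base] for base in counts}


# Tiny synthetic record (not real data): 'I' encodes Q40, '5' encodes Q20.
example = io.StringIO("@read1\nACGA\n+\nI5II\n")
print(mean_quality_by_base(example))
```

Running this on the ccs FASTQ and on the DeepConsensus FASTQ and comparing the two dictionaries should show whether the per-base difference visible in the fastp reports is also present in the raw quality strings.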

daaaaande commented 1 year ago

Thanks for your swift answers!

DeepConsensus caps base qualities at 40 currently. This corresponds with a predicted error rate of 1/10,000, a rate that corresponds with approx 1 error / HiFi read. We plan to allow users to configure this capping behavior in the next release.
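The arithmetic behind that statement can be sketched with the standard Phred relation Q = -10 * log10(p). The read length of 10,000 bases used below is only an illustrative assumption for a HiFi read, not a number from this thread, and the helper names are made up for this example.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def expected_errors(read_length, q):
    """Expected number of errors if every base in the read has quality q."""
    return read_length * phred_to_error_prob(q)


# Q40 corresponds to an error probability of 1 in 10,000.
print(phred_to_error_prob(40))

# Assumed read length of 10,000 bases (illustrative only): about one error.
print(expected_errors(10_000, 40))
```

Because every base is capped at Q40, a per-file mean quality computed by tools such as FastQC cannot exceed 40 for DeepConsensus output, whereas uncapped ccs qualities can pull the average higher.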

By default, DeepConsensus will filter out reads where the min_quality is less than 20. If you want to recover all reads you can set --min_quality=0 when running DeepConsensus.
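To see which reads were dropped by that filter, one option is to compare read names between the two FASTQ files. The sketch below assumes 4-line FASTQ records and assumes that a read keeps the same name in the ccs and DeepConsensus outputs; the function names and the read names in the example are hypothetical.

```python
import io


def read_names(fastq_handle):
    """Collect read names (header text up to the first whitespace, without '@')."""
    names = set()
    for line_number, line in enumerate(fastq_handle):
        # In a 4-line FASTQ record, every fourth line (0, 4, 8, ...) is a header.
        if line_number % 4 == 0 and line.strip():
            names.add(line[1:].split()[0])
    return names


def missing_reads(ccs_handle, deepconsensus_handle):
    """Return the sorted names present in the ccs file but absent from DeepConsensus."""
    return sorted(read_names(ccs_handle) - read_names(deepconsensus_handle))


# Synthetic example with hypothetical read names (not real data).
ccs_fastq = io.StringIO("@zmw1/ccs\nACGT\n+\nIIII\n@zmw2/ccs\nACGT\n+\n5555\n")
dc_fastq = io.StringIO("@zmw1/ccs\nACGT\n+\nIIII\n")
print(missing_reads(ccs_fastq, dc_fastq))
```

If the missing reads reappear after rerunning with --min_quality=0, the quality filter accounts for the difference in read counts.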

Further investigation here would be helpful to determine whether the quality predictions accurately reflect the base error rates when stratified by base with both CCS and DeepConsensus.

I will close this issue for now, since you explained the reason for my biggest concern (the large gap between the ccs and DeepConsensus average base qualities).