Closed daaaaande closed 1 year ago
@daaaaande thank you for your thorough investigation here. To address some of your comments:
very similar results are also coming from 2 human DNA smrtcells from SequelII systems, where the output was already good. The q scores were much lower in the deepconsensus versus the ccs file
... and even a few reads were missing!
By default, DeepConsensus will filter out reads where the min_quality is less than 20. If you want to recover all reads you can set --min_quality=0 when running DeepConsensus.
Another point that I do not understand is the base-dependent q score that disappears after deepconsensus: see fastp reports below.
This is an interesting observation. Base probabilities are generated using the outputs of the DeepConsensus model, which appears to remove the base-dependent effect. Further investigation here would be helpful to determine if the quality predictions accurately reflect the base error rates when stratified by base with both CCS and DeepConsensus.
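One way to start on the stratified comparison is to summarize predicted quality per base directly from the FASTQ records. The sketch below is only illustrative: the function name and the toy records are made up, it assumes Phred+33 encoded quality strings, and it reports predicted quality only. Checking it against the empirical error rate would additionally require alignments to a reference, which is not shown here.

```python
import math
from collections import defaultdict

def mean_quality_by_base(records):
    """Return {base: Phred score of the mean predicted error probability}.

    `records` is an iterable of (sequence, quality_string) pairs, with
    quality strings assumed to be Phred+33 encoded.
    """
    total_p = defaultdict(float)
    counts = defaultdict(int)
    for seq, qual in records:
        for base, qchar in zip(seq, qual):
            q = ord(qchar) - 33               # decode Phred+33
            total_p[base] += 10 ** (-q / 10)  # predicted error probability
            counts[base] += 1
    return {b: -10 * math.log10(total_p[b] / counts[b]) for b in counts}

# Toy records (made-up data): 'I' encodes Q40, '5' encodes Q20.
records = [("ACGT", "II5I"), ("GGCA", "55II")]
by_base = mean_quality_by_base(records)
for base in sorted(by_base):
    print(base, round(by_base[base], 1))
```

Running the same summary on the CCS and DeepConsensus outputs would show whether the base-dependent pattern seen in the fastp reports is present in the predicted qualities themselves.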
Thanks for your swift answers!
DeepConsensus currently caps base qualities at 40. This corresponds to a predicted error rate of 1/10,000, or approximately 1 error per HiFi read. We plan to allow users to configure this capping behavior in the next release.
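The arithmetic behind the Q40 cap can be checked with a short sketch. It uses the standard Phred relation Q = -10 * log10(p); the 10,000-base read length is an illustrative assumption and not a figure from this thread.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)

cap = 40
p_at_cap = phred_to_error_prob(cap)        # about 1 in 10,000
read_length = 10_000                       # assumed HiFi read length in bases
expected_errors = p_at_cap * read_length   # about 1 error per read at the cap

print(f"error probability at Q{cap}: {p_at_cap:.0e}")
print(f"expected errors in a {read_length}-base read: {expected_errors:.2f}")
```

With a cap in place, any base the model considers better than Q40 is still reported as Q40, which pulls average base quality down relative to an uncapped tool.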
By default, DeepConsensus will filter out reads where the min_quality is less than 20. If you want to recover all reads you can set --min_quality=0 when running DeepConsensus.
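As a rough illustration of how such a read-level filter behaves, the sketch below drops reads whose overall quality falls below a threshold. It assumes the read-level quality is derived from the mean per-base error probability; that averaging scheme and the function names are illustrative assumptions, not a confirmed description of the DeepConsensus implementation.

```python
import math

def read_quality(base_quals):
    """Read-level Phred score from the mean per-base error probability (assumed scheme)."""
    probs = [10 ** (-q / 10) for q in base_quals]
    mean_p = sum(probs) / len(probs)
    return -10 * math.log10(mean_p)

def filter_reads(reads, min_quality=20):
    """Keep (name, base_quals) pairs whose read-level quality is >= min_quality."""
    return [(name, quals) for name, quals in reads
            if read_quality(quals) >= min_quality]

# Toy reads (made-up data).
reads = [
    ("read_a", [40, 40, 40, 40]),  # uniformly high quality
    ("read_b", [40, 40, 5, 5]),    # two poor bases dominate the mean error
]
kept_default = filter_reads(reads)            # threshold of 20 drops read_b
kept_all = filter_reads(reads, min_quality=0) # threshold of 0 keeps both

print([name for name, _ in kept_default])
print([name for name, _ in kept_all])
```

Under this scheme a handful of low-quality bases can push an otherwise good read below the threshold, which is one plausible way a few reads end up missing from the output.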
Further investigation here would be helpful to determine if the quality predictions accurately reflect the base error rates when stratified by base with both CCS and DeepConsensus.
I will close this issue for now, since you explained the reason for my biggest concern (the large difference between the ccs and deepconsensus average base qualities).
possibly related to #54
To test deepconsensus 1.0, I ran the example data with ccs and deepconsensus according to the quick start guide. The attached multiqc report shows lower q scores in both deepconsensus outputs compared to the ccs equivalent.
deepconsensus output .fastq with the lower quality:
and the corresponding ccs file, much higher quality:
both deepconsensus files show a similar drop in quality.
very similar results are also coming from 2 human DNA smrtcells from SequelII systems, where the output was already good. The q scores were much lower in the deepconsensus versus the ccs file, and even a few reads were missing!
To check the results orthogonally, I mapped both files (the human samples) to hs1 and got a marginally better mapping % with the deepconsensus output compared to the ccs.
Another point that I do not understand is the base-dependent q score that disappears after deepconsensus: see fastp reports below.
ccs:
and the deepconsensus file:
For all these files I ran deepconsensus 1.0.0 on CPU, with no chunking in any step of the process.
sysinfo:
deepconsensus --version
2023-01-24 13:12:33.416614: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
1.0.0
uname -a
Linux 5.15.0-57-generic #63-Ubuntu SMP Thu Nov 24 13:43:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
ccs --version
ccs 6.4.0 (commit v6.4.0)
Using:
  unanimity : 6.4.0 (commit v6.4.0)
  pbbam : 2.1.0 (commit v2.0.0-26-g05a8314)
  pbcopper : 2.0.0 (commit v2.0.0-52-ga0c9454)
  boost : 1.76
  htslib : 1.15
  zlib : 1.2.11
multiqc --version
multiqc, version 1.13.dev0
fastqc --version
FastQC v0.11.9
Is there maybe a non-default option I missed?