lower quality and fewer reads in deepconsensus 1.0 output compared to ccs

daaaaande commented 1 year ago

possibly related to #54

To test deepconsensus 1.0, I ran the example data through ccs and deepconsensus according to the quick start guide. The attached multiqc report shows lower q scores in both deepconsensus files compared to the ccs equivalent.

deepconsensus output .fastq with the lower quality: [screenshot: Bildschirmfoto vom 2023-01-24 13-10-46]

and the corresponding ccs file, with much higher quality: [screenshot: Bildschirmfoto vom 2023-01-24 13-10-56]

Both deepconsensus files show a similar drop in quality.

Very similar results are also coming from 2 human DNA SMRT Cells from Sequel II systems, where the output was already good. The q scores were much lower in the deepconsensus file than in the ccs file, and a few reads were even missing!

To check the results orthogonally, I mapped both files (the human samples) to hs1 and got a marginally better mapping % with the deepconsensus output than with the ccs output.

Another point that I do not understand is the base-dependent q score that disappears after deepconsensus: see the fastp reports below.

ccs: [screenshot: Bildschirmfoto vom 2023-01-24 13-18-10]

and the deepconsensus file: [screenshot: Bildschirmfoto vom 2023-01-24 13-19-00]

For all these files I ran deepconsensus 1.0.0 on CPU, with no chunking in any step of the process.

sysinfo:

deepconsensus --version
2023-01-24 13:12:33.416614: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
1.0.0

uname -a
Linux 5.15.0-57-generic #63-Ubuntu SMP Thu Nov 24 13:43:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

ccs --version
ccs 6.4.0 (commit v6.4.0)

Using:
  unanimity : 6.4.0 (commit v6.4.0)
  pbbam     : 2.1.0 (commit v2.0.0-26-g05a8314)
  pbcopper  : 2.0.0 (commit v2.0.0-52-ga0c9454)
  boost     : 1.76
  htslib    : 1.15
  zlib      : 1.2.11

multiqc --version
multiqc, version 1.13.dev0

fastqc --version
FastQC v0.11.9

Is there maybe a non-default option I missed?

danielecook commented 1 year ago

@daaaaande thank you for your thorough investigation here. To address some of your comments:

> very similar results are also coming from 2 human DNA smrtcells from SequelII systems, where the output was already good. The q scores were much lower in the deepconsensus versus the ccs file

DeepConsensus currently caps base qualities at 40. This corresponds to a predicted error rate of 1/10,000, which works out to approximately 1 error per HiFi read. We plan to allow users to configure this capping behavior in the next release.
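
For reference, the arithmetic behind these numbers can be sketched with the standard Phred relation, Q = -10 * log10(p). This is a generic illustration and not DeepConsensus code: the function names and the 10 kb read length are assumptions chosen for the example.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert a per-base error probability to a Phred quality score."""
    return -10 * math.log10(p)


def cap_quality(q: float, max_q: float = 40) -> float:
    """Clamp a quality score to an upper bound (40 is the cap named above)."""
    return min(q, max_q)


if __name__ == "__main__":
    # Hypothetical read length; real HiFi read lengths vary.
    read_length = 10_000
    p = phred_to_error_prob(40)
    print("error probability at Q40:", p)
    print("expected errors in one read:", read_length * p)
    # Any input quality above the cap collapses to the cap.
    print("Q60 after capping:", cap_quality(60))
```

With a cap in place, every base that ccs reported above Q40 can be at most Q40 in the capped output, so a lower average quality does not by itself indicate a less accurate read.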

> ... and even a few reads were missing!

By default, DeepConsensus will filter out reads where the min_quality is less than 20. If you want to recover all reads you can set --min_quality=0 when running DeepConsensus.

> Another point that i do not understand is the base-dependent q score that disappears after deepconsensus: see fastp reports below.

This is an interesting observation. Base probabilities are generated using the outputs of the DeepConsensus model, which appears to remove the base-dependent effect. Further investigation here would be helpful to determine whether the quality predictions accurately reflect the base error rates when stratified by base, for both CCS and DeepConsensus.
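
As a starting point for that kind of stratification, the reported qualities can be averaged per base directly from a FASTQ file. The sketch below is a hypothetical helper, not part of DeepConsensus, ccs, or fastp; it assumes four-line FASTQ records with Phred+33 encoding, and it only summarizes the reported scores. Checking whether those scores match true error rates would additionally need alignments against a trusted reference.

```python
from collections import defaultdict


def read_fastq(path):
    """Yield (sequence, quality_string) pairs from a four-line-per-record FASTQ."""
    with open(path) as handle:
        while True:
            header = handle.readline()
            if not header:
                break
            seq = handle.readline().strip()
            handle.readline()  # '+' separator line
            qual = handle.readline().strip()
            yield seq, qual


def mean_quality_by_base(records, phred_offset=33):
    """Return the arithmetic mean of the reported Phred score for each base."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for seq, qual in records:
        for base, qchar in zip(seq, qual):
            totals[base] += ord(qchar) - phred_offset
            counts[base] += 1
    return {base: totals[base] / counts[base] for base in counts}


if __name__ == "__main__":
    # Toy record: 'I' encodes Q40 and '5' encodes Q20 in Phred+33.
    toy = [("ACGT", "II5I")]
    print(mean_quality_by_base(toy))
```

Running `mean_quality_by_base(read_fastq(path))` on the ccs and DeepConsensus FASTQ files (paths not shown here) would give one number per base for each tool, comparable to the per-base curves in the fastp reports.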

daaaaande commented 1 year ago

Thanks for your swift answers!

> DeepConsensus caps base qualities at 40 currently. This corresponds with a predicted error rate of 1/10,000, a rate that corresponds with approx 1 error / HiFi read. We plan to allow users to configure this capping behavior in the next release.

Why? Did you see one error in each read in the data? Tbh I do not see a logical reason for this in my data. Also, since the input has higher q scores, I would expect to hit the ceiling (40) on all reads that exceed it in the "input". Is that due to the model being trained to be "conservative" with the q scores? Anyway, the option to remove the cap would be fantastic!

> By default, DeepConsensus will filter out reads where the min_quality is less than 20. If you want to recover all reads you can set --min_quality=0 when running DeepConsensus.

Thanks, I did not know this. Also, I did not check whether all missing reads are q<20.
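
One way to make that check is to list the read names present in the ccs FASTQ but absent from the DeepConsensus FASTQ. The sketch below is a hypothetical helper under two assumptions: both files use four-line FASTQ records, and both tools write the same read names. Note that the filter acts on the quality predicted by DeepConsensus, so the ccs quality of a missing read is only an indirect hint.

```python
def fastq_read_names(lines):
    """Collect read names from an iterable of FASTQ lines (four lines per record)."""
    names = set()
    for index, line in enumerate(lines):
        # Every fourth line, starting with the first, is a '@name ...' header.
        if index % 4 == 0 and line.strip():
            names.add(line.strip()[1:].split()[0])
    return names


def missing_reads(ccs_lines, dc_lines):
    """Return read names found in the ccs input but not in the DeepConsensus output."""
    return fastq_read_names(ccs_lines) - fastq_read_names(dc_lines)


if __name__ == "__main__":
    # Toy data with made-up read names.
    ccs = ["@read1/ccs", "ACGT", "+", "IIII", "@read2/ccs", "ACGT", "+", "5555"]
    dc = ["@read1/ccs", "ACGT", "+", "IIII"]
    print(sorted(missing_reads(ccs, dc)))
```

For real files, open file handles can be passed directly, for example `missing_reads(open(ccs_path), open(dc_path))`, where both path variables are placeholders.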

> Further investigation here would be helpful to determine if the quality predictions accurately reflect the base error rates when stratified by base with both CCS and DeepConsensus.

Since ccs seems to have a base dependence here and DeepConsensus seems to mostly remove it, I will forward this to PacBio. There might be a base-dependent signal/noise characteristic here, caused by the chemistry or optics, that I did not know about. Anyway, making the scores of all bases more similar seems like an improvement to me.

I will close this issue for now, since you explained the reason for my biggest concern (the large delta between the ccs and deepconsensus base quality averages).

google/deepconsensus, issue #57