I am running a benchmark to compare the performance of dorado 0.3.x, 0.4.x and 0.5.x with both the 4.2 and 4.3 sup models.
Since 0.3.4 doesn't support the 4.3 model, I didn't run that combination.
The test data I used is:
s3://ont-open-data/giab_2023.05/flowcells/hg002/20230429_1600_2E_PAO83395_124388f5/
I chose it because its throughput is quite close to what I routinely get at my workplace.
The command I ran looks like this:
./dorado basecaller -r --modified-bases-models dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mCG_5hmCG@v2 dna_r10.4.1_e8.2_400bps_sup@v4.2.0 ./ont-open-data/giab_2023.05/hg002/20230429_1600_2E_PAO83395_124388f5/ > HG002.ubam
To measure base accuracy, I aligned all the reads to the HG002v1.0.1 diploid genome, which is claimed to be Q75.
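For readers who want to relate alignment-based accuracy to Phred-scaled values such as Q75, here is a minimal sketch. The exact accuracy metric used in this benchmark is not stated, so the error-rate definition below (mismatches, insertions and deletions divided by all alignment columns) is an assumption, and all function names are hypothetical.

```python
import math


def error_rate_from_counts(matches, mismatches, insertions, deletions):
    """Per-base error rate over all alignment columns.

    Assumed definition: errors / (matches + errors). Other tools may
    define accuracy differently (e.g. ignoring indels).
    """
    errors = mismatches + insertions + deletions
    columns = matches + errors
    if columns == 0:
        raise ValueError("alignment has no columns")
    return errors / columns


def phred_from_error_rate(error_rate):
    """Phred scale: Q = -10 * log10(error rate)."""
    if error_rate <= 0:
        return math.inf  # no observed errors
    return -10.0 * math.log10(error_rate)


def error_rate_from_phred(q):
    """Inverse of the Phred scale: error rate = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)
```

Under this convention, a reference claimed to be Q75 corresponds to roughly 3 errors per 100 million bases, so reference errors should be negligible next to read errors in the Q10 to Q30 range.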
Observations and Questions:
Surprisingly, 0.3.4 generates reads that are 10% longer than those from 0.4.3 and 0.5.3, and 3x more reads >=50K. I think this matters more than higher base accuracy, especially for assembly and SV calling. Why are the reads getting shorter with newer dorado versions? In my experience with ONT data, a small percentage of reads are concatenations of the two strands of the same molecule. Do 0.4.3 and 0.5.3 have a way to combine them or cut them off, so that the reads come out shorter and the longer 0.3.4 reads are actually useless?
The 4.3 model performs better than the 4.2 model in both 0.4.3 and 0.5.3. Most notably, for the qs<10 reads the error rate drops from 17% to 10%. Perhaps there is room to lower the qs cutoff?
Interestingly, dorado 0.4.3 is slightly more accurate than 0.5.3. What happened?
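The read-length comparison in the first observation can be reproduced from a plain list of read lengths. The sketch below is only an illustration of that kind of summary, not the script used for this benchmark: the function name, the returned keys and the 50,000-base default cutoff are assumptions.

```python
def summarize_read_lengths(lengths, long_cutoff=50_000):
    """Summarize read lengths: count, mean, N50 and number of long reads.

    `long_cutoff` is a hypothetical parameter; 50,000 mirrors the
    ">=50K" threshold mentioned in the observations.
    """
    if not lengths:
        raise ValueError("no read lengths given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)

    # N50: length of the read at which the cumulative sum of the
    # longest reads first reaches half of all bases.
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "reads": len(ordered),
        "total_bases": total,
        "mean_length": total / len(ordered),
        "n50": n50,
        "long_reads": sum(1 for x in ordered if x >= long_cutoff),
    }
```

Running this on the lengths from each dorado version and comparing `mean_length` and `long_reads` would give the same kind of comparison as the "10% longer" and "3x more >=50K" figures.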
Here are the numbers: <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">