nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Any Duplex Models for R9.4.1 #733

Closed StephDC closed 5 months ago

StephDC commented 5 months ago

Issue Report

Please describe the issue:

I have a DNA sample sequenced with R9.4.1, and I wonder whether I can run duplex basecalling on it.

According to the current model selection source code, it seems that duplex only works for the 4 conditions of R10.4.1. Is there any plan to implement duplex basecalling for R9.4.1?

https://github.com/nanoporetech/dorado/blob/release-v0.6.0/dorado/models/models.cpp#L414-L440

If not, I would like to add a list of the currently supported duplex basecalling models to the README.md. An explanation of why such a model is not provided or not possible would be greatly appreciated.

Steps to reproduce the issue:

Run the following command on a run that uses R9.4.1

dorado duplex sup R9.4.1/pod5/ > output.bam

You will then be greeted with an error message saying that no model is available.

[error] Failed to get stereo duplex model
[error] No matches for chemistry: dna_r9.4.1_e8, model_variant: sup, version: v3.6.0

Run environment:

Logs

[2024-04-08 11:15:16.282] [info] > No duplex pairs file provided, pairing will be performed automatically
[2024-04-08 11:15:16.564] [debug] > Reads to process: 4000
[2024-04-08 11:15:16.685] [trace] 'sup' found variant: 'sup' and version: 'latest'
[2024-04-08 11:15:16.698] [trace] POD5: R9.4.1/pod5/PAU12345_01234567_89abcdef_0.pod5 flowcell_product_code: 'FLO-PRO002' sequencing_kit: 'sqk-lsk109' sample_rate: 4000
[2024-04-08 11:15:16.698] [trace] Searching for: chemistry: dna_r9.4.1_e8, model_variant: sup
[2024-04-08 11:15:16.698] [trace] Reject dna_r9.4.1_e8_fast@v3.4 on model type: sup
[2024-04-08 11:15:16.698] [trace] Reject dna_r9.4.1_e8_hac@v3.3 on model type: sup
[2024-04-08 11:15:16.698] [trace] Found 2 model matches:
[2024-04-08 11:15:16.698] [trace] - dna_r9.4.1_e8_sup@v3.3
[2024-04-08 11:15:16.698] [trace] - dna_r9.4.1_e8_sup@v3.6
[2024-04-08 11:15:16.698] [trace] 'sup' found variant: 'sup' and version: 'latest'
[2024-04-08 11:15:16.844] [info]  - downloading dna_r9.4.1_e8_sup@v3.6 with httplib
[2024-04-08 11:15:24.377] [trace] Searching for: chemistry: dna_r9.4.1_e8, model_variant: sup, version: v3.6.0
[2024-04-08 11:15:24.377] [trace] Found 0 model matches:
[2024-04-08 11:15:24.377] [error] Failed to get stereo duplex model
[2024-04-08 11:15:24.516] [trace] Cleaning temporary model path: /home/nanopore/basecall_benchmark/.temp_dorado_model-2807268a259e06d6
[2024-04-08 11:15:24.575] [error] No matches for chemistry: dna_r9.4.1_e8, model_variant: sup, version: v3.6.0
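The trace above suggests a two-stage lookup: the simplex model is resolved first (two `sup` matches, of which the latest is picked), and the stereo duplex model is then searched with the same chemistry, variant, and version, which yields zero matches for R9.4.1. The following Python sketch only illustrates that reading of the log. It is not Dorado's implementation: the `Model` class, the `find` and `select_duplex_models` helpers, and the `example_r10_chemistry` entries are hypothetical placeholders, and only the four R9.4.1 simplex entries are taken from the log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    chemistry: str
    variant: str           # "fast", "hac" or "sup"
    version: tuple         # e.g. (3, 6) for "v3.6"
    kind: str = "simplex"  # "simplex" or "stereo" (duplex)


# The R9.4.1 simplex entries mirror the trace log. The "example_r10_chemistry"
# entries are made-up placeholders, not real Dorado model names.
CATALOGUE = [
    Model("dna_r9.4.1_e8", "fast", (3, 4)),
    Model("dna_r9.4.1_e8", "hac", (3, 3)),
    Model("dna_r9.4.1_e8", "sup", (3, 3)),
    Model("dna_r9.4.1_e8", "sup", (3, 6)),
    Model("example_r10_chemistry", "sup", (1, 0)),
    Model("example_r10_chemistry", "sup", (1, 0), kind="stereo"),
]


def find(chemistry, variant, kind, version=None):
    """Return every catalogue entry matching the given criteria."""
    return [
        m for m in CATALOGUE
        if m.chemistry == chemistry
        and m.variant == variant
        and m.kind == kind
        and (version is None or m.version == version)
    ]


def select_duplex_models(chemistry, variant):
    """Resolve the simplex model first, then the stereo model at the same version."""
    simplex = find(chemistry, variant, "simplex")
    if not simplex:
        raise LookupError(f"No simplex model for {chemistry}, {variant}")
    latest = max(simplex, key=lambda m: m.version)  # 'latest' version wins
    stereo = find(chemistry, variant, "stereo", latest.version)
    if not stereo:
        raise LookupError(
            f"Failed to get stereo duplex model for {chemistry}, {variant}"
        )
    return latest, stereo[0]
```

Under these assumptions, `select_duplex_models("dna_r9.4.1_e8", "sup")` finds a simplex model but raises at the stereo step, which corresponds to the two `[error]` lines in the log.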

vellamike commented 5 months ago

Hi @StephDC, I'm afraid that Dorado duplex basecalling only works for R10.4.1. I understand this is likely frustrating, but the reason is that the current duplex algorithms were developed and tested specifically with R10.4.1. Our development effort is focused on continuing to improve R10 yield and accuracy.

StephDC commented 5 months ago

Thanks for the info.

By the way, to avoid future confusion, would you mind adding, or accepting a PR that adds, a model-to-kit table to the README.md, in the Available basecalling models section under RNA Models?

vellamike commented 5 months ago

Thank you for the suggestion @StephDC - this seems like a good idea - we will aim to add it for the next release.

asan-emirsaleh commented 4 months ago

A duplex basecalling approach seems promising. Many labs have produced data on R9.4.1 in the years since the pore was introduced, and I believe the amount of such data is greater than that for R10.*. It should also be noted that a meaningful part of this data has not been published yet because of poor quality and shortcomings in processing. If you update the basecalling models and introduce official native duplex basecalling models for R9.4.1 data, this research could be finished and result in publications. As a side effect, it would improve the consensus opinion on the quality of data produced by ONT instruments in the wider community of genome researchers. Thus, I don't think this should be off the priority list. One can use duplex_tools, but researchers like default solutions that work "out of the box". I still hope that one day the models for the R9.4.1 series of products will be updated.

Best regards,
Asan