nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com

Which model? #389

Closed (tnn111 closed 1 year ago)

tnn111 commented 1 year ago

I sequenced one sample using 2 R9.4.1 flow cells along with 2 R10.4 flow cells. Then I used all of the fastq files for an assembly using flye.

Is it possible to run medaka to error correct the whole thing? If so, what model should I use? Here are headers from the two types of fastq files:

@d7af3dcc-38f8-453e-8624-8f4f4a613314 runid=a6e19b866fc18cbf71e0125d1cefadc4576e48be sampleid=no_sample read=12671 ch=469 start_time=2022-09-05T02:31:28Z model_version_id=2021-05-17_dna_r9.4.1_minion_768_2f1c8637

@a0ec37d5-cf70-46c9-b259-c7a01c347e39 runid=307ec030150e0f1aad2d03701e893fe1faf0fe26 read=14 ch=2049 start_time=2022-09-09T18:56:26.495449+00:00 flow_cell_id=PAM35393 protocol_group_id=X0217 sample_id=Station32b parent_read_id=a0ec37d5-cf70-46c9-b259-c7a01c347e39 basecall_model_version_id=2021-09-03_dna_r10.4_minion_promethion_384_6b8e75c7

Thanks!
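
The basecalling model is recorded in each read header: under `model_version_id` in the R9.4.1 file and under `basecall_model_version_id` in the R10.4 file. A minimal Python sketch for pulling it out, assuming headers are whitespace-separated `key=value` tags as in the two examples above (the function names `parse_header` and `basecall_model` are hypothetical, not part of medaka):

```python
def parse_header(line):
    """Split a FASTQ header line into (read_id, {tag: value}).

    Assumes the layout seen in the headers above: '@<read_id>' followed by
    whitespace-separated key=value tags.
    """
    fields = line.strip().lstrip("@").split()
    read_id = fields[0]
    # Split on the first '=' only, so values containing '=' stay intact.
    tags = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    return read_id, tags


def basecall_model(line):
    """Return the basecall model id from either header layout, or None."""
    _, tags = parse_header(line)
    return tags.get("basecall_model_version_id") or tags.get("model_version_id")
```

Calling `basecall_model` on the first header of each fastq file shows which basecalling model produced that file.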

cjw85 commented 1 year ago

We used to experimentally support the use of multiple datatypes in medaka, but we no longer have models trained to do this.

What depth of sequencing do you have for the two flowcell types?

wn835166087 commented 1 year ago

I have a similar question: I have two batches of sequences that I intend to co-assemble. They used the same flow cell but different basecaller versions: one is 5.0.16+b9fcd7b in sup mode, the other is 5.0.11+2b6dbff in hac mode. Should I choose the model with "sup" or "hac"? Should I choose g5015 or g507? There is a model called "r104_e81_sup_g5015"; what does the e81 mean? If a model should not be used across different modes, can I assemble each batch individually first, then use medaka to correct the reads, then assemble them again? Thank you!

Ashma45 commented 1 year ago

Can you suggest which model from the existing medaka model list should be used for medaka consensus with an R10.4 flow cell, Guppy version 6.3.8, and super accurate basecalling? Is it r104_e81_sup_g610?

r104 e81 consensus

'r104_e81_fast_g5015', 'r104_e81_sup_g5015', 'r104_e81_hac_g5015',
'r104_e81_sup_g610'

Ashma45 commented 1 year ago

@wn835166087 Hello, did you get an answer to your question about what e81 means?
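
For reference, the model names listed above look like four underscore-separated fields. A small Python sketch that splits one apart, assuming the four-field layout seen in `r104_e81_sup_g5015`. The field labels are illustrative only: reading `e81` as a chemistry tag and `g5015` as a Guppy version is an assumption that nothing in this thread confirms, and `split_model_name` is a hypothetical helper, not a medaka function.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    pore: str        # e.g. "r104"
    chemistry: str   # e.g. "e81" (label is an assumption, see lead-in)
    mode: str        # e.g. "fast", "hac", "sup"
    basecaller: str  # e.g. "g5015" (assumed to encode a Guppy version)


def split_model_name(name):
    """Split a model name of the form pore_chemistry_mode_basecaller."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated fields, got {name!r}")
    return ModelName(*parts)


models = ["r104_e81_fast_g5015", "r104_e81_sup_g5015",
          "r104_e81_hac_g5015", "r104_e81_sup_g610"]

# Keep only the candidates whose mode field is "sup".
sup_models = [m for m in models if split_model_name(m).mode == "sup"]
```

Filtering this way only narrows the list by basecalling mode; picking between the remaining basecaller versions still needs guidance from the medaka documentation.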

cjw85 commented 1 year ago

@wn835166087 and @Ashma45,

You are in uncharted waters. Mixing data from two basecaller variants is not something we have tested, so we would not advise it. If possible, try to basecall all your data consistently with a single basecaller.
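
One way to check whether a read set was basecalled consistently is to tally the model ids in the read headers. A rough Python sketch, assuming an uncompressed FASTQ with four lines per record and the `model_version_id` / `basecall_model_version_id` header tags shown earlier in this thread (`tally_models` is a hypothetical helper, not part of medaka):

```python
import re
from collections import Counter
from itertools import islice

# Matches either tag spelling seen in the headers earlier in the thread.
MODEL_TAG = re.compile(r"(?:basecall_)?model_version_id=(\S+)")


def tally_models(lines):
    """Count basecall model ids over FASTQ lines.

    Assumes four lines per record, so every fourth line (starting with the
    first) is a header. Headers without a model tag are counted as "unknown".
    """
    counts = Counter()
    for header in islice(lines, 0, None, 4):
        match = MODEL_TAG.search(header)
        counts[match.group(1) if match else "unknown"] += 1
    return counts
```

Passing an open file handle (or any iterable of lines) returns a `Counter`; more than one key in the result suggests the reads came from more than one basecalling model.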