nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com

medaka_haploid_variant model for guppy v 6.4.8 #438

Closed katefgit closed 8 months ago

katefgit commented 1 year ago

Which model from the following options should I select when I have basecalled using the newest guppy version for an R10.4.1 run with 400bps hac settings?

-m medaka model, (default: r1041_e82_400bps_sup_variant_v4.2.0). Choices: r103_fast_variant_g507 r103_hac_variant_g507 r103_prom_variant_g3210 r103_sup_variant_g507 r1041_e82_260bps_fast_variant_g632 r1041_e82_260bps_hac_variant_g632 r1041_e82_260bps_hac_variant_v4.1.0 r1041_e82_260bps_sup_variant_g632 r1041_e82_260bps_sup_variant_v4.1.0 r1041_e82_400bps_fast_variant_g615 r1041_e82_400bps_fast_variant_g632 r1041_e82_400bps_hac_variant_g615 r1041_e82_400bps_hac_variant_g632 r1041_e82_400bps_hac_variant_v4.1.0 r1041_e82_400bps_hac_variant_v4.2.0 r1041_e82_400bps_sup_variant_g615 r1041_e82_400bps_sup_variant_v4.1.0 r1041_e82_400bps_sup_variant_v4.2.0 r104_e81_fast_variant_g5015 r104_e81_hac_variant_g5015 r104_e81_sup_variant_g610 r941_e81_fast_variant_g514 r941_e81_hac_variant_g514 r941_e81_sup_variant_g514 r941_min_fast_variant_g507 r941_min_hac_variant_g507 r941_min_sup_variant_g507 r941_prom_fast_variant_g507 r941_prom_hac_variant_g507 r941_prom_sup_variant_g507 r941_prom_variant_g303 r941_prom_variant_g322 r941_prom_variant_g360 r941_sup_plant_variant_g61.

cjw85 commented 1 year ago

r1041_e82_400bps_hac_variant_v4.1.0

The model nomenclature has changed recently, with the final part of the medaka model name relating to the basecaller model version rather than the Guppy version number. You will need to look at the Guppy changelogs to keep track of these.
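As a rough illustration of the naming change described above, the sketch below splits a medaka model name at its final underscore and classifies the suffix. It assumes, based only on the names in the choices list, that a suffix starting with `g` tags a Guppy version and one starting with `v` tags a basecaller model version. `classify_model_suffix` is a hypothetical helper, not part of medaka.

```python
def classify_model_suffix(model_name: str) -> tuple[str, str]:
    """Return (suffix, kind) for a medaka model name.

    Assumption: 'g...' suffixes tag a Guppy version and 'v...' suffixes
    tag a basecaller model version, as the names in this thread suggest.
    """
    # The version tag is whatever follows the last underscore.
    _, _, suffix = model_name.rpartition("_")
    if suffix.startswith("v"):
        kind = "basecaller model version"
    elif suffix.startswith("g"):
        kind = "guppy version"
    else:
        kind = "unknown"
    return suffix, kind


for name in ("r1041_e82_400bps_hac_variant_v4.1.0", "r941_min_hac_variant_g507"):
    print(name, "->", classify_model_suffix(name))
```

This only tells the two naming schemes apart; it does not say which basecaller release produced a given model version, which still has to come from the Guppy changelogs.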

katefgit commented 1 year ago

Thanks for the comment. Where in the guppy log file can I find this info? And what's the difference between v4.1.0 and v4.2.0?

[image]

cjw85 commented 1 year ago

The version 4.2 models are for the 5kHz sampling rate announced recently at London Calling.

nextgenusfs commented 1 year ago

@cjw85 is it possible to put a table or something similar in the README showing which guppy basecalling models correspond to which medaka models? If I'm honest, the difference in the naming schemes is annoying. Effectively it requires me to hardcode a lookup in some of our automation code to map the guppy model file to the one to use for medaka, and knowing that v4.2 == 5kHz isn't inherently obvious. I'm not sure of the easiest general solution for this, but it would be helpful if this were at least documented in an obvious place in the README, where it could be updated when a new release is pushed. Ideally it would be nice to be able to pass the guppy model file to medaka and have it choose the appropriate model, i.e. --guppy-model template_r10.4.1_e8.2_400bps_5khz_hac.jsn, which would use the r1041_e82_400bps_hac_v4.2.0 model. Or possibly reuse the --model option to automatically look up the proper model if a guppy model is passed.
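The kind of hardcoded lookup described above might look like the following minimal sketch. The single table entry is the pairing given in this comment. `GUPPY_TO_MEDAKA` and `medaka_model_for` are hypothetical names, the directory in the example call is only illustrative, and a real table would have to be filled in from the Guppy changelogs or the medaka README.

```python
from pathlib import PurePosixPath

# Hypothetical lookup table; the only entry is the pairing quoted in this thread.
GUPPY_TO_MEDAKA = {
    "template_r10.4.1_e8.2_400bps_5khz_hac.jsn": "r1041_e82_400bps_hac_v4.2.0",
}


def medaka_model_for(guppy_model_path: str) -> str:
    """Map a Guppy model file (path or bare file name) to a medaka model name."""
    file_name = PurePosixPath(guppy_model_path).name
    try:
        return GUPPY_TO_MEDAKA[file_name]
    except KeyError:
        # Fail loudly instead of guessing a model for an unlisted file.
        raise ValueError(f"no medaka model recorded for {file_name!r}") from None


# Illustrative path only; the directory is an assumption.
print(medaka_model_for("/opt/ont/guppy/data/template_r10.4.1_e8.2_400bps_5khz_hac.jsn"))
```

Raising an error for unlisted files, rather than deriving a name from the file name, avoids silently picking the wrong model when the sampling rate is not visible in the name.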

yzhang-github-pub commented 1 year ago

Totally agree @nextgenusfs. Medaka, guppy and other tools are excellent, and training the models takes a lot of effort. But the potential impact of the tools and models is greatly reduced if downstream users cannot pick the right models because the documentation is missing or insufficient.

katefgit commented 10 months ago

Hi again,

I am coming back to this because I can't figure out from my guppy logs whether the 5khz or 4khz model has been used (they only show /opt/ont/guppy/data/dna_r10.4.1_e8.2_400bps_hac.cfg). Is the 5khz the default? Could it be because in the lab we often run basecalling using MinKNOW and therefore don't have control over the specific guppy model? I guess that selecting [image] is the 5khz option matching v4.2, since there is another option with 4khz.

[image]

cjw85 commented 10 months ago

@katefgit your data is 5kHz.

cjw85 commented 10 months ago

@nextgenusfs

I agree this has become a complete mess, particularly as we are transitioning from Guppy to dorado. This table may be of use, though it's incomplete as it doesn't take account of the fact that some models were updated across Guppy versions (it is correct provided a "recent" version of Guppy is used, but don't ask me to quantify "recent"!).

The original naming of medaka models was intended to provide an interface logically equivalent to your --guppy-model idea. The issue is that Guppy models were not historically versioned separately from Guppy itself, hence the somewhat obtuse guidance in the README.

Dorado introduced a cleaner notion of model version, which helps as there is a more direct correspondence between the dorado model names and medaka model names.

cjw85 commented 8 months ago

Automatic model selection has been implemented in medaka v1.11.0, which should mostly alleviate these difficulties. See this section of the README.