nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Order of positional arguments and options matters when specifying modified basecall models for downloaded models #744

Closed janpb closed 2 months ago

janpb commented 2 months ago

Issue Report

The position of the --modified-bases option matters when using downloaded models: if it is placed before the positional arguments, dorado aborts.

This behavior is unexpected and is not reflected in the dorado usage text.

Steps to reproduce the issue:

Failing examples:

$: dorado duplex -r ../models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0 --modified-bases 5mCG_5hmCG ../data
[2024-04-15 16:05:19.638] [info] Running: "duplex" "-r" "../models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0" "--modified-bases" "5mCG_5hmCG" "../data"
[2024-04-15 16:05:19.639] [error] '../data' is not a supported modification please select from 6mA, 5mC, m6A_DRACH, 5mCG_5hmCG, 5mCG, 5mC_5hmC
$: dorado duplex -r  --modified-bases "5mCG_5hmCG"  ../models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0 ../data/
[2024-04-15 16:23:43.337] [info] Running: "duplex" "-r" "--modified-bases" "5mCG_5hmCG" "../models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0" "../data/"
[2024-04-15 16:23:43.339] [error] '../models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0' is not a supported modification please select from 6mA, 5mC, m6A_DRACH, 5mCG_5hmCG, 5mCG, 5mC_5hmC

Working example:

$: dorado duplex -r  ../models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0 ../data/ --modified-bases 5mCG_5hmCG
[2024-04-15 15:55:48.975] [info] Running: "duplex" "-r" "../models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0" "../data/" "--modified-bases" "5mCG_5hmCG"
[2024-04-15 15:55:48.976] [info] > No duplex pairs file provided, pairing will be performed automatically
[2024-04-15 15:56:13.093] [info] cuda:0 using chunk size 9996, batch size 128
[2024-04-15 15:56:59.007] [info] cuda:0 using chunk size 10000, batch size 64
[2024-04-15 15:57:01.592] [info] > Starting Stereo Duplex pipeline

Run environment:

Logs

see examples above

HalfPhoton commented 2 months ago

Hi @janpb, thanks for raising this issue. We are aware that users experience issues with the order of arguments. The cause is that --modified-bases accepts a variable number of arguments, so it consumes any positional arguments that directly follow it, as your three examples demonstrate:

dorado duplex -r $model --modified-bases $mods $data # $data is consumed
dorado duplex -r --modified-bases $mods $model $data # $model and $data are consumed
dorado duplex -r $model $data --modified-bases $mods # works
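
To make the greedy behaviour concrete, here is a minimal toy parser in Python that mimics it. It is only a sketch of the general mechanism, not dorado's actual argument parser: the function name `parse_args` and the rule "a variadic option takes every token up to the next token starting with `-`" are assumptions made for illustration.

```python
from typing import Dict, List

# Illustrative assumptions: which options are variadic and which are plain flags.
VARIADIC_OPTIONS = {"--modified-bases"}  # take one or more values
FLAG_OPTIONS = {"-r"}                    # take no value


def parse_args(argv: List[str]) -> Dict[str, List[str]]:
    """Toy greedy parser: a variadic option swallows all following
    tokens until the next token that starts with '-'."""
    parsed: Dict[str, List[str]] = {
        "positionals": [],
        "flags": [],
        "--modified-bases": [],
    }
    i = 0
    while i < len(argv):
        token = argv[i]
        if token in VARIADIC_OPTIONS:
            i += 1
            # Greedy step: everything that is not an option becomes a value.
            while i < len(argv) and not argv[i].startswith("-"):
                parsed[token].append(argv[i])
                i += 1
        elif token in FLAG_OPTIONS:
            parsed["flags"].append(token)
            i += 1
        else:
            parsed["positionals"].append(token)
            i += 1
    return parsed


model = "../models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0"
data = "../data"
mods = "5mCG_5hmCG"

# Option between the positionals: the data path is taken as a modification.
print(parse_args(["-r", model, "--modified-bases", mods, data]))
# Option before both positionals: model and data are both taken as modifications.
print(parse_args(["-r", "--modified-bases", mods, model, data]))
# Option after the positionals: only the intended value is consumed.
print(parse_args(["-r", model, data, "--modified-bases", mods]))
```

In the first two calls the paths end up in the `--modified-bases` list, which is consistent with the "is not a supported modification" errors in the report; in the third call both positionals are recognised.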

None of the examples in the README place options before the positional arguments, but we can work to make this clearer in the documentation.
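
For anyone who launches dorado from a script, one way to avoid the ordering problem is to build the argument list so that the positionals always come first. The helper below is a hypothetical sketch: `build_duplex_command` is not part of dorado, and it uses only the subcommand and options that appear in this issue (`duplex`, `-r`, `--modified-bases`).

```python
from typing import List, Sequence


def build_duplex_command(model: str, data: str, mods: Sequence[str] = ()) -> List[str]:
    """Return an argv list with positionals first and the variadic option last."""
    cmd = ["dorado", "duplex", model, data, "-r"]
    if mods:
        # The variadic option goes last so it cannot swallow model or data.
        cmd += ["--modified-bases", *mods]
    return cmd


cmd = build_duplex_command(
    "../models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0",
    "../data",
    ["5mCG_5hmCG"],
)
print(" ".join(cmd))
# The list could then be passed to subprocess.run(cmd, check=True).
```

Passing the list to `subprocess.run` without a shell also avoids quoting problems with paths that contain spaces.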

We now recommend that users use the automatic model complex, which is simpler and more concise.

In your examples, this would be:

dorado duplex sup,5mCG_5hmCG ../data -r

Kind regards, Rich