Closed StephDC closed 4 weeks ago
Hi @StephDC, This is the intended behaviour.
To download a specific model - use the specific name.
dorado download --model dna_r10.4.1_e8.2_400bps_sup
What if all I have is the pod5 file, so I cannot look up which flowcell, kit, sampling frequency, etc. my file was produced with to figure out which model works, and I need to download the model and keep it for further processing?
I am processing the pod5 files as they arrive for simplex basecalling without modbase. What I am currently doing is basically calling dorado basecaller once per file. However, this approach results in the model file being downloaded and removed once per pod5 file I process. I am therefore considering using dorado download with the data file when I see it for the first time to download the model, and then supplying its path to the basecaller to save some bandwidth.
If, as you mentioned, this is the intended behavior, I am currently considering one of the following approaches:
Do a full basecalling for the first file, and either make a copy of the model file as soon as it is downloaded into the hidden directory and start using this copy, or capture and parse the stdout to figure out which model was used, then call dorado download to get the model file. A new command line switch that keeps the model file in a specified directory after the basecall may also help.
OR
Use the command I currently have to download all models, and remove everything that contains two v[0-9] as well as the one that contains stereo. Check if only one directory remains, and proceed with that directory as the model for all data files.
Any suggestions on which method may work better are greatly welcomed.
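The second option could be sketched roughly as follows. This is only an illustration under the naming assumptions stated above (modbase model names contain two v[0-9] version tags, duplex model names contain stereo); pick_simplex_models is a hypothetical helper name, not part of dorado, and it only lists the candidates rather than deleting anything.

```shell
# Print the names of candidate simplex model directories found in "$1".
# Assumption (from the discussion above, not from dorado documentation):
#   - modbase models have two "v<digit>" version tags in their name
#   - duplex models contain "stereo" in their name
pick_simplex_models() {
    dir=$1
    for path in "$dir"/*/; do
        # Skip the unexpanded glob when the directory is empty.
        [ -d "$path" ] || continue
        name=$(basename "$path")
        case $name in
            *stereo*) continue ;;               # duplex (stereo) model
            *v[0-9]*v[0-9]*) continue ;;        # two version tags: modbase model
        esac
        printf '%s\n' "$name"
    done
}
```

A caller would then check that exactly one name is printed, for example with `[ "$(pick_simplex_models /path/to/models | wc -l)" -eq 1 ]`, before using that directory as the model for all data files.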
I'm currently working on implementing this enhancement: https://github.com/nanoporetech/dorado/issues/681 which will add an option for the user to specify a directory in which models will be searched for and into which they will be downloaded (if they're missing). So if you're happy to wait for a future release of dorado, this will be available to you without any additional work on your part.
If you need a solution today, though, how about this:
Dorado will not download files that already exist in the current working directory. Therefore, if you download the files once to some central location and symbolically link them all into your CWD, it will find them and use them without downloading them again. This would only be a few simple lines of bash, something like:
# find takes single-dash options (-type, -iname); quote paths in case they contain spaces
for MODEL in $(find /path/to/models -type d -iname "dna*"); do
    ln -s "$MODEL" /path/to/basecalling/directory/
done
Ahh the "current working directory".
I see now why it kept downloading the model even though I had a copy of all models installed with MinKNOW. I was in the same directory as the expected input file.
Maybe what I should really do is to cd into the directory containing the models, then do all of the basecalling from there, referring to the input and output by their paths.
Just as a side note, is it possible to let guppy basecaller keep the model file instead of removing it, if it was downloaded afresh, after the basecall?
Maybe what I should really do is to cd into the directory containing the models, then do all of the basecalling from there, referring to the input and output by their paths.
Yes, that's also a sensible solution.
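A minimal sketch of that approach, assuming the models already sit in one directory and that dorado looks for them in the current working directory as described above. The function name basecall_from_model_dir and the DORADO variable (a way to point at a different dorado binary) are hypothetical names introduced for illustration; the basecaller invocation itself is the one from the issue report.

```shell
# Run simplex basecalling from inside the models directory so that an
# already-downloaded model is found in the current working directory.
# Usage: basecall_from_model_dir MODELS_DIR INPUT_POD5 OUTPUT_FILE
basecall_from_model_dir() {
    models_dir=$1
    # Resolve paths before changing directory, since relative paths
    # would otherwise be interpreted relative to MODELS_DIR.
    input=$(realpath "$2")
    output="$(realpath "$(dirname "$3")")/$(basename "$3")"
    # The subshell keeps the caller's working directory unchanged.
    # dorado basecaller writes its output to stdout, hence the redirect.
    ( cd "$models_dir" && "${DORADO:-dorado}" basecaller sup "$input" > "$output" )
}
```

For example, `basecall_from_model_dir /path/to/models MyData/pod5/FUK_01234567_fedcba98_0.pod5 out/calls.bam` could be run once per incoming file; the output path here is only a placeholder.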
Just as a side note, is it possible to let dorado basecaller keep the model file instead of removing it, if it was downloaded afresh, after the basecall?
I think this will be the default behaviour of the new model directory feature 👍
I think the issue is resolved so I'm going to close this issue in favour of the original feature request in: https://github.com/nanoporetech/dorado/issues/681
If you have any more trouble please re-open this ticket or create a new one.
Happy basecalling, Rich
Issue Report
Please describe the issue:
When I want to download a model for basecalling, dorado download gets all models that can work with my pod5 data file. Using --model sup can save me some bandwidth by not downloading the fast and hac models, but I cannot find any method to avoid downloading the modbase models and duplex models along with the base model.
I would like to only download the model file for simplex basecalling, so that whenever the next file comes in, I can use that model file to basecall it instead of downloading the model one more time.
Steps to reproduce the issue:
After running MinKNOW to get some pod5 files, run the following command to download the model for basecalling:
dorado download --data MyData/pod5/FUK_01234567_fedcba98_0.pod5 --model sup
Then, instead of getting the sup model, and only the sup model, for basecalling as if I was just doing dorado basecaller sup MyData/pod5/FUK_01234567_fedcba98_0.pod5, I got all of the below:
Run environment:
Dorado version: v0.7.0
Dorado command: dorado download --data MyData/pod5/FUK_01234567_fedcba98_0.pod5 --model sup
Operating system: RHEL 8.9, Ubuntu 22.04