StephDC commented 1 month ago

Issue Report

Please describe the issue:

When I download a model for basecalling, dorado download fetches every model that is compatible with my pod5 data file.

Using --model sup saves some bandwidth by skipping the fast and hac models, but I cannot find any way to avoid downloading the modbase and duplex models along with the base model.

I would like to download only the model for simplex basecalling, so that when the next file comes in I can reuse that model instead of downloading it again.

Steps to reproduce the issue:

After running MinKNOW to get some pod5 files, run the following command to download the model for basecalling:

$ dorado download --data MyData/pod5/FUK_01234567_fedcba98_0.pod5 --model sup
$ ls

Instead of getting the sup model, and only the sup model, as I would when running dorado basecaller sup MyData/pod5/FUK_01234567_fedcba98_0.pod5, I got all of the following:

dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_5hmC@v1      dna_r10.4.1_e8.2_400bps_sup@v4.3.0_5mC_5hmC@v1    dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mC_5hmC@v1
dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mCG_5hmCG@v2    dna_r10.4.1_e8.2_400bps_sup@v4.3.0_5mCG_5hmCG@v1  dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1
dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mCG_5hmCG@v3.1  dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v1         dna_r10.4.1_e8.2_400bps_sup@v5.0.0_6mA@v1
dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC@v2           dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2         dna_r10.4.1_e8.2_5khz_stereo@v1.3
dna_r10.4.1_e8.2_400bps_sup@v4.2.0_6mA@v2           dna_r10.4.1_e8.2_400bps_sup@v5.0.0
dna_r10.4.1_e8.2_400bps_sup@v4.2.0_6mA@v3           dna_r10.4.1_e8.2_400bps_sup@v5.0.0_4mC_5mC@v1

Run environment:

Dorado version: v0.7.0
Dorado command: dorado download --data MyData/pod5/FUK_01234567_fedcba98_0.pod5 --model sup
Operating system: RHEL 8.9, Ubuntu 22.04
Hardware (CPUs, Memory, GPUs): Irrelevant (oVirt)
Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
Source data location (on device or networked drive - NFS, etc.): Irrelevant
Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
- Flowcell: FLO-MIN114
- Kit: SQK-RPB114-24
- Other information: Irrelevant
Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

HalfPhoton commented 1 month ago

Hi @StephDC, this is the intended behaviour.

To download a specific model, use its specific name.

dorado download --model dna_r10.4.1_e8.2_400bps_sup

StephDC commented 1 month ago

What if all I have is the pod5 file, so I cannot look up which flowcell, kit, sampling frequency, etc. it was produced with to work out which model applies, and I need to download the model and keep it for further processing?

I am processing the pod5 files as they arrive, for simplex basecalling without modbase. What I currently do is call dorado basecaller once per file. However, this means the model is downloaded and then removed once for every pod5 file I process. So I am considering running dorado download with the data file the first time I see it, keeping the downloaded model, and passing its path to the basecaller from then on to save bandwidth.

If, as you mentioned, this is the intended behavior, I am considering one of the following approaches:

Do a full basecall of the first file, and either copy the model as soon as it has been downloaded into the hidden directory and use that copy from then on, or capture and parse stdout to work out which model was used and then call dorado download to get it. A new command-line switch that keeps the model in a specified directory after the basecall would also help.

OR

Use the command I currently have to download all models, then remove every directory whose name contains two v[0-9] version tags (the modbase models) as well as the one whose name contains stereo (the duplex model). Check that only one directory remains, and use that directory as the model for all data files.
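The filtering rule described above could be sketched roughly as follows. This is only an illustration under assumptions: it selects the simplex model rather than deleting the others, it matches "@v" followed by a digit instead of a bare v[0-9], and the function names is_simplex_model and pick_simplex_model are hypothetical helpers, not part of dorado.

```shell
# Sketch only: pick the single simplex base model out of a directory of
# downloaded dorado models, based purely on the directory names.

# Succeeds when a model name has exactly one "@v<digit>" version tag and
# no "stereo" component (hypothetical helper, not a dorado command).
is_simplex_model() (
  case "$1" in *stereo*) exit 1 ;; esac
  # Count the version tags in the name.
  tags="$(printf '%s\n' "$1" | grep -o '@v[0-9]' | wc -l | tr -d '[:space:]')"
  [ "$tags" -eq 1 ]
)

# Prints the path of the simplex model directory under $1.
# Fails if there is not exactly one candidate.
pick_simplex_model() (
  count=0
  match=""
  for path in "$1"/*/; do
    [ -d "$path" ] || continue
    path="${path%/}"
    if is_simplex_model "$(basename "$path")"; then
      count=$((count + 1))
      match="$path"
    fi
  done
  [ "$count" -eq 1 ] && printf '%s\n' "$match"
)
```

With the directory listing from the original report, pick_simplex_model should print the path ending in dna_r10.4.1_e8.2_400bps_sup@v5.0.0, which can then be passed to dorado basecaller as the model path. Selecting instead of deleting keeps the other downloads around in case the name check turns out to be too strict.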

Any suggestions on which method may work better are very welcome.

HalfPhoton commented 1 month ago

I'm currently working on implementing this enhancement: https://github.com/nanoporetech/dorado/issues/681 which will let the user specify a directory in which models are searched for and into which they are downloaded if missing. So if you're happy to wait for a future release of dorado, this will be available to you without any additional work on your part.

If you need a solution today though how about this:

Dorado will not download files that already exist in the current working directory. Therefore, if you download the files once to some central location and symbolically link them all into your CWD, it will find them and use them without downloading them again. This would only be a few simple lines of bash, something like:

for MODEL in $(find /path/to/models -type d -iname "dna*"); do
  ln -s "$MODEL" /path/to/basecalling/directory/
done

StephDC commented 4 weeks ago

Ahh the "current working directory".

Now I see why I kept downloading it even though I had a copy of all the models installed with MinKNOW: I was in the same directory as the expected input file.

Maybe what I should really do is to cd into the directory containing the models, then do all of the basecalling from there referring the input and output by their path.

Just as a side note, is it possible to let guppy basecaller keep the model file instead of removing it, if it was downloaded afresh, after the basecall?

HalfPhoton commented 4 weeks ago

> Maybe what I should really do is to cd into the directory containing the models, then do all of the basecalling from there referring the input and output by their path.

Yes, that's also a sensible solution.
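A minimal sketch of that approach, assuming (as described earlier in the thread) that dorado looks for existing models in the current working directory. The helper names abs_path and basecall_from_model_dir and the DORADO override variable are hypothetical and only for illustration; the dorado basecaller sup invocation is the one from the original report, with output redirected from stdout.

```shell
# Sketch only: run dorado from the model directory while referring to the
# input and output files by absolute path.

# Print an absolute path for $1; the parent directory must already exist.
abs_path() (
  dir="$(cd "$(dirname "$1")" >/dev/null && pwd)" || exit 1
  printf '%s/%s\n' "$dir" "$(basename "$1")"
)

# Usage: basecall_from_model_dir MODEL_DIR INPUT_POD5 OUTPUT_FILE
# Runs in a subshell, so the caller's working directory is unchanged.
basecall_from_model_dir() (
  pod5="$(abs_path "$2")" || exit 1
  out="$(abs_path "$3")" || exit 1
  cd "$1" || exit 1
  # DORADO is an assumed override hook for this sketch; defaults to "dorado".
  "${DORADO:-dorado}" basecaller sup "$pod5" > "$out"
)
```

For example, basecall_from_model_dir /path/to/models MyData/pod5/FUK_01234567_fedcba98_0.pod5 calls.bam resolves both paths before changing directory, so relative paths given by the caller still point at the intended files.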

> Just as a side note, is it possible to let ~~guppy~~ dorado basecaller keep the model file instead of removing it, if it was downloaded afresh, after the basecall?

I think this will be the default behaviour of the new model directory feature 👍

I think the issue is resolved so I'm going to close this issue in favour of the original feature request in: https://github.com/nanoporetech/dorado/issues/681

If you have any more trouble please re-open this ticket or create a new one.

Happy basecalling, Rich

nanoporetech / dorado

`dorado download` get only the sup model instead of all models #856
