nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Specifying full path to Dorado basecaller model #928

Closed jennycmuscat closed 2 weeks ago

jennycmuscat commented 3 weeks ago

I am trying to run Dorado to identify 5mC methylations, but only have success when the full path to the basecaller model dna_r10.4.1_e8.2_400bps_hac@v5.0.0 is specified (from root). The command written below results in the error "Cannot find modification model for '5mC_5hmC' reason: simplex model doesn't exist", whilst when the full path (/home/.../software/dna_r10.4.1_e8.2_400bps_hac@v5.0.0) is used, no error occurs and it runs fine.

This was also attempted after saving the model dna_r10.4.1_e8.2_400bps_hac@v5.0.0 in a location included in my system's PATH environment variable, but resulted in the same issue. The same occurs when saving the model in the same location as dorado-0.7.1-linux-x64 too. The model simply does not seem to be found when running dorado when its full path is not specified.

Is there a way to not have to specify the full path to the basecaller model to run the command below?

Steps to reproduce the issue:

Running from a directory containing the software directory with the specified Dorado model:

dorado basecaller software/dna_r10.4.1_e8.2_400bps_hac@v5.0.0 pod5_pass/barcode --reference sample.fa --verbose --batchsize 64 --device cuda:0 --modified-bases 5mC_5hmC > reads.bam

Run environment:

Logs

  terminate called after throwing an instance of 'std::runtime_error'
    what():  Cannot find modification model for '5mC_5hmC' reason: simplex model doesn't exist at: software/dna_r10.4.1_e8.2_400bps_hac@v5.0.0

HalfPhoton commented 2 weeks ago

Hi @jennycmuscat, apologies for the delay.

This is unusual - can you tell me if software/dna_r10.4.1_e8.2_400bps_hac@v5.0.0 is a symbolic link in any way?

Dorado should find models in the current working directory. So placing this model there should work.

Note: The model search behaviour is changing in a future release with the addition of the --models-directory CLI argument. This will be described in more detail in the release note.

Kind regards, Rich

jennycmuscat commented 2 weeks ago

No, it is not a symbolic link. The software directory is in my current working directory and contains the software I am running, including Dorado and the associated basecaller model dna_r10.4.1_e8.2_400bps_hac@v5.0.0. I have added this directory to my PATH variable, and have also tried running the command with the model in my current working directory (rather than in the software directory), but have not had any luck either way.

Is it required that dna_r10.4.1_e8.2_400bps_hac@v5.0.0 and dorado-0.7.1-linux-x64 are saved in the same directory? Could I get more specifics on this? I have tried that too, but so far without success.

Thank you for the help!

HalfPhoton commented 2 weeks ago

Adding the model directory to the PATH variable will have no effect - this isn't how the model search is implemented.

No, the model does not need to be in the software directory, but it will be found if it's in the current working directory.

Can you also download the dna_r10.4.1_e8.2_400bps_hac@v5.0.0_5mCG_5hmCG@v1 model so your working directory looks like:

dna_r10.4.1_e8.2_400bps_hac@v5.0.0/
dna_r10.4.1_e8.2_400bps_hac@v5.0.0_5mCG_5hmCG@v1/
pod5_pass/
sample.fa
software/dorado-0.7.1-linux-x64/dorado

Running the following in this directory should work:

./software/dorado-0.7.1-linux-x64/dorado basecaller dna_r10.4.1_e8.2_400bps_hac@v5.0.0 pod5_pass/ --reference sample.fa --batchsize 64 --device cuda:0 --modified-bases 5mC_5hmC > reads.bam
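
As a quick sanity check before launching the command above, the layout can be verified with a few lines of shell. This is only a sketch: `check_models` is a hypothetical helper, not part of Dorado, and it assumes nothing beyond "each named model must be a directory in the current working directory". It does not call Dorado or reproduce its model search.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of Dorado).
# Usage: check_models MODEL_DIR_NAME [MODEL_DIR_NAME ...]
# Reports whether each name is a directory in the current working directory.
# Note: `-d` follows symbolic links, so a symlinked model directory passes.
check_models() {
    local missing=0
    local model
    for model in "$@"; do
        if [ -d "./${model}" ]; then
            echo "found: ${model}"
        else
            echo "missing: ${model}" >&2
            missing=1
        fi
    done
    # Exit status 0 only when every named directory was found.
    return "${missing}"
}
```

Called from the working directory shown above as `check_models "dna_r10.4.1_e8.2_400bps_hac@v5.0.0" "dna_r10.4.1_e8.2_400bps_hac@v5.0.0_5mCG_5hmCG@v1"`, it returns a non-zero status and names whichever directory is absent.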

jennycmuscat commented 2 weeks ago

I have realised that the issue I am having only occurs when running the command in a Nextflow pipeline. The command you specify does indeed work on the command line - but the current working directory is no longer recognised in Nextflow. I understand that this is an issue beyond Dorado, so I will keep using the full path when running the command in Nextflow. Regardless, thank you for your help.

HalfPhoton commented 2 weeks ago

> but the current working directory is no longer recognised in Nextflow

The CWD for a Nextflow process contains only what is specified in its inputs, so you'll need to add the model there so that it is staged into the CWD for each job. The staged model will be a symbolic link (by default), but that shouldn't matter.

This will be made easier in a future release with the addition of --models-directory.
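
To illustrate the staging behaviour described above, here is a rough sketch in plain shell rather than Nextflow. `stage_into` is a hypothetical helper that mimics an isolated task directory: a model that exists in the launch directory is not visible by its relative name inside the task directory until it is linked in. It is an analogy for what staging an input does, not Nextflow's actual implementation.

```shell
#!/usr/bin/env bash
# Hypothetical helper that mimics staging an input into a task directory.
# Usage: stage_into TASK_DIR SOURCE_PATH
# Creates TASK_DIR if needed and symlinks SOURCE_PATH into it under its base name.
stage_into() {
    local task_dir="$1"
    local source_path="$2"
    local abs_source
    # Resolve to an absolute path so the link stays valid from inside TASK_DIR.
    abs_source="$(cd "$(dirname "${source_path}")" && pwd)/$(basename "${source_path}")"
    mkdir -p "${task_dir}"
    ln -s "${abs_source}" "${task_dir}/$(basename "${source_path}")"
}
```

After `stage_into work_dir dna_r10.4.1_e8.2_400bps_hac@v5.0.0` (with `work_dir` as a placeholder name), a command run from inside `work_dir` can refer to the model by its bare directory name, which is the situation the relative path in the basecaller command relies on.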

All the best, Rich