nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

resume file not found #284

Closed mattloose closed 11 months ago

mattloose commented 1 year ago

Running --resume-from BAMFILE.BAM gives a filesystem error - cannot make canonical path: No such file or directory [-x]

Am I doing something wrong?

vellamike commented 1 year ago

Hi Matt, what's the full command you are running?

mattloose commented 1 year ago

I took the previous command:

dorado basecaller -x cuda:all -r dorado-0.3.1-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup\@v4.2.0  ../path/to/pod5 --modified-bases 5mCG_5hmCG > calls.bam

and switched it to:

dorado basecaller -x cuda:all -r dorado-0.3.1-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup\@v4.2.0  ../path/to/pod5 --modified-bases 5mCG_5hmCG --resume-from calls.bam > calls_continued.bam

tijyojwad commented 1 year ago

Hi @mattloose - can you post the SAM header from your first basecaller command?

tijyojwad commented 1 year ago

We have a parsing bug that doesn't handle optional arguments placed before the positional args in the original cmdline. One workaround is to fix the header manually, copy the records over unchanged, and then resume from that BAM.

Basically the CL key in the PG line in your BAM will have

dorado basecaller -x cuda:all -r dorado-0.3.1-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup\@v4.2.0

but the code is expecting it to be dorado basecaller dorado-0.3.1-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup\@v4.2.0 <data> <optional args>

something like this should do the trick

samtools view -H calls.bam > tmp.sam

Edit tmp.sam to (1) remove the PG line added by samtools view and (2) move the optional args that precede the model name in the CL key to the end. Then

samtools reheader tmp.sam calls.bam > calls_fixed.bam

after that, you can resume from calls_fixed.bam
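
Part of the header cleanup described above can be scripted. The following is a minimal sketch, assuming a POSIX shell with awk and a tab-separated SAM header; `strip_samtools_pg` is a hypothetical helper name, not part of samtools or dorado. It only removes the `@PG` record that samtools itself appends. Moving the optional arguments inside the CL value is still a manual edit, because which flags take a value depends on the original command.

```shell
# Drop @PG records that samtools appended (ID:samtools, ID:samtools.1, ...).
# Reads a SAM header on stdin and writes the filtered header to stdout.
strip_samtools_pg() {
    awk -F '\t' '
        $1 == "@PG" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^ID:samtools(\.[0-9]+)?$/) next
        }
        { print }
    '
}

# Intended use (file names taken from the commands above):
#   samtools view -H calls.bam | strip_samtools_pg > tmp.sam
#   ... edit the CL value in tmp.sam by hand ...
#   samtools reheader tmp.sam calls.bam > calls_fixed.bam
```

Recent samtools releases also accept `--no-PG` on `samtools view`, which should avoid the extra record in the first place; check `samtools view --help` for your version before relying on it.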

mattloose commented 1 year ago

OK - will test today!

PJV-Ecu commented 1 year ago

I have the same problem. Could you please explain in further detail how to do the modification to the header?

After running

samtools view -H calls.bam > tmp.sam

the header in "tmp.sam" looks like this:

@PG ID:samtools PN:samtools VN:1.17 CL:samtools view -H incomplete.bam

I modify the header like this:

@PG ID:samtools PN:samtools VN:1.17 CL:dorado basecaller dorado-0.3.1-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup\@v4.2.0 -H incomplete.bam

After trying to apply the fix, with:

samtools reheader tmp.sam calls.bam > calls_fixed.bam

I get the following error:

samtools reheader: input file 'calls.bam' must be BAM or CRAM

I think I may not be modifying the header correctly. Any suggestions? Thank you!

tijyojwad commented 1 year ago

Hi @PJV-Ecu - this issue is fixed in version 0.3.2. Could you update the binary and test again?

PJV-Ecu commented 1 year ago

Dear @tijyojwad - thanks for providing support. My first version of Dorado is 0.3.2+d8660a3. I reinstalled just in case.

The power went out in the middle of the basecalling process (to my dismay), and I have been trying to recover the invested hours with the "--resume-from" option. I am unable to fix the .bam file header with the provided instructions. Please let me know if you have any further advice. Thank you!

PJV-Ecu commented 1 year ago

I tried again from scratch and the process was killed after 2 days of processing. I cannot restart from the incomplete .bam file, as the error persists:

[2023-07-30 14:22:39.967] [error] Required key CL not found in header of calls.bam

I'm using Dorado version 0.3.2+d8660a3
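
One way to see what that error refers to is to list the program records in the BAM header and check whether any of them carries a CL field. This is a minimal sketch, assuming a POSIX shell with awk and a tab-separated SAM header; `print_pg_cl` is a hypothetical helper name.

```shell
# Print "<ID>: <CL>" for every @PG record in a SAM header read from stdin.
# Records without an ID or CL field are reported with a placeholder.
print_pg_cl() {
    awk -F '\t' '
        $1 == "@PG" {
            id = "(no ID)"; cl = "(no CL)"
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^ID:/) id = substr($i, 4)
                if ($i ~ /^CL:/) cl = substr($i, 4)
            }
            print id ": " cl
        }
    '
}

# Intended use:
#   samtools view -H calls.bam | print_pg_cl
```

If the only record listed is the one added by `samtools view` itself, the BAM has no basecaller command line recorded in its header.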

tijyojwad commented 1 year ago

Hi @PJV-Ecu - can you post your original and resume cmd? I'm trying some tests locally and I'm able to resume (simplex basecalling)

PJV-Ecu commented 1 year ago

Dear tijyojwad,

Thanks for the provided options and insights. Unfortunately, I have checked with my coauthors and they are adamant about not releasing a whole unpublished bacterial genome.

I have looked into the recommendation provided by @vellamike here:

https://github.com/nanoporetech/dorado/issues/320#issuecomment-1664709541

which consists of (literal copy):

  1. Divide the Data: Break the data into smaller POD5s that can be processed individually (Your data may already be in multiple small POD5s?), and place one or more POD5s into their own directory.
  2. Run Parallel Jobs: Execute multiple Dorado instances on these directories of POD5s, thus ensuring each job consumes a manageable amount of memory and runs for a limited time.
  3. Merge Results: Combine the resultant BAM files as needed.
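
The three steps above can be sketched as a small shell loop. This is only an illustration under stated assumptions: the POD5 files are already split into one subdirectory per batch, `basecall_batches` and the `DORADO` variable are hypothetical names, and the batches run one after another rather than in parallel. The `dorado basecaller <model> <data>` form is the one used earlier in this thread.

```shell
# Basecall each batch directory into its own BAM file.
#   $1 = model path, $2 = parent directory holding one subdirectory per batch,
#   $3 = output directory
# DORADO may point at the dorado binary; it defaults to "dorado" on PATH.
basecall_batches() {
    model=$1; parent=$2; outdir=$3
    status=0
    mkdir -p "$outdir" || return 1
    for dir in "$parent"/*/; do
        [ -d "$dir" ] || continue
        out="$outdir/$(basename "$dir").bam"
        # Skip batches that already completed in an earlier run.
        [ -e "$out" ] && continue
        # Write to a temporary name so an interrupted job never leaves
        # behind a file that looks complete.
        if "${DORADO:-dorado}" basecaller "$model" "$dir" > "$out.partial"; then
            mv "$out.partial" "$out"
        else
            echo "basecalling failed for $dir" >&2
            rm -f "$out.partial"
            status=1
        fi
    done
    return "$status"
}

# Intended use, followed by the merge step:
#   basecall_batches path/to/model pod5_batches bam_batches
#   samtools merge merged.bam bam_batches/*.bam
```

Because finished batches are skipped, rerunning the same call after a crash only repeats the batch that was interrupted.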

Is this advisable and consistent with Dorado's algorithm?

Would you advise trying version 0.3.4?

https://cdn.oxfordnanoportal.com/software/analysis/dorado/preview/dorado-0.3.4-rc1-linux-x64.tar.gz

Thank you

tijyojwad commented 1 year ago

Hi @PJV-Ecu - indeed that advice still holds true and is the recommended setup to make your basecalling runs more robust. In case any of those split runs fail, you have to resume or re-basecall a much smaller file compared to the whole dataset.

As for the resume feature, without your particular repro case it's hard to debug what's going on. Perhaps you could try resume with another unrestricted dataset, and if it doesn't work you could share that?

tijyojwad commented 11 months ago

Closing due to inactivity. @PJV-Ecu we released a new version of dorado (v0.4.0) in case you are still running into issues and want to give it a try.