nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

dorado correct #838

Open youngnd opened 1 month ago

youngnd commented 1 month ago

Issue Report

Please describe the issue:

I downloaded dorado 0.7.0 and tried to use the correct option:

./dorado correct
[2024-05-24 09:46:55.603] [info] Running: "correct"
[2024-05-24 09:46:55.604] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-05-24 09:46:55.607] [info]  - downloading herro-v1 with httplib
Segmentation fault (core dumped)

I know that herro recently switched their web address for model downloads. Could it be something to do with this?

Please provide a clear and concise description of the issue you are seeing and the result you expect.

I expected the correct options to be displayed so I could error-correct my reads with herro after basecalling my 10.4 flow cell data.

Steps to reproduce the issue:

Downloaded the software twice and re-ran in user and admin mode with the same outcome.

sudo ./dorado correct
[2024-05-24 09:46:55.603] [info] Running: "correct"
[2024-05-24 09:46:55.604] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-05-24 09:46:55.607] [info]  - downloading herro-v1 with httplib

Please list any steps to reproduce the issue.

Run environment:

Logs

sudo ./dorado correct
[2024-05-24 09:46:55.603] [info] Running: "correct"
[2024-05-24 09:46:55.604] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-05-24 09:46:55.607] [info]  - downloading herro-v1 with httplib

tijyojwad commented 1 month ago

Hi @youngnd - is dorado correct your full command? You need to pass in a fastq file to correct as well.

dorado correct <reads.fastq(.gz)>

We will add better error checking to catch and report this.

colindaven commented 1 month ago

Thanks for adding this to dorado, I'm hoping for greatly improved results.

I'm getting issues with the download too. I'll add curl to the container (I have it on my server but not in the container), but I imagine it will fail due to the rabid proxy.

So please give us a direct download link for the herro model. Then we can use the dorado correct argument to link to the downloaded model path, which will likely work well on a cluster.

Thanks !

sing exec image.sif dorado correct
[2024-05-24 14:01:57.876] [info] Running: "correct"
[2024-05-24 14:01:57.900] [info]  - downloading herro-v1 with httplib
[2024-05-24 14:03:18.002] [error] Failed to download herro-v1: Could not establish connection
[2024-05-24 14:03:18.003] [info]  - downloading herro-v1 with curl
sh: 1: curl: not found
[2024-05-24 14:03:18.007] [error] Failed to download herro-v1: ret=32512, errno=0
[2024-05-24 14:03:18.007] [error] Could not download model: herro-v1

Edit: after adding curl to the container, I get a segfault and a failed-connection message, but the file seems to be there. Is this OK?

27M │ ├── herro.pt

sing exec image.sif dorado correct
[2024-05-24 15:20:36.174] [info] Running: "correct"
[2024-05-24 15:20:36.192] [info]  - downloading herro-v1 with httplib
[2024-05-24 15:21:56.289] [error] Failed to download herro-v1: Could not establish connection
[2024-05-24 15:21:56.289] [info]  - downloading herro-v1 with curl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22.3M  100 22.3M    0     0  3469k      0  0:00:06  0:00:06 --:--:-- 3501k
Segmentation fault (core dumped)

~/programs/herro$ dust
  32K   ┌── .temp_dorado_model-3fecf49c731bcdaf │   0%
  24K   │   ┌── config.toml                     │   0%
  27M   │   ├── herro.pt                        │  50%
  27M   │ ┌─┴ herro-v1                          │  50%
  27M   ├─┴ .temp_dorado_model-6a608efaca49cf7a │  50%
  24K   │   ┌── config.toml                     │   0%
  27M   │   ├── herro.pt                        │  50%
  27M   │ ┌─┴ herro-v1                          │  50%
  27M   ├─┴ .temp_dorado_model-dcca95896244d3b7 │  50%
  55M ┌─┴ .

tijyojwad commented 1 month ago

Hi @colindaven - the model has now downloaded correctly.

You need to run dorado correct with a path to a reads.fastq file to correct.

Note that dorado correct doesn't work with piped data - it needs to be given a FASTX file with reads.

youngnd commented 1 month ago

Thanks to ON support (Steven), this worked for me:

dorado download --model herro-v1

And then specifying the model path manually as an argument (-m model-path) when you use the command:

dorado correct -m herro-v1 reads.fastq.gz > corrected.fasta

Note: It was unable to index gzipped files prepared using pigz (default). It needs bgz files, apparently.
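The workaround above can be sketched as one shell function. This is only an illustration: the function name and the recompressed file name are made up, it assumes dorado and htslib's bgzip are on PATH, and it assumes dorado download places the model in a herro-v1 directory under the current directory, as the -m herro-v1 argument above suggests.

```shell
# Sketch of the workaround above. Function name and output naming are hypothetical.
# Assumes `dorado` and htslib's `bgzip` are on PATH.
correct_with_local_model() {
    reads="$1"            # gzip- or pigz-compressed FASTQ
    out="$2"              # corrected FASTA to write
    model_dir="herro-v1"  # assumed location of the downloaded model

    # Fetch the model once instead of relying on the download inside `dorado correct`.
    [ -d "$model_dir" ] || dorado download --model herro-v1 || return 1

    # Recompress as BGZF, since the pigz-compressed input could not be indexed.
    bgz="${reads%.gz}.bgzf.gz"
    gzip -dc "$reads" | bgzip -c > "$bgz" || return 1

    # Point dorado at the local model directory and the recompressed reads.
    dorado correct -m "$model_dir" "$bgz" > "$out"
}
```

Example call: correct_with_local_model reads.fastq.gz corrected.fasta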

tijyojwad commented 1 month ago

That's a good point. Will clarify in the docs that the input file needs to be bgzipped.
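Until the docs are updated, a quick way to tell whether a .gz file is bgzipped is to look at its header. A BGZF file is a gzip file whose first block has the FEXTRA flag set and carries a "BC" extra subfield. The check below is a rough sketch with a made-up function name, and it assumes the "BC" subfield comes first in the extra field, which is how bgzip writes it.

```shell
# Returns 0 if the file starts with a BGZF block header, 1 otherwise.
# Hypothetical helper; assumes the "BC" subfield is the first extra subfield.
is_bgzf() {
    # First 14 bytes as lowercase hex with no separators.
    hdr=$(head -c 14 "$1" | od -An -v -tx1 | tr -d ' \n')
    case "$hdr" in
        # 1f8b = gzip magic, 08 = deflate, 04 = FEXTRA flag,
        # then 16 hex digits (MTIME, XFL, OS, XLEN), then 42 43 = "BC".
        1f8b0804????????????????4243*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Example: is_bgzf reads.fastq.gz && echo "bgzipped" || echo "not bgzipped"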

fen2323 commented 3 weeks ago

I am also getting an index error: [error] Could not create/load index for FASTx file. Below is the code I am running:

#!/bin/bash

set -e

herro_model=~/dorado-0.7.0-linux-x64/bin/herro-v1
input=/data/AQ924/dorado_v5/output/AQ924_v5.fa.bgz
output=/data/AQ924/dorado_clean/output/AQ924_clean.fasta
log=/data/AQ924/dorado_clean/output/AQ924_clean.log

mkdir -p /data/AQ924/dorado_clean/output
echo "Running Dorado"
nohup ~/dorado-0.7.0-linux-x64/bin/dorado correct -m $herro_model $input -v > $output 2> $log &

I started by re-basecalling the pod5s for this project using the newest dorado v5 release, with output in BAM format. I then converted the BAM to FASTA using:

samtools fasta -c 6 AQ924_v5.bam > AQ924_v5.fa.bgz

Any help would be greatly appreciated.

tijyojwad commented 3 weeks ago

Hi @fen2323, is the folder containing the input writable by the process?

tijyojwad commented 3 weeks ago

Can you also try running dorado correct on the fastq without compression?

fen2323 commented 3 weeks ago

@tijyojwad Yes the folder is writable. After re-running dorado basecaller with the emit fastq tag, it is now working. It will not work on the fastq file that was converted from the original basecaller .bam output.

fen2323 commented 3 weeks ago

@tijyojwad Another question, using the newest version of dorado, should I redo basecalling for both the passed and failed pod5 files? Will it then automatically perform the filtering based on quality of the basecalling model I select? Or should I just stick to the pod5 passed from the original run?

tijyojwad commented 3 weeks ago

@fen2323

It will not work on the fastq file that was converted from the original basecaller .bam output.

hmm interesting, we haven't tested this path. I find that surprising though - is it not working on the uncompressed fastq or the compressed one (what you reported in your script)?

Dorado doesn't do any default filtering on Q scores. If the goal is to rescue more reads from the original dataset (i.e. convert some fail reads to pass reads), then I would run basecalling on the whole dataset and set your own filtering parameters for Dorado to use. But if you just want to use the pass reads, then rebasecalling just that folder should be enough.
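A rough sketch of rebasecalling with your own Q-score cutoff is below. The function name is made up, and it assumes the --min-qscore and --emit-fastq flags of dorado basecaller behave as in dorado 0.7; check dorado basecaller --help for your version. The model, folder and threshold are placeholders.

```shell
# Hypothetical wrapper: basecall a pod5 folder and keep only reads at or above a mean Q-score.
# Assumes `dorado basecaller` accepts --min-qscore and --emit-fastq (verify with --help).
basecall_with_min_qscore() {
    model="$1"      # model name or path
    pod5_dir="$2"   # folder with pass and/or fail pod5 files
    min_q="$3"      # your own cutoff; dorado applies none by default
    out_fastq="$4"  # FASTQ written directly by dorado
    dorado basecaller "$model" "$pod5_dir" --min-qscore "$min_q" --emit-fastq > "$out_fastq"
}
```

Example (the threshold of 10 is arbitrary): basecall_with_min_qscore sup pod5_all/ 10 reads.fastq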

fen2323 commented 3 weeks ago

@tijyojwad

@fen2323

It will not work on the fastq file that was converted from the original basecaller .bam output.

hmm interesting, we haven't tested this path. I find that surprising though - is it not working on the uncompressed fastq or the compressed one (what you reported in your script)?

It will only work if the fastq is the direct output of dorado (using the emit tag). I tried both the fastq and compressed fastq and got the same index error on both. I tried both bedtools and samtools to convert from bam to fastq.

Thank you for answering my other question too.

tijyojwad commented 3 weeks ago

Hi @fen2323

I've been able to do the following and generated a corrected output

$ dorado basecaller <model> <pod5> > output.bam
$ samtools bam2fq output.bam > output.fastq
$ dorado correct output.fastq > corrected.fastq

And I just realized the issue with your command: dorado correct needs a fastq, not a fasta (our documentation needs to be updated to reflect that). You converted with samtools fasta, so you'll need to use samtools fastq instead:

 samtools fastq -c 6 AQ924_v5.bam > AQ924_v5.fq.bgz

(although when I ran it I just got a fastq, not a compressed fastq)
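Since both problems here (FASTA instead of FASTQ, and a .bgz name on a file that may not be compressed) are easy to miss, a small pre-flight check can help before running dorado correct. This is a sketch with a made-up function name: it looks at the actual bytes rather than the file extension.

```shell
# Hypothetical helper: print "fastq", "fasta" or "unknown" for a FASTX file,
# based on content rather than file name.
fastx_kind() {
    # gzip (and therefore BGZF) files start with bytes 1f 8b.
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        first=$(gzip -dc < "$1" 2>/dev/null | head -c 1)
    else
        first=$(head -c 1 "$1")
    fi
    # FASTQ records start with "@", FASTA records with ">".
    case "$first" in
        "@") echo fastq ;;
        ">") echo fasta ;;
        *)   echo unknown ;;
    esac
}
```

Example: fastx_kind AQ924_v5.fa.bgz prints fasta for a FASTA file, whatever its extension, which would flag the file before it reaches dorado correct.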

fen2323 commented 3 weeks ago

@tijyojwad

I have some more data to run, I will try it again as you have shown and see if I can get it to work. Thank you

colindaven commented 3 weeks ago

I can get dorado correct working now with the new version and with dorado in a singularity container with jobs started by nextflow. I was having trouble since I forgot the --nv parameter to allow singularity to access the gpu, but all good now. Thanks for packaging herro up in dorado.
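For anyone hitting the same container problem, the invocation looks roughly like the sketch below. The function name, image and file names are placeholders; --nv is the Singularity/Apptainer option that exposes the host's NVIDIA GPU and driver libraries to the container.

```shell
# Hypothetical wrapper: run dorado correct inside a Singularity image with GPU access.
correct_in_container() {
    image="$1"   # e.g. image.sif
    reads="$2"   # FASTQ file visible inside the container
    out="$3"     # corrected FASTA on the host
    # Without --nv the container cannot see the GPU.
    singularity exec --nv "$image" dorado correct "$reads" > "$out"
}
```

Example: correct_in_container image.sif reads.fastq corrected.fasta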

fen2323 commented 1 week ago

@tijyojwad Do you know if it is possible to run basecalling and dorado correct at the same time on a PromethION tower? Second question, is it possible to run dorado correct and utilize GPU vs CPU?