HERRO correction error with dorado 0.8.0 output (qlen from before and after don't match)

lizakulaeva commented 3 weeks ago

Hi, I have run dorado 0.8.0 simplex basecalling with the sup model, and now I am trying to perform read correction with dorado correct. I noticed that a swarm of errors appeared in the log file of the correction script:

[2024-09-26 20:03:53.034] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar `SSL_CERT_FILE` to specify the location manually.
[2024-09-26 20:03:53.136] [info]  - downloading herro-v1 with httplib
[2024-09-26 20:03:53.323] [error] Failed to download herro-v1: SSL server verification failed
[2024-09-26 20:03:53.323] [info]  - downloading herro-v1 with curl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 22.3M  100 22.3M    0     0  43.7M      0 --:--:-- --:--:-- --:--:-- 43.8M
[2024-09-26 20:03:56.255] [info] Using batch size 28 on device cuda:0 in inference thread 0.
[2024-09-26 20:03:56.255] [info] Using batch size 28 on device cuda:0 in inference thread 1.
[2024-09-26 20:07:29.629] [info] Starting
[2024-09-27 05:01:03.198] [error] qlen from before 6111 and qlen from after 6113 don't match for 1292474e-961d-4fc7-a47f-518e559cca7d
[2024-09-27 05:01:03.198] [error] qlen from before 21775 and qlen from after 21778 don't match for 950ab2d3-b71b-4c74-a9f9-8b6cb33686f1
[2024-09-27 05:01:03.199] [error] qlen from before 17766 and qlen from after 17762 don't match for de910085-8f75-4c56-825a-ea19d25c4104
[2024-09-27 05:01:03.205] [error] qlen from before 6987 and qlen from after 6984 don't match for a5896a8a-ccaa-46db-aa05-02787ce1b97b
[2024-09-27 05:01:03.205] [error] qlen from before 19548 and qlen from after 19546 don't match for 60817dd4-7688-47b3-9a8f-b205733138e4
[2024-09-27 05:01:03.205] [error] qlen from before 5411 and qlen from after 5401 don't match for aa4407f0-2646-4f24-bc60-16bcd1711a17

...and it goes on like this for every read ID.
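
To get a feel for how many reads are affected and by how much, the error lines can be summarised with a small script. This is only a sketch: the regular expression is written against the log lines pasted above, and the function name `qlen_mismatches` is made up for illustration, not part of dorado.

```python
import re

# Pattern matched against the error lines shown in the log above.
QLEN_RE = re.compile(
    r"qlen from before (\d+) and qlen from after (\d+) don't match for (\S+)"
)

def qlen_mismatches(lines):
    """Return (read_id, before, after, after - before) for each mismatch line."""
    found = []
    for line in lines:
        match = QLEN_RE.search(line)
        if match:
            before, after = int(match.group(1)), int(match.group(2))
            found.append((match.group(3), before, after, after - before))
    return found

# Two lines copied from the log above; for a real run, pass an open log file.
log = [
    "[2024-09-27 05:01:03.198] [error] qlen from before 6111 and qlen from after 6113 don't match for 1292474e-961d-4fc7-a47f-518e559cca7d",
    "[2024-09-27 05:01:03.199] [error] qlen from before 17766 and qlen from after 17762 don't match for de910085-8f75-4c56-825a-ea19d25c4104",
]
for read_id, before, after, delta in qlen_mismatches(log):
    print(read_id, before, after, delta)
```

Small differences of a few bases on nearly every read, as seen here, suggest that the two stages are looking at slightly different versions of the same reads rather than at unrelated data.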

Is it a critical warning that will influence the corrected output? If yes, is there a way to solve it? Thank you in advance!

Run environment:

Dorado version: 0.8.0
Dorado command: dorado correct --device cuda:all output.fastq > output_corrected.fasta
Source data type: pod5
Operating system: Linux

Hardware (CPUs, Memory, GPUs): From SLURM script:

#SBATCH -t 60:0:0
#SBATCH --partition=gpu
#SBATCH --gpus=7g.79gb:1
#SBATCH --cpus-per-task=4
#SBATCH --gres=tmpspace:60G
#SBATCH --mem 300G

svc-jstone commented 3 weeks ago

Hi @lizakulaeva!

This is a strange issue. The error indicates that the length of a sequence in the overlap stage does not match the length of the same sequence fetched from the FASTQ in the inference stage. But considering your command line (dorado correct --device cuda:all output.fastq > output_corrected.fasta), the same set of sequences is used in both cases.

Can you please check that there aren't duplicate sequence names in your output.fastq file? How was this file generated?
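
One way to run the duplicate-name check is a short script like the one below. It is a minimal sketch that assumes plain four-line FASTQ records (header, sequence, `+`, qualities); the function name `duplicate_read_names` is hypothetical and not something dorado provides.

```python
from collections import Counter
import io

def duplicate_read_names(handle):
    """Return {name: count} for read names that occur more than once.

    Assumes four-line FASTQ records: header, sequence, '+', qualities.
    """
    counts = Counter()
    while True:
        header = handle.readline()
        if not header.strip():
            break  # end of file or trailing blank line
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header[:20]!r}")
        for _ in range(3):  # skip sequence, '+', and quality lines
            handle.readline()
        # The read name is the first whitespace-delimited token after '@'.
        fields = header[1:].split()
        counts[fields[0] if fields else ""] += 1
    return {name: n for name, n in counts.items() if n > 1}

# Tiny in-memory example; for a real file use open("output.fastq").
example = io.StringIO(
    "@read1 extra\nACGT\n+\nIIII\n"
    "@read2\nAC\n+\nII\n"
    "@read1\nACG\n+\nIII\n"
)
print(duplicate_read_names(example))  # {'read1': 2}
```

An empty result means every read name is unique, in which case duplicates can be ruled out as the cause.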

Thanks!

lizakulaeva commented 3 weeks ago

Hi @svc-jstone, I've found the reason for this error message: I had used an old .fai index file and forgot to delete it before running correction on the new dorado output. Hope my rookie mistake will help someone :) I'm closing this issue now.
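
For anyone who wants to confirm a stale index before deleting it, here is a rough sketch that compares the lengths recorded in a .fai file with the lengths actually present in the FASTQ. It assumes the samtools faidx layout (tab-separated, name in column 1, length in column 2) and four-line FASTQ records; the helper names are invented for illustration, and the file name `output.fastq.fai` in the comment is an assumption about where the index sits.

```python
import io

def read_fai_lengths(fai_handle):
    """Map sequence name -> length from a samtools-style .fai index."""
    lengths = {}
    for line in fai_handle:
        if line.strip():
            name, length = line.rstrip("\n").split("\t")[:2]
            lengths[name] = int(length)
    return lengths

def read_fastq_lengths(fastq_handle):
    """Map read name -> sequence length, assuming four-line FASTQ records."""
    lengths = {}
    while True:
        header = fastq_handle.readline()
        if not header.strip():
            break  # end of file or trailing blank line
        sequence = fastq_handle.readline().strip()
        fastq_handle.readline()  # '+' separator line
        fastq_handle.readline()  # quality line
        lengths[header[1:].split()[0]] = len(sequence)
    return lengths

def stale_index_entries(fai_lengths, fastq_lengths):
    """Return {name: (indexed_length, actual_length)} where the two disagree.

    indexed_length is None when the read is missing from the index.
    """
    problems = {}
    for name, actual in fastq_lengths.items():
        indexed = fai_lengths.get(name)
        if indexed != actual:
            problems[name] = (indexed, actual)
    return problems

# In-memory example; for real data open e.g. output.fastq and output.fastq.fai.
fai = io.StringIO("readA\t4\t7\t4\t5\t14\nreadB\t5\t26\t5\t6\t34\n")
fastq = io.StringIO("@readA\nACGT\n+\nIIII\n@readB\nACG\n+\nIII\n")
print(stale_index_entries(read_fai_lengths(fai), read_fastq_lengths(fastq)))
# {'readB': (5, 3)}
```

If this reports mismatches, the index was built for a different version of the file. In my case, deleting the old .fai before rerunning the correction was what fixed it.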

svc-jstone commented 2 weeks ago

Hi @lizakulaeva, thank you very much for reporting this back!

nanoporetech / dorado, issue #1044