nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/
Other
488 stars 59 forks source link

Dorado: no progress indicator? #621

Closed legoscientist closed 5 months ago

legoscientist commented 7 months ago

I'm not getting very much information about progress during basecalling - ie no percentage to give me an indication as to how long basecalling might take to complete

Run environment:

tijyojwad commented 7 months ago

Hi @legoscientist - can you post the logs from your run? Could you also do a run with -v and report the output?

HalfPhoton commented 7 months ago

Similar to: https://github.com/nanoporetech/dorado/issues/503 ?

legoscientist commented 7 months ago

@HalfPhoton no - I get no percentage indication whatsoever

@tijyojwad: will do a -v run shortly and report the outcome

Mon Feb 12 14:22:05 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   28C    P0    27W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[2024-02-12 14:22:06.993] [info]  - Note: FASTQ output is not recommended as not all data can be preserved.
[2024-02-12 14:22:07.005] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-02-12 14:22:07.009] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-02-12 14:22:12.921] [info] > Creating basecall pipeline
[2024-02-12 14:22:27.810] [info]  - set batch size for cuda:0 to 576
[2024-02-12 14:22:27.821] [info] Barcode for SQK-RBK114-96
legoscientist commented 7 months ago

And with -v included:

0
Mon Feb 12 15:45:44 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:58:00.0 Off |                    0 |
| N/A   40C    P0    28W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[2024-02-12 15:45:56.081] [info]  - Note: FASTQ output is not recommended as not all data can be preserved.
[2024-02-12 15:45:56.408] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-02-12 15:45:56.416] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-02-12 15:46:01.393] [info] > Creating basecall pipeline
[2024-02-12 15:46:56.408] [debug] cuda:0 memory available: 15.36GB
[2024-02-12 15:46:56.408] [debug] Auto batchsize cuda:0: memory limit 14.36GB
[2024-02-12 15:46:56.408] [debug] Auto batchsize cuda:0: testing up to 576 in steps of 64
[2024-02-12 15:46:57.672] [debug] Auto batchsize cuda:0: 64, time per chunk 7.323760 ms
[2024-02-12 15:46:58.609] [debug] Auto batchsize cuda:0: 128, time per chunk 3.642408 ms
[2024-02-12 15:46:59.566] [debug] Auto batchsize cuda:0: 192, time per chunk 2.482170 ms
[2024-02-12 15:47:00.517] [debug] Auto batchsize cuda:0: 256, time per chunk 1.856164 ms
[2024-02-12 15:47:01.469] [debug] Auto batchsize cuda:0: 320, time per chunk 1.485782 ms
[2024-02-12 15:47:02.425] [debug] Auto batchsize cuda:0: 384, time per chunk 1.243413 ms
[2024-02-12 15:47:03.446] [debug] Auto batchsize cuda:0: 448, time per chunk 1.058210 ms
[2024-02-12 15:47:04.402] [debug] Auto batchsize cuda:0: 512, time per chunk 0.928830 ms
[2024-02-12 15:47:05.365] [debug] Auto batchsize cuda:0: 576, time per chunk 0.830116 ms
[2024-02-12 15:47:05.365] [debug] Device cuda:0 Model memory 9.83GB
[2024-02-12 15:47:05.365] [debug] Device cuda:0 Decode memory 4.06GB
[2024-02-12 15:47:05.431] [info]  - set batch size for cuda:0 to 576
[2024-02-12 15:47:05.431] [debug] - adjusted chunk size to match model stride: 10000 -> 9996
[2024-02-12 15:47:05.438] [debug] Creating barcoding info for kit: SQK-RBK114-96
[2024-02-12 15:47:05.438] [info] Barcode for SQK-RBK114-96
[2024-02-12 15:47:05.439] [debug] - adjusted overlap to match model stride: 500 -> 498
[2024-02-12 15:47:05.445] [debug] Load reads from file /home/sbzrhc/in/pod5in/output_01.pod5
[2024-02-12 15:47:08.515] [debug] > Kits to evaluate: 1
[2024-02-12 15:59:21.308] [info] > Simplex reads basecalled: 28699
[2024-02-12 15:59:21.309] [info] > Simplex reads filtered: 8703
[2024-02-12 15:59:21.309] [info] > Basecalled @ Samples/s: 1.744464e+06
[2024-02-12 15:59:21.309] [info] > 28701 reads demuxed @ classifications/s: 3.900297e+01
[2024-02-12 15:59:21.406] [info] > Finished
tijyojwad commented 7 months ago

Thanks! Is the stderr being redirected to a file?

legoscientist commented 7 months ago

It wasn't, but I have set it to and it basically gives the same information as above (that was in the output file)

stderr file:

[2024-02-13 13:06:15.412] [info] - Note: FASTQ output is not recommended as not all data can be preserved. [2024-02-13 13:06:15.706] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt [2024-02-13 13:06:15.713] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib [2024-02-13 13:06:21.250] [info] > Creating basecall pipeline [2024-02-13 13:06:56.532] [debug] cuda:0 memory available: 15.36GB [2024-02-13 13:06:56.533] [debug] Auto batchsize cuda:0: memory limit 14.36GB [2024-02-13 13:06:56.533] [debug] Auto batchsize cuda:0: testing up to 576 in steps of 64 [2024-02-13 13:06:58.158] [debug] Auto batchsize cuda:0: 64, time per chunk 7.523056 ms [2024-02-13 13:06:59.121] [debug] Auto batchsize cuda:0: 128, time per chunk 3.750176 ms [2024-02-13 13:07:00.094] [debug] Auto batchsize cuda:0: 192, time per chunk 2.526757 ms [2024-02-13 13:07:01.069] [debug] Auto batchsize cuda:0: 256, time per chunk 1.891712 ms [2024-02-13 13:07:02.046] [debug] Auto batchsize cuda:0: 320, time per chunk 1.523702 ms [2024-02-13 13:07:03.025] [debug] Auto batchsize cuda:0: 384, time per chunk 1.271035 ms [2024-02-13 13:07:03.997] [debug] Auto batchsize cuda:0: 448, time per chunk 1.082032 ms [2024-02-13 13:07:04.975] [debug] Auto batchsize cuda:0: 512, time per chunk 0.951096 ms [2024-02-13 13:07:05.958] [debug] Auto batchsize cuda:0: 576, time per chunk 0.850089 ms [2024-02-13 13:07:05.958] [debug] Device cuda:0 Model memory 9.83GB [2024-02-13 13:07:05.958] [debug] Device cuda:0 Decode memory 4.06GB [2024-02-13 13:07:06.009] [info] - set batch size for cuda:0 to 576 [2024-02-13 13:07:06.009] [debug] - adjusted chunk size to match model stride: 10000 -> 9996 [2024-02-13 13:07:06.013] [debug] Creating barcoding info for kit: SQK-RBK114-96 [2024-02-13 13:07:06.013] [info] Barcode for SQK-RBK114-96 [2024-02-13 13:07:06.014] [debug] - adjusted overlap to match model stride: 500 -> 498 [2024-02-13 13:07:06.016] [debug] Load reads from file /home/sbzrhc/in/pod5in/output_01.pod5 [2024-02-13 13:07:09.004] [debug] > Kits to evaluate: 1 [2024-02-13 13:19:23.991] [info] > Simplex reads basecalled: 28699 [2024-02-13 13:19:23.992] [info] > Simplex reads filtered: 8703 [2024-02-13 13:19:23.992] [info] > Basecalled @ Samples/s: 1.739481e+06 [2024-02-13 13:19:23.992] [info] > 28701 reads demuxed @ classifications/s: 3.889156e+01 [2024-02-13 13:19:24.040] [info] > Finished

tijyojwad commented 7 months ago

Thanks! Are you looking at the final output after you run basecalling or observing that run as it's going? when redirecting stderr to a file we disable progress bar printing

The progress bar is removed from the terminal once basecalling is finished. e.g.

$ /dorado-0.5.3/bin/dorado basecaller models/dna_r10.4.1_e8.2_400bps_hac\@v4.3.0 /pod5s --max-reads 10000 -b 512 -x "cuda:0" > /dev/null 
[2024-02-13 16:33:58.500] [info] > Creating basecall pipeline
[██████▎                       ] 20% [00m:03s<00m:11s] 

but once it's finished, the output is

$ /dorado-0.5.3/bin/dorado basecaller models/dna_r10.4.1_e8.2_400bps_hac\@v4.3.0 /pod5s --max-reads 10000 -b 512 -x "cuda:0" > /dev/null 
[2024-02-13 16:33:58.500] [info] > Creating basecall pipeline
[2024-02-13 16:34:17.468] [info] > Simplex reads basecalled: 10000
[2024-02-13 16:34:17.468] [info] > Simplex reads filtered: 1
[2024-02-13 16:34:17.468] [info] > Basecalled @ Samples/s: 6.645240e+07
[2024-02-13 16:34:17.482] [info] > Finished
legoscientist commented 7 months ago

I have been trying to observe it while running and haven't been getting anything like what you are seeing on the first thing you posted

dorado

Then I get nothing until the run is completed when refreshing the "less slurm-[job number].out" command (and that was without the stderr redirect) , and then I get the lines about simplex reads basecalled etc once finished.

I'm not massively experienced with this, hence why I am struggling! I have also put my script below so that you can see what I am using!

#!/bin/bash
#SBATCH --partition=voltaq
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:1
#SBATCH --mem=116g
#SBATCH --time=3-00:00:00

cd /home/sbzrhc/in/dorado-0.5.2/bin

export CUDA_VISIBLE_DEVICES=`/software/gpucheck/gpuuse.sh`
echo $CUDA_VISIBLE_DEVICES
nvidia-smi

# Script for basecalling Nanopore data with GPU acceleration
# through Dorado
# 02/02/2024

# Specify input and output path:
IN=/home/sbzrhc/in/pod5in/output_01.pod5 
OUT=/home/sbzrhc/in/pod5out/test2.bam

mkdir -p /home/sbzrhc/in/pod5out

#default mode is to run on a pod folder as last argument and write the data to the screen so we > (redirect) output to a file
./dorado basecaller -v sup --min-qscore 10 --kit-name SQK-RBK114-96 --trim primers --device cuda:0 $IN > $OUT
tijyojwad commented 7 months ago

Then I get nothing until the run is completed when refreshing the "less slurm-[job number].out"

Looks like slurm is redirecting stderr to a file (to slurm-[job number].out. So dorado progress bar will be disabled.

legoscientist commented 7 months ago

it doesn't do that with guppy though - the only difference here is that dorado is installed locally, guppy is installed as a module on hpc? Hoping that Dorado will be installed as a module on the new HPC when it is launched....

tijyojwad commented 5 months ago

Sorry for the late response. Unfortunately I won't be able to help with how your HPC is configured, but if the output of dorado is redirected to a file we disable the progress bar (otherwise the progress bar will spam the entire file and the logs will become useless).

TristanLefebure commented 2 months ago

If I may comment on this closed issue, a progress bar only visible interactively (outside a log file), is not very usefull when you submit your jobs to cluster. As an alternative, would it be possible to output progress information in a simpler way like a line very 100k reads: [X reads basecalled out Y] ? Thanks

HalfPhoton commented 2 months ago

@TristanLefebure,

Can you please create a new issue with this feature request?