Closed · legoscientist closed this 5 months ago
Hi @legoscientist - can you post the logs from your run? Could you also do a run with -v
and report the output?
Similar to: https://github.com/nanoporetech/dorado/issues/503 ?
@HalfPhoton no - I get no percentage indication whatsoever
@tijyojwad: will do a -v run shortly and report the outcome
Mon Feb 12 14:22:05 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01 Driver Version: 470.42.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:D8:00.0 Off | 0 |
| N/A 28C P0 27W / 250W | 0MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[2024-02-12 14:22:06.993] [info] - Note: FASTQ output is not recommended as not all data can be preserved.
[2024-02-12 14:22:07.005] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-02-12 14:22:07.009] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-02-12 14:22:12.921] [info] > Creating basecall pipeline
[2024-02-12 14:22:27.810] [info] - set batch size for cuda:0 to 576
[2024-02-12 14:22:27.821] [info] Barcode for SQK-RBK114-96
And with -v included:
0
Mon Feb 12 15:45:44 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01 Driver Version: 470.42.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:58:00.0 Off | 0 |
| N/A 40C P0 28W / 250W | 0MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[2024-02-12 15:45:56.081] [info] - Note: FASTQ output is not recommended as not all data can be preserved.
[2024-02-12 15:45:56.408] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-02-12 15:45:56.416] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-02-12 15:46:01.393] [info] > Creating basecall pipeline
[2024-02-12 15:46:56.408] [debug] cuda:0 memory available: 15.36GB
[2024-02-12 15:46:56.408] [debug] Auto batchsize cuda:0: memory limit 14.36GB
[2024-02-12 15:46:56.408] [debug] Auto batchsize cuda:0: testing up to 576 in steps of 64
[2024-02-12 15:46:57.672] [debug] Auto batchsize cuda:0: 64, time per chunk 7.323760 ms
[2024-02-12 15:46:58.609] [debug] Auto batchsize cuda:0: 128, time per chunk 3.642408 ms
[2024-02-12 15:46:59.566] [debug] Auto batchsize cuda:0: 192, time per chunk 2.482170 ms
[2024-02-12 15:47:00.517] [debug] Auto batchsize cuda:0: 256, time per chunk 1.856164 ms
[2024-02-12 15:47:01.469] [debug] Auto batchsize cuda:0: 320, time per chunk 1.485782 ms
[2024-02-12 15:47:02.425] [debug] Auto batchsize cuda:0: 384, time per chunk 1.243413 ms
[2024-02-12 15:47:03.446] [debug] Auto batchsize cuda:0: 448, time per chunk 1.058210 ms
[2024-02-12 15:47:04.402] [debug] Auto batchsize cuda:0: 512, time per chunk 0.928830 ms
[2024-02-12 15:47:05.365] [debug] Auto batchsize cuda:0: 576, time per chunk 0.830116 ms
[2024-02-12 15:47:05.365] [debug] Device cuda:0 Model memory 9.83GB
[2024-02-12 15:47:05.365] [debug] Device cuda:0 Decode memory 4.06GB
[2024-02-12 15:47:05.431] [info] - set batch size for cuda:0 to 576
[2024-02-12 15:47:05.431] [debug] - adjusted chunk size to match model stride: 10000 -> 9996
[2024-02-12 15:47:05.438] [debug] Creating barcoding info for kit: SQK-RBK114-96
[2024-02-12 15:47:05.438] [info] Barcode for SQK-RBK114-96
[2024-02-12 15:47:05.439] [debug] - adjusted overlap to match model stride: 500 -> 498
[2024-02-12 15:47:05.445] [debug] Load reads from file /home/sbzrhc/in/pod5in/output_01.pod5
[2024-02-12 15:47:08.515] [debug] > Kits to evaluate: 1
[2024-02-12 15:59:21.308] [info] > Simplex reads basecalled: 28699
[2024-02-12 15:59:21.309] [info] > Simplex reads filtered: 8703
[2024-02-12 15:59:21.309] [info] > Basecalled @ Samples/s: 1.744464e+06
[2024-02-12 15:59:21.309] [info] > 28701 reads demuxed @ classifications/s: 3.900297e+01
[2024-02-12 15:59:21.406] [info] > Finished
Thanks! Is the stderr
being redirected to a file?
It wasn't, but I have now set it to do so, and it gives essentially the same information as above (that was in the output file)
stderr file:
[2024-02-13 13:06:15.412] [info] - Note: FASTQ output is not recommended as not all data can be preserved.
[2024-02-13 13:06:15.706] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-02-13 13:06:15.713] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-02-13 13:06:21.250] [info] > Creating basecall pipeline
[2024-02-13 13:06:56.532] [debug] cuda:0 memory available: 15.36GB
[2024-02-13 13:06:56.533] [debug] Auto batchsize cuda:0: memory limit 14.36GB
[2024-02-13 13:06:56.533] [debug] Auto batchsize cuda:0: testing up to 576 in steps of 64
[2024-02-13 13:06:58.158] [debug] Auto batchsize cuda:0: 64, time per chunk 7.523056 ms
[2024-02-13 13:06:59.121] [debug] Auto batchsize cuda:0: 128, time per chunk 3.750176 ms
[2024-02-13 13:07:00.094] [debug] Auto batchsize cuda:0: 192, time per chunk 2.526757 ms
[2024-02-13 13:07:01.069] [debug] Auto batchsize cuda:0: 256, time per chunk 1.891712 ms
[2024-02-13 13:07:02.046] [debug] Auto batchsize cuda:0: 320, time per chunk 1.523702 ms
[2024-02-13 13:07:03.025] [debug] Auto batchsize cuda:0: 384, time per chunk 1.271035 ms
[2024-02-13 13:07:03.997] [debug] Auto batchsize cuda:0: 448, time per chunk 1.082032 ms
[2024-02-13 13:07:04.975] [debug] Auto batchsize cuda:0: 512, time per chunk 0.951096 ms
[2024-02-13 13:07:05.958] [debug] Auto batchsize cuda:0: 576, time per chunk 0.850089 ms
[2024-02-13 13:07:05.958] [debug] Device cuda:0 Model memory 9.83GB
[2024-02-13 13:07:05.958] [debug] Device cuda:0 Decode memory 4.06GB
[2024-02-13 13:07:06.009] [info] - set batch size for cuda:0 to 576
[2024-02-13 13:07:06.009] [debug] - adjusted chunk size to match model stride: 10000 -> 9996
[2024-02-13 13:07:06.013] [debug] Creating barcoding info for kit: SQK-RBK114-96
[2024-02-13 13:07:06.013] [info] Barcode for SQK-RBK114-96
[2024-02-13 13:07:06.014] [debug] - adjusted overlap to match model stride: 500 -> 498
[2024-02-13 13:07:06.016] [debug] Load reads from file /home/sbzrhc/in/pod5in/output_01.pod5
[2024-02-13 13:07:09.004] [debug] > Kits to evaluate: 1
[2024-02-13 13:19:23.991] [info] > Simplex reads basecalled: 28699
[2024-02-13 13:19:23.992] [info] > Simplex reads filtered: 8703
[2024-02-13 13:19:23.992] [info] > Basecalled @ Samples/s: 1.739481e+06
[2024-02-13 13:19:23.992] [info] > 28701 reads demuxed @ classifications/s: 3.889156e+01
[2024-02-13 13:19:24.040] [info] > Finished
Thanks! Are you looking at the final output after basecalling finishes, or observing the run as it's going? When stderr is redirected to a file, we disable progress bar printing.
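To illustrate the mechanism being described (this is a generic sketch, not dorado's actual code): command-line tools typically decide whether to draw a progress bar by testing whether stderr is attached to a terminal, which in shell is `[ -t 2 ]`.

```shell
# Minimal sketch of TTY detection on stderr (fd 2).
# Not dorado's implementation - just the general technique.
check_stderr() {
  if [ -t 2 ]; then
    echo "interactive: draw progress bar"
  else
    echo "redirected: suppress progress bar"
  fi
}

# Simulate what slurm does by redirecting stderr to a file:
check_stderr 2>/tmp/dorado-stderr.log   # → redirected: suppress progress bar
```

Because slurm attaches the job's stderr to a file rather than a terminal, the redirected branch is taken for every batch job.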
The progress bar is removed from the terminal once basecalling is finished. e.g.
$ /dorado-0.5.3/bin/dorado basecaller models/dna_r10.4.1_e8.2_400bps_hac\@v4.3.0 /pod5s --max-reads 10000 -b 512 -x "cuda:0" > /dev/null
[2024-02-13 16:33:58.500] [info] > Creating basecall pipeline
[██████▎ ] 20% [00m:03s<00m:11s]
but once it's finished, the output is
$ /dorado-0.5.3/bin/dorado basecaller models/dna_r10.4.1_e8.2_400bps_hac\@v4.3.0 /pod5s --max-reads 10000 -b 512 -x "cuda:0" > /dev/null
[2024-02-13 16:33:58.500] [info] > Creating basecall pipeline
[2024-02-13 16:34:17.468] [info] > Simplex reads basecalled: 10000
[2024-02-13 16:34:17.468] [info] > Simplex reads filtered: 1
[2024-02-13 16:34:17.468] [info] > Basecalled @ Samples/s: 6.645240e+07
[2024-02-13 16:34:17.482] [info] > Finished
I have been trying to observe it while it's running, and I haven't been getting anything like the first thing you posted.
I get nothing when refreshing with "less slurm-[job number].out" until the run is completed (and that was without the stderr redirect); once it finishes, I get the lines about simplex reads basecalled etc.
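As an aside on watching a growing job log: rather than repeatedly reopening the file with `less`, `tail -f slurm-<jobid>.out` follows it live (the file name here is a placeholder). A minimal self-contained illustration of `tail` picking up the newest line:

```shell
# Build a small stand-in for a slurm output file
# (/tmp/slurm-demo.out is an illustrative name):
printf 'read batch 1\nread batch 2\nread batch 3\n' > /tmp/slurm-demo.out

# Show only the most recent line; with -f instead of -n 1,
# tail would keep following the file as the job appends to it.
tail -n 1 /tmp/slurm-demo.out   # → read batch 3
```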
I'm not massively experienced with this, hence my struggles! I have also put my script below so that you can see what I am using.
#!/bin/bash
#SBATCH --partition=voltaq
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:1
#SBATCH --mem=116g
#SBATCH --time=3-00:00:00
cd /home/sbzrhc/in/dorado-0.5.2/bin
export CUDA_VISIBLE_DEVICES=`/software/gpucheck/gpuuse.sh`
echo $CUDA_VISIBLE_DEVICES
nvidia-smi
# Script for basecalling Nanopore data with GPU acceleration
# through Dorado
# 02/02/2024
# Specify input and output path:
IN=/home/sbzrhc/in/pod5in/output_01.pod5
OUT=/home/sbzrhc/in/pod5out/test2.bam
mkdir -p /home/sbzrhc/in/pod5out
# Default mode is to run on a pod5 folder given as the last argument and write the data to stdout, so we > (redirect) the output to a file
./dorado basecaller -v sup --min-qscore 10 --kit-name SQK-RBK114-96 --trim primers --device cuda:0 $IN > $OUT
> I get nothing when refreshing with "less slurm-[job number].out" until the run is completed

Looks like slurm is redirecting stderr to a file (slurm-[job number].out), so the dorado progress bar will be disabled.
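For what it's worth, SLURM can at least keep the two streams separate via the `#SBATCH --output` / `--error` directives (file names below are illustrative; `%j` expands to the job ID). The progress bar will still be disabled, since stderr is still not a terminal, but dorado's log lines won't be mixed with anything written to stdout:

```shell
#SBATCH --output=slurm-%j.out   # stdout from the job script
#SBATCH --error=dorado-%j.err   # stderr: dorado's [info]/[debug] lines land here
```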
It doesn't do that with guppy though - the only difference here is that dorado is installed locally, while guppy is installed as a module on the HPC. I'm hoping that Dorado will be installed as a module on the new HPC when it is launched...
Sorry for the late response. Unfortunately I won't be able to help with how your HPC is configured, but if the output of dorado is redirected to a file, we disable the progress bar (otherwise the progress bar would spam the entire file and make the logs useless).
If I may comment on this closed issue: a progress bar that is only visible interactively (outside a log file) is not very useful when you submit your jobs to a cluster. As an alternative, would it be possible to output progress information in a simpler way, such as one line every 100k reads: [X reads basecalled out of Y]? Thanks
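The suggested behaviour can be sketched in shell (the counts are toy values: `total=10` and `step=3` here, where the request suggests a step of 100000 reads):

```shell
# Sketch of periodic plain-text progress logging instead of an
# interactive bar: one log line every `step` reads, plus a final line.
total=10
step=3
count=0
while [ "$count" -lt "$total" ]; do
  count=$((count + 1))
  if [ $((count % step)) -eq 0 ]; then
    echo "[progress] $count reads basecalled out of $total"
  fi
done
echo "[progress] $count reads basecalled out of $total"
```

Because each line is a complete, timestamp-friendly record, this style of output survives redirection to a log file, which is exactly the case the interactive bar cannot handle.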
@TristanLefebure,
Can you please create a new issue with this feature request?
I'm not getting very much information about progress during basecalling - i.e. no percentage indication to suggest how long basecalling might take to complete.
Run environment: