Closed baozg closed 1 week ago
Please provide the full job manager submission command in addition to the dorado one. Also, submit nvidia-smi as a job to the same job manager to see if that gives any hints.
I manage a Slurm cluster where a job that does not request a GPU will not see any GPUs. This is necessary to stop multiple submitted jobs from fighting over GPU memory. However, anyone who SSHes into the system can see all GPUs, which are kept visible for maintenance and monitoring. If that is your case, you can see the GPUs via ssh -> nvidia-smi, but the job cannot.
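As an illustration only (the flag and resource names are assumptions and vary between sites), a minimal Slurm job script that explicitly requests a GPU and then reports what it can actually see might look like this:
# gpu_check.sh (sketch; --gres=gpu:1 is an assumed resource name)
#!/bin/bash
#SBATCH --gres=gpu:1     # request one GPU so the scheduler exposes it to the job
#SBATCH --time=00:05:00
# When a GPU is granted, the scheduler typically sets CUDA_VISIBLE_DEVICES to its index;
# an empty/unset value means the job sees no device even though ssh + nvidia-smi does.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
nvidia-smi -L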
Thank you for the suggestions @StephDC - much appreciated!
Thanks for the quick reply! Here is the command I used; I can run other tools that require a GPU.
# qsub
qsub -l gpu=1 -l h_vmem=100G -l h_rt=120:00:00 -l gpumem=80G dorado.sh
# dorado.sh
dorado basecaller -x cuda:0 --estimate-poly-a --no-trim ./dorado-0.8.3-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0 20241009_1520_2A_PAY14203_9403b010/pod5 > call.bam
# error message
[2024-11-12 20:14:07.165] [info] Running: "basecaller" "-x" "cuda:0" "--estimate-poly-a" "--no-trim" "./dorado-0.8.3-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0" "../../tmp/LoRNA/rawdata/LR/raw-data/24052_Pool1_20240920_04280/20241009_1520_2A_PAY14203_9403b010/pod5"
[2024-11-12 20:14:07.174] [error] Invalid CUDA device index '0' from device string "cuda:0", there are 0 visible CUDA devices.
CUDA device string format: "cuda:0,...,N" or "cuda:all".
Testing with nvidia-smi alone:
# qsub
qsub -l gpu=1 -l h_vmem=100G -l h_rt=120:00:00 -l gpumem=80G test_gpu.sh
# test_gpu.sh
nvidia-smi
# output from test_gpu.sh
Wed Nov 13 11:17:22 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59 Driver Version: 550.100 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... Off | 00000000:01:00.0 Off | 0 |
| N/A 60C P0 76W / 300W | 698MiB / 81920MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... Off | 00000000:81:00.0 Off | 0 |
| N/A 54C P0 69W / 300W | 698MiB / 81920MiB | 17% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
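Not part of the original test, but a quick sketch of extra diagnostics that could be added to test_gpu.sh to see what the scheduler actually hands to the job (the grep pattern is just a guess at site-specific variable names):
# extra checks for test_gpu.sh (sketch only)
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"   # what CUDA applications such as dorado will see
env | grep -i -E 'cuda|gpu'                                  # any GPU-related variables set by SGE
ls -l /dev/nvidia* 2>/dev/null                               # device nodes visible inside the job
nvidia-smi -L                                                # GPUs the NVIDIA tools can enumerate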
Can you try setting the following environment variable?
CUDA_VISIBLE_DEVICES=0 dorado basecaller -x cuda:all --estimate-poly-a ...
Best regards, Rich
I tried setting CUDA_VISIBLE_DEVICES=0, but it gave the same error:
[2024-11-13 15:20:23.729] [error] device string set to cuda:all but no CUDA devices available.
Do you think I should contact the administrator of our HPC? The command I ran was:
CUDA_VISIBLE_DEVICES=0 dorado basecaller -x cuda:0 --estimate-poly-a --no-trim
It appears that your nvidia-smi is reporting N/A as the CUDA version - see the top-right corner.
Your nvidia-smi (440.59) is also reporting a much earlier version than the underlying NVIDIA driver (550.100).
Check your environment, and if that doesn't work -
Let your cluster admin know about this and see if they can fix the NVIDIA driver and CUDA installation.
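For reference, a hedged sketch of how that driver/user-space mismatch could be checked from a shell on the compute node (standard Linux paths; nothing dorado-specific):
# check for a stale nvidia-smi / libcuda (sketch only)
which -a nvidia-smi              # more than one hit suggests an old copy earlier in PATH
nvidia-smi | head -n 4           # version banner of the nvidia-smi actually being run
cat /proc/driver/nvidia/version  # version of the kernel driver that is really loaded
ldconfig -p | grep libcuda       # which libcuda.so the dynamic linker will pick up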
Many thanks for your help! They have fixed it recently and I can now run dorado smoothly.
Issue Report
Please describe the issue:
Hi, I am trying to use dorado for basecalling cDNA reads, but it cannot run. I submit the job to the HPC via the SGE system; nvidia-smi can find the GPU devices, but dorado fails.
Steps to reproduce the issue:
Please list any steps to reproduce the issue.
Run environment:
dorado basecaller -x cuda:0 --estimate-poly-a --no-trim ./dorado-0.8.3-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0 20241009_1520_2A_PAY14203_9403b010/pod5 > call.bam
Logs