nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Dorado cannot find GPU #1128

Closed baozg closed 20 hours ago

baozg commented 1 day ago

Issue Report

Please describe the issue:

Hi, I am trying to use dorado to basecall cDNA reads, but it will not run. I submit the job to our HPC through the SGE system. nvidia-smi can find the GPU devices, but dorado fails:

[2024-11-12 20:14:07.165] [info] Running: "basecaller" "-x" "cuda:0" "--estimate-poly-a" "--no-trim" "./dorado-0.8.3-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0" "../../tmp/LoRNA/rawdata/LR/raw-data/24052_Pool1_20240920_04280/20241009_1520_2A_PAY14203_9403b010/pod5"
[2024-11-12 20:14:07.174] [error] Invalid CUDA device index '0' from device string "cuda:0", there are 0 visible CUDA devices.
CUDA device string format: "cuda:0,...,N" or "cuda:all".

### output of nvidia-smi
Tue Nov 12 20:14:05 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 550.100      CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:01:00.0 Off |                    0 |
| N/A   60C    P0    77W / 300W |    698MiB / 81920MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:81:00.0 Off |                    0 |
| N/A   60C    P0    72W / 300W |    698MiB / 81920MiB |     16%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

StephDC commented 1 day ago

Please provide the full job manager submission command in addition to the dorado one. Also, submit nvidia-smi as a job to the same job manager to see if that gives some hint.

I manage a Slurm cluster on which a job that does not request a GPU cannot see any GPU. This is necessary to stop multiple submitted jobs from fighting over GPU memory. However, anyone who logs in over ssh can see all GPUs, for maintenance and monitoring reasons. If that is your case, you can see the GPU when you run nvidia-smi over ssh, but the job cannot.
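
To compare what an interactive shell and a batch job each see, the same small diagnostic can be run in both places. The following is a minimal sketch, not part of dorado or of any scheduler: the function name gpu_report is hypothetical, and it relies only on nvidia-smi -L (which lists GPUs) and the CUDA_VISIBLE_DEVICES environment variable.

```shell
#!/bin/sh
# gpu_report: print what the current process can see of the GPUs.
# Hypothetical helper for illustration; not part of dorado or any scheduler.
gpu_report() {
    echo "host: $(uname -n)"
    # Unset and empty are different: an empty value hides every CUDA device.
    if [ -z "${CUDA_VISIBLE_DEVICES+x}" ]; then
        echo "CUDA_VISIBLE_DEVICES: <unset>"
    else
        echo "CUDA_VISIBLE_DEVICES: '${CUDA_VISIBLE_DEVICES}'"
    fi
    if command -v nvidia-smi >/dev/null 2>&1; then
        # -L prints one line per GPU that nvidia-smi can list.
        nvidia-smi -L || echo "nvidia-smi: failed with exit status $?"
    else
        echo "nvidia-smi: not found on PATH"
    fi
}

gpu_report
```

Running it once from the login shell and once through the job manager, then comparing the two outputs, shows whether the job sees fewer devices than the interactive session.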

HalfPhoton commented 1 day ago

Thank you for the suggestions @StephDC - much appreciated!

baozg commented 1 day ago

Thanks for the quick reply! Here is the command I used. I can run other tools that require a GPU.

# qsub
qsub -l gpu=1 -l h_vmem=100G -l h_rt=120:00:00 -l gpumem=80G dorado.sh

# dorado.sh

dorado basecaller -x cuda:0 --estimate-poly-a --no-trim ./dorado-0.8.3-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0 0241009_1520_2A_PAY14203_9403b010/pod5 > call.bam

# error message

[2024-11-12 20:14:07.165] [info] Running: "basecaller" "-x" "cuda:0" "--estimate-poly-a" "--no-trim" "./dorado-0.8.3-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0" "../../tmp/LoRNA/rawdata/LR/raw-data/24052_Pool1_20240920_04280/20241009_1520_2A_PAY14203_9403b010/pod5"
[2024-11-12 20:14:07.174] [error] Invalid CUDA device index '0' from device string "cuda:0", there are 0 visible CUDA devices.
CUDA device string format: "cuda:0,...,N" or "cuda:all".

Test with nvidia-smi only:

# qsub
qsub -l gpu=1 -l h_vmem=100G -l h_rt=120:00:00 -l gpumem=80G test_gpu.sh

# test_gpu.sh
nvidia-smi

# output from test_gpu.sh
Wed Nov 13 11:17:22 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 550.100      CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:01:00.0 Off |                    0 |
| N/A   60C    P0    76W / 300W |    698MiB / 81920MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:81:00.0 Off |                    0 |
| N/A   54C    P0    69W / 300W |    698MiB / 81920MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

HalfPhoton commented 21 hours ago

Can you try setting the following environment variable?

CUDA_VISIBLE_DEVICES=0 dorado basecaller -x cuda:all --estimate-poly-a ...

Best regards, Rich
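
For context, the VAR=value prefix used in the command above sets the variable only in the environment of that one command, not for the rest of the script. A small sketch of that shell behaviour, with sh -c as a stand-in for dorado (the echoed labels are illustrative only):

```shell
#!/bin/sh
# The prefix form affects only this one command's environment.
CUDA_VISIBLE_DEVICES=0 sh -c 'echo "with prefix: ${CUDA_VISIBLE_DEVICES-<unset>}"'

# A later command inherits whatever the job environment already had.
sh -c 'echo "without prefix: ${CUDA_VISIBLE_DEVICES-<unset>}"'

# export makes the setting apply to every later command in the script.
export CUDA_VISIBLE_DEVICES=0
sh -c 'echo "after export: ${CUDA_VISIBLE_DEVICES-<unset>}"'
```

Note that CUDA_VISIBLE_DEVICES only filters and reorders devices the CUDA runtime can already find; it cannot make a device appear if the runtime finds none.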

baozg commented 21 hours ago

I tried setting CUDA_VISIBLE_DEVICES=0, but I got the same error: [2024-11-13 15:20:23.729] [error] device string set to cuda:all but no CUDA devices available. Do you think I should contact the administrator of our HPC?


CUDA_VISIBLE_DEVICES=0 dorado basecaller -x cuda:0 --estimate-poly-a --no-trim

StephDC commented 20 hours ago

It appears that your nvidia-smi is reporting N/A as the CUDA version. See the top-right corner of its output.

Your nvidia-smi (440.59) is also showing a much earlier version than the underlying NVIDIA driver (550.100).

Check your environment, and if that doesn't help, let your cluster admin know about this and see if they can fix your NVIDIA driver and CUDA installation.
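
Both symptoms can be read from the nvidia-smi header line. The sketch below is an illustration only: check_smi_header is a hypothetical helper, and treating a version mismatch or a CUDA version of N/A as a warning is an assumption drawn from this thread, not an official NVIDIA check.

```shell
#!/bin/sh
# check_smi_header: hypothetical helper that parses the nvidia-smi header line
# and warns about the two symptoms discussed in this thread.
check_smi_header() {
    header=$1
    # Extract the three version fields from the header line.
    smi=$(printf '%s\n' "$header" | sed -n 's/.*NVIDIA-SMI \([0-9.]*\).*/\1/p')
    drv=$(printf '%s\n' "$header" | sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p')
    cuda=$(printf '%s\n' "$header" | sed -n 's/.*CUDA Version: \([^ |]*\).*/\1/p')
    status=0
    echo "nvidia-smi=$smi driver=$drv cuda=$cuda"
    if [ "$smi" != "$drv" ]; then
        echo "warning: nvidia-smi version differs from driver version"
        status=1
    fi
    if [ "$cuda" = "N/A" ] || [ -z "$cuda" ]; then
        echo "warning: no CUDA version reported"
        status=1
    fi
    return $status
}

# Header line copied from the nvidia-smi output posted in this thread.
check_smi_header "| NVIDIA-SMI 440.59       Driver Version: 550.100      CUDA Version: N/A      |" || true
```

On the header posted here, the function reports both warnings and returns a non-zero status, which a job script could use to stop before launching dorado.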

baozg commented 20 hours ago

Many thanks for your help! They have since fixed this, and now I can run it smoothly.