mila-iqia / mila-docs

Mila technical documentation
https://docs.mila.quebec

frameworks/pytorch_setup example returns "No GPU detected, not printing devices' names." #213

Closed by satyaog 1 year ago

satyaog commented 1 year ago

The frameworks/pytorch_setup example fails to detect a GPU, even though nvidia-smi reports one attached to the job:

# slurm-3549313.out
Date:     Thu Aug 31 16:31:12 EDT 2023
Hostname: cn-a006
The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) gcc/7.4.0   2) Mila
[=== Module anaconda/3 loaded ===]
PyTorch built with CUDA:         False
PyTorch detects CUDA available:  False
PyTorch-detected #GPUs:          0
    No GPU detected, not printing devices' names.

======== GPU REPORT ========

==============NVSMI LOG==============

Timestamp                                 : Thu Aug 31 16:31:17 2023
Driver Version                            : 535.86.10
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:3A:00.0
    Accounting Mode                       : Enabled
    Accounting Mode Buffer Size           : 4000
    Accounted Processes                   : None

Thu Aug 31 16:31:17 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 8000                On  | 00000000:3A:00.0 Off |                  Off |
| 33%   27C    P8              34W / 260W |      1MiB / 49152MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

satyaog commented 1 year ago

@lebrice, @obilaniu, would you know why the GPU is not detected?

lebrice commented 1 year ago

Very strange! I'm running the example as a unit test here, and it isn't outputting the same thing: https://github.com/lebrice/mila-docs/blob/lebrice/test_examples/tests/test_examples/test_pytorch_example_frameworks_pytorch_setup_False_gres_gpu_rtx8000_1_.txt

Here is how the conda environment is created: https://github.com/lebrice/mila-docs/blob/lebrice/test_examples/docs/examples/frameworks/pytorch_setup/make_env.sh

The test code is here: https://github.com/lebrice/mila-docs/blob/lebrice/test_examples/tests/test_examples.py#L201

satyaog commented 1 year ago

My bad, I tried to reproduce something too quickly and forgot to create the env within an allocation. Closing.
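
For anyone landing here with the same log: nvidia-smi lists a Quadro RTX 8000 while PyTorch reports that it was built without CUDA, which points at the installed PyTorch build rather than at the node. The thread does not spell out the mechanism; a plausible reading, stated here as an assumption, is that an environment created outside a GPU allocation ended up with a CPU-only build. Below is a minimal sketch of a check that separates the two cases. The `diagnose` and `report` helpers are hypothetical names, not part of the mila-docs example.

```python
import importlib.util


def diagnose(built_with_cuda: bool, cuda_available: bool, gpu_count: int) -> str:
    """Map the three values printed by the example to a likely cause.

    Hypothetical helper: the messages are suggestions, not official guidance.
    """
    if not built_with_cuda:
        # Matches the log in this issue: the build itself has no CUDA support,
        # so no GPU can be detected regardless of what nvidia-smi shows.
        return "cpu-only build: recreate the environment inside a GPU allocation"
    if not cuda_available or gpu_count == 0:
        # The build supports CUDA, but no device is visible to this process.
        return "cuda build, but no usable GPU: check the allocation and driver"
    return "ok"


def report() -> str:
    """Query the installed PyTorch, if any, and return the diagnosis."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    return diagnose(
        torch.backends.cuda.is_built(),
        torch.cuda.is_available(),
        torch.cuda.device_count(),
    )


if __name__ == "__main__":
    print(report())
```

Running this inside the job, next to the nvidia-smi output, shows whether the environment or the allocation is the thing to fix.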