not using GPU on HPC environment

geoffroy-noel-ddh commented 2 weeks ago

This may not be a bug in the tool itself. After installing all the requirements and running the tool on the test folder with moondream, the GPU is not used.

geoffroy-noel-ddh commented 2 weeks ago

Analysis

1. torch correctly detects one CUDA device,
2. but querying the properties of that first device crashes;
3. that error affects bvqa execution, which falls back to the CPU.
4. There is a mismatch between the CUDA version supported by the driver, 12.2 (see nvidia-smi), and the toolkit version actually installed, 11.5 (nvcc -V). Note that running module load cuda updates the latter to 12.2, yet the errors in 1 & 3 persist.
5. Another possibility is that the version of PyTorch (2.5.1+cu124) is not compatible with CUDA 12.2.
6. The installation instructions on the PyTorch site recommend obtaining a CUDA 12.1+ compatible PyTorch build from a specific channel. However, that command fails from the HPC environment; the channel may be firewalled.
7. bvqa works on an A100 node on the HPC. Yet all the versions (python, packages, cuda) except the driver (535.183.01) are the same as on the faulty A30 nodes (erc-hpc-comp190, driver=535.216.01).

Details

(1) erc-hpc-comp190 node with A30

kXXXXXX@erc-hpc-comp190:/scratch/users/kXXXXXX/kdl-vqa$ python -c "import torch; print(torch.cuda.device_count())"
1

(2)

>>> torch.cuda.get_device_properties(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/scratch/users/kXXXXXX/kdl-vqa/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 523, in get_device_properties
    _lazy_init()  # will define _get_device_properties
    ^^^^^^^^^^^^
  File "/scratch/users/kXXXXXX/kdl-vqa/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
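The checks in (1) and (2) can be combined into one probe that records the failure instead of stopping at the traceback. This is only a minimal sketch: probe_cuda is a hypothetical helper, not part of bvqa, and it takes the torch module as an argument so it can be tried with a stand-in object on a machine without a GPU. It uses only calls that already appear in this thread, plus torch.version.cuda.

```python
def probe_cuda(torch_module):
    """Collect CUDA diagnostics from a torch-like module without raising.

    Hypothetical helper for this issue; pass the real `torch` module on a
    compute node.
    """
    report = {
        "torch_version": torch_module.__version__,
        "built_for_cuda": torch_module.version.cuda,
        "device_count": torch_module.cuda.device_count(),
        "devices": [],
        "error": None,
    }
    for index in range(report["device_count"]):
        try:
            props = torch_module.cuda.get_device_properties(index)
            report["devices"].append((index, props.name))
        except RuntimeError as exc:
            # Same exception type as in the traceback above.
            report["error"] = str(exc)
            break
    return report


# Usage on a compute node (requires torch to be installed):
#   import torch
#   print(probe_cuda(torch))
```

On the faulty node this should report a device count of 1, an empty device list and the "CUDA unknown error" message, which is the combination seen in (1) and (2).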

(3)

kXXXXXX@erc-hpc-comp190:/scratch/users/kXXXXXX/kdl-vqa$ python bvqa.py describe -r
  0%|                                                                      | 0/3 [00:00<?, ?it/s]/scratch/users/kXXXXXX/kdl-vqa/venv/lib/python3.11/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
WARNING: running model on CPU
PhiForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
100%|██████████████████████████████████████████████████████████████| 3/3 [02:33<00:00, 51.22s/it]

(4)

Tue Nov 12 18:42:51 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A30                     On  | 00000000:19:00.0 Off |                    0 |
| N/A   30C    P0              30W / 165W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

(5)

python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print([torch.cuda.device(i) for i in range(torch.cuda.device_count())]);"
2.5.1+cu124
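To put the version numbers from (4) and (5) side by side, the strings can be parsed with a couple of regular expressions. The sketch below is an illustration only: the function names are hypothetical, and it assumes the "+cuXYZ" suffix of the PyTorch version and the "CUDA Version:" field of the nvidia-smi header keep the layout shown in this thread. It extracts the numbers and does not decide compatibility; note that the A100 node in (7) reports the same pair of versions and works.

```python
import re


def cuda_from_torch_version(version):
    """Return (major, minor) from a tag such as '2.5.1+cu124', or None."""
    match = re.search(r"\+cu(\d+)(\d)$", version)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_from_nvidia_smi(text):
    """Return (major, minor) from the 'CUDA Version:' field, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


smi_header = "| NVIDIA-SMI 535.216.01   Driver Version: 535.216.01   CUDA Version: 12.2 |"
wheel = cuda_from_torch_version("2.5.1+cu124")
driver = cuda_from_nvidia_smi(smi_header)
print("PyTorch built for CUDA", wheel, "- driver reports CUDA", driver)
```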

(7) A100 on erc-hpc-comp054

$ nvidia-smi 

Tue Nov 12 23:09:39 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:17:00.0 Off |                   On |
| N/A   30C    P0              39W / 400W |     87MiB / 40960MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print([torch.cuda.get_device_properties(i) for i in range(torch.cuda.device_count())]);"
2.5.1+cu124
True
[_CudaDeviceProperties(name='NVIDIA A100-SXM4-40GB MIG 1g.5gb', major=8, minor=0, total_memory=4864MB, multi_processor_count=14, uuid=a3389add-8426-695e-fb0e-e4bf3c584897, L2_cache_size=5MB)]

geoffroy-noel-ddh commented 2 weeks ago

The most likely explanation is that compute node erc-hpc-comp190 is faulty; it has been reported as malfunctioning. That node is the one I get by default when requesting an A30. When I pass --exclude erc-hpc-comp190 to srun, I get an alternative node with an A30, which works well with bvqa.

kXXXXXX@erc-hpc-comp196:/scratch/users/kXXXXXX/kdl-vqa$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print([torch.cuda.get_device_properties(i) for i in range(torch.cuda.device_count())]);"
2.5.1+cu124
True
[_CudaDeviceProperties(name='NVIDIA A30', major=8, minor=0, total_memory=24062MB, multi_processor_count=56, uuid=e9514850-72a7-4c6e-a991-92a457f37aff, L2_cache_size=24MB)]
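The workaround can be scripted so that the faulty node is always skipped. This is a sketch under assumptions: build_srun_command is a hypothetical helper, and the options used to request an A30 are not shown in this thread, so they are left to the caller through extra_args. Only the --exclude option, which accepts a comma-separated list of hosts, comes from the comment above.

```python
import shlex


def build_srun_command(command, excluded_nodes, extra_args=()):
    """Build an srun argument list that skips known-bad nodes.

    Hypothetical helper; GPU request options go in `extra_args`.
    """
    argv = ["srun"]
    if excluded_nodes:
        argv.append("--exclude=" + ",".join(excluded_nodes))
    argv.extend(extra_args)
    argv.extend(command)
    return argv


argv = build_srun_command(
    ["python", "bvqa.py", "describe", "-r"],
    excluded_nodes=["erc-hpc-comp190"],
)
# Show the command line that would be run.
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv) from a login node. Keeping the excluded nodes in a list makes it easy to add further hosts if they turn out to be faulty too.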

kingsdigitallab / kdl-vqa

not using GPU on HPC environment #8
