Closed geoffroy-noel-ddh closed 2 weeks ago
nvidia-smi) and the one actually installed, 11.5 (nvcc -V). Note that a module load cuda will update that to 12.2. Yet the errors in 1 & 3 persist.
(1) erc-hpc-comp190 node with A30
kXXXXXX@erc-hpc-comp190:/scratch/users/kXXXXXX/kdl-vqa$ python -c "import torch; print(torch.cuda.device_count())"
1
(2)
>>> torch.cuda.get_device_properties(0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/scratch/users/kXXXXXX/kdl-vqa/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 523, in get_device_properties
_lazy_init() # will define _get_device_properties
^^^^^^^^^^^^
File "/scratch/users/kXXXXXX/kdl-vqa/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
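Since the error message points at the environment (CUDA_VISIBLE_DEVICES in particular), it may help to record the relevant variables before torch is imported. The snippet below is only a sketch: the helper name cuda_env_snapshot is made up for illustration, and the SLURM variable names are the standard Slurm ones, which may or may not be set on this cluster.

```python
import os

# Variables worth recording before torch initialises CUDA.
# CUDA_VISIBLE_DEVICES is named in the error message; the SLURM_* names are
# standard Slurm job variables (assumption: the job runs under srun/sbatch).
WATCHED_VARS = ("CUDA_VISIBLE_DEVICES", "SLURM_JOB_ID", "SLURMD_NODENAME")


def cuda_env_snapshot(environ=None):
    """Return the watched variables as a dict (value is None when unset)."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in WATCHED_VARS}


if __name__ == "__main__":
    for key, value in cuda_env_snapshot().items():
        print(f"{key}={value}")
```

Running this on both a failing and a working node and comparing the output would show whether the environment differs between them.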
(3)
kXXXXXX@erc-hpc-comp190:/scratch/users/kXXXXXX/kdl-vqa$ python bvqa.py describe -r
0%| | 0/3 [00:00<?, ?it/s]/scratch/users/kXXXXXX/kdl-vqa/venv/lib/python3.11/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
WARNING: running model on CPU
PhiForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
- If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
- If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
- If you are not the owner of the model architecture class, please contact the model code owner to update it.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
100%|██████████████████████████████████████████████████████████████| 3/3 [02:33<00:00, 51.22s/it]
(4)
Tue Nov 12 18:42:51 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01 Driver Version: 535.216.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A30 On | 00000000:19:00.0 Off | 0 |
| N/A 30C P0 30W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
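For comparing nodes, the three version numbers involved (the CUDA version reported by the driver, the nvcc toolkit release, and the CUDA build tag in the torch version string) can be pulled out of the outputs with a few regular expressions. This is a rough sketch; the function names are hypothetical and the sample strings are copied from the logs in this thread. Note that the erc-hpc-comp054 node reports the same 12.2 / 11.5 pair and works, which suggests the nvcc version is probably not what torch relies on here.

```python
import re


def parse_driver_cuda(smi_text):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_nvcc_release(nvcc_text):
    """Extract (major, minor) from the 'release X.Y' field of nvcc -V."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_torch_build(torch_version):
    """Map '2.5.1+cu124' to (12, 4); None for builds without a +cuXYZ tag."""
    match = re.search(r"\+cu(\d+)(\d)$", torch_version)
    return (int(match.group(1)), int(match.group(2))) if match else None


if __name__ == "__main__":
    smi = "| NVIDIA-SMI 535.216.01 Driver Version: 535.216.01 CUDA Version: 12.2 |"
    nvcc = "Cuda compilation tools, release 11.5, V11.5.119"
    print(parse_driver_cuda(smi))            # (12, 2)
    print(parse_nvcc_release(nvcc))          # (11, 5)
    print(parse_torch_build("2.5.1+cu124"))  # (12, 4)
```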
(5)
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print([torch.cuda.device(i) for i in range(torch.cuda.device_count())]);"
2.5.1+cu124
(7) A100 on erc-hpc-comp054
$ nvidia-smi
Tue Nov 12 23:09:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:17:00.0 Off | On |
| N/A 30C P0 39W / 400W | 87MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print([torch.cuda.get_device_properties(i) for i in range(torch.cuda.device_count())]);"
2.5.1+cu124
True
[_CudaDeviceProperties(name='NVIDIA A100-SXM4-40GB MIG 1g.5gb', major=8, minor=0, total_memory=4864MB, multi_processor_count=14, uuid=a3389add-8426-695e-fb0e-e4bf3c584897, L2_cache_size=5MB)]
The most likely explanation is a faulty compute node: erc-hpc-comp190 has been reported as malfunctioning. That is the node I get by default when requesting an A30. When I pass --exclude erc-hpc-comp190 to srun, I get an alternative node with an A30, which works well with bvqa.
kXXXXXX@erc-hpc-comp196:/scratch/users/kXXXXXX/kdl-vqa$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print([torch.cuda.get_device_properties(i) for i in range(torch.cuda.device_count())]);"
2.5.1+cu124
True
[_CudaDeviceProperties(name='NVIDIA A30', major=8, minor=0, total_memory=24062MB, multi_processor_count=56, uuid=e9514850-72a7-4c6e-a991-92a457f37aff, L2_cache_size=24MB)]
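To make this check repeatable across nodes, the one-liner above could be wrapped so that it records the host name and catches the RuntimeError instead of printing a traceback. This is a minimal sketch, not part of bvqa; the function name cuda_report is made up, and torch is assumed to be installed in the active venv (the function degrades gracefully if it is not).

```python
import socket


def cuda_report():
    """Collect CUDA facts for the current node without raising on a bad GPU."""
    report = {
        "host": socket.gethostname(),
        "torch": None,
        "available": False,
        "devices": [],
        "error": None,
    }
    try:
        import torch  # assumption: installed in the active environment
    except ImportError as exc:
        report["error"] = f"torch not importable: {exc}"
        return report

    report["torch"] = torch.__version__
    try:
        report["available"] = torch.cuda.is_available()
        for index in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(index)
            # total_memory is reported in bytes
            report["devices"].append((props.name, props.total_memory))
    except RuntimeError as exc:
        # e.g. the "CUDA unknown error" seen on erc-hpc-comp190
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

On a healthy node the error field should stay None and the devices list should be non-empty; on a node like erc-hpc-comp190 the error field would be expected to carry the CUDA initialization message.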
This may not be a bug in the tool itself. After installing all the requirements and running the tool on the test folder with moondream, the GPU is not used.