turian closed 1 month ago
This has happened to me before. It might be because the CUDA version on the machine doesn't match the CUDA version your installed torch build was compiled for. Can you run
nvidia-smi
and check the CUDA version, and then run
pip list | grep torch
and check if the versions match?
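The same comparison can also be made from inside Python. This is only a minimal sketch: `torch_cuda_report` is a made-up helper name, and it assumes nothing beyond `torch.__version__`, `torch.version.cuda` and `torch.cuda.is_available()`. If torch isn't installed at all, it just reports that.

```python
import importlib.util


def torch_cuda_report():
    """Summarize the installed torch build and whether it can see a GPU."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch  # imported lazily so the script still runs without torch

    return {
        "installed": True,
        "torch_version": torch.__version__,    # e.g. "2.3.0+cu121"
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in torch_cuda_report().items():
        print(f"{key}: {value}")
```

If `built_for_cuda` is None, or `cuda_available` is False on a machine where nvidia-smi shows a GPU, the torch build is the likely suspect rather than the driver.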
ubuntu@209-20-157-222:~$ nvidia-smi
Sat Sep 28 09:26:12 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               On  | 00000000:08:00.0 Off |                    0 |
| N/A   31C    P0              47W / 350W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
ubuntu@209-20-157-222:~$ pip list | grep torch
torch 2.0.1
Is there a way to address this?
Okay great, it seems like you don't have a CUDA-enabled torch installation. On my machine, when I run
pip list | grep torch
I get -
torch 2.3.0+cu121
torchaudio 2.3.0+cu121
torchvision 0.18.0+cu121
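The `+cu121` suffix is what I'm reading the CUDA version from. As a rough sketch (`cuda_tag` is a hypothetical helper, not part of pip or torch), the tag can be pulled out of a version string like this. Note that a missing tag may not be conclusive by itself, so treat None as "check further" rather than as proof of a CPU-only build.

```python
import re


def cuda_tag(version):
    """Return the CUDA version encoded in a wheel's local tag, or None.

    "2.3.0+cu121" -> "12.1"; "2.0.1" -> None (no tag present).
    """
    match = re.search(r"\+cu(\d{2,})$", version.strip())
    if match is None:
        return None
    digits = match.group(1)
    # The last digit is the minor version, the rest is the major version.
    return f"{digits[:-1]}.{digits[-1]}"
```

For the two outputs in this thread, `cuda_tag("2.3.0+cu121")` gives "12.1" and `cuda_tag("2.0.1")` gives None.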
For your CUDA version (12.2) I would suggest trying -
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
Or check here for other versions. I am not sure whether the 12.1 or the 12.4 build will work with 12.2.
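If you want to try more than one build, the index URL in the command above can be swapped per CUDA version. A small sketch, assuming other versions follow the same `.../whl/cuXYZ` pattern as the cu124 URL (not every CUDA version necessarily has an index); both helper names are made up for illustration, and nothing is actually installed.

```python
def wheel_index_url(cuda_version):
    """Build the index URL for a CUDA version string such as "12.4"."""
    major, minor = cuda_version.split(".")
    # Assumption: same URL pattern as the cu124 index used above.
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


def install_command(packages, cuda_version):
    """Return the pip command as an argument list, without running it."""
    return ["pip", "install", *packages, "--index-url", wheel_index_url(cuda_version)]


if __name__ == "__main__":
    pinned = ["torch==2.4.0", "torchvision==0.19.0", "torchaudio==2.4.0"]
    print(" ".join(install_command(pinned, "12.4")))
```

Run as a script, this prints the same command as above, so you can change "12.4" to "12.1" and compare. The pinned package versions may need to change for a different index.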
Thank you, running
pip install -U torch torchvision torchaudio
fixed it.
I provision a Lambda Labs H100 machine and run:
Seems to work fine but then when I try to use: