pytorch / ao

PyTorch native quantization and sparsity for training and inference
BSD 3-Clause "New" or "Revised" License

CUDA environment errors after installing ao #974

Open philipbutler opened 1 month ago

philipbutler commented 1 month ago

I was running a very simple PyTorch program that worked fine before, but after running pip install torchao, I encountered:

Traceback (most recent call last):
  File "/home/phil/Dev/ao/phil/test.py", line 8, in <module>
    a = tensor([0.8477, 0.3092, 0.2363, 0.2300], device='cuda') 
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I tried uninstalling PyTorch and then installing the nightly (Linux, Ubuntu 24.04, CUDA 12.2), and was met with a different error:

Traceback (most recent call last):
  File "/home/phil/Dev/ao/phil/test.py", line 8, in <module>
    a = tensor([0.8477, 0.3092, 0.2363, 0.2300], device='cuda')
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/phil/Dev/.envt/lib/python3.12/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

Lastly, I ran pip uninstall torchao, but I still get the same error as above :(

msaroufim commented 1 month ago

What CUDA version do you have installed?

You'll need to install the build that matches it:

pip install torchao --extra-index-url https://download.pytorch.org/whl/cu121 # full options are cpu/cu118/cu121/cu124

philipbutler commented 1 month ago

Oh, apparently I had 12.2. Would that be an issue?

msaroufim commented 4 weeks ago

Yeah, I'd recommend you install everything in a fresh conda environment and try again with CUDA versions that match exactly.

Otherwise, installing ao from source should always work.
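
For anyone landing here with the same mismatch, a minimal sketch of the "match exactly" check might look like the following. It assumes the wheel tag should match the CUDA version the installed PyTorch wheel was built against, which `torch.version.cuda` reports (this is not the same as the driver or system toolkit version). The helper names `wheel_tag`, `install_command`, and `report` are hypothetical, and the tag list is just the one quoted above, so it may be out of date.

```python
import importlib.util

# Wheel tags quoted earlier in this thread; assumed, and possibly outdated.
KNOWN_TAGS = ("cpu", "cu118", "cu121", "cu124")
INDEX_BASE = "https://download.pytorch.org/whl/"


def wheel_tag(cuda_version):
    """Hypothetical helper: map a CUDA version such as '12.1' to a wheel tag.

    Returns 'cpu' when cuda_version is None (a CPU-only PyTorch build) and
    None when there is no exact match among KNOWN_TAGS.
    """
    if cuda_version is None:
        return "cpu"
    parts = cuda_version.split(".")
    if len(parts) < 2:
        return None
    tag = "cu" + parts[0] + parts[1]
    return tag if tag in KNOWN_TAGS else None


def install_command(cuda_version):
    """Build the pip command from this thread for a matching tag, else None."""
    tag = wheel_tag(cuda_version)
    if tag is None:
        return None
    return f"pip install torchao --extra-index-url {INDEX_BASE}{tag}"


def report():
    """Print what the current environment looks like, if torch is installed."""
    if importlib.util.find_spec("torch") is None:
        print("torch is not installed in this environment")
        return
    import torch

    print("torch version:", torch.__version__)
    # CUDA version the wheel was built with; None for CPU-only builds.
    print("torch built with CUDA:", torch.version.cuda)
    print("CUDA available:", torch.cuda.is_available())
    print("suggested install:", install_command(torch.version.cuda))


if __name__ == "__main__":
    report()
```

With this sketch, a version like 12.2 has no exact tag in the list, so `install_command("12.2")` returns None. In that case the suggestions above still apply: start from a fresh environment with a PyTorch build whose CUDA version matches one of the tags, or install ao from source.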