pytorch / ao

PyTorch native quantization and sparsity for training and inference
BSD 3-Clause "New" or "Revised" License

Device error on 8/31 nightlies #795

Closed: ebsmothers closed this 1 week ago

ebsmothers commented 2 weeks ago

Installing recent nightlies of PyTorch and ao is resulting in some CUDA device errors.

With the nightlies from 8/30 there are no problems:

conda create -n ao-08-30 python=3.11
conda activate ao-08-30
pip install --pre torch==2.5.0.dev20240830+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
pip install --pre torchao==0.5.0.dev20240830+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
python3
>>> import torch
>>> from torchao.dtypes.nf4tensor import NF4Tensor
>>> torch.empty(0, device=torch.device('cuda:0'))
tensor([], device='cuda:0')

But with 8/31 nightlies, I see the following:

conda create -n ao-08-31 python=3.11
conda activate ao-08-31
pip install --pre torch==2.5.0.dev20240831+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
pip install --pre torchao==0.5.0.dev20240831+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
python3
>>> import torch
>>> from torchao.dtypes.nf4tensor import NF4Tensor
>>> torch.empty(0, device=torch.device('cuda:0'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Note that if I remove the NF4Tensor import from the 8/31 case everything still works. Is this related to #790? If so, what's the recommendation? Just force installation of 8/29 PyTorch nightly? (This is relevant for our nightly builds as well)
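
For anyone who wants to check this in their own environment, here is a minimal sketch that runs the allocation from the report in two fresh interpreters, once without and once with the NF4Tensor import, so that one run cannot leave CUDA state behind for the other. The helper names (`build_snippet`, `run_snippet`, `compare_imports`) are hypothetical and not part of torch or torchao; only the two statements under test come from the report.

```python
import subprocess
import sys

# The two statements below are copied from the report; everything else is a
# hypothetical helper written for illustration.
NF4_IMPORT = "from torchao.dtypes.nf4tensor import NF4Tensor"
ALLOC_STMT = "torch.empty(0, device=torch.device('cuda:0'))"


def build_snippet(with_nf4_import: bool) -> str:
    """Return the source of a small repro program, with or without the import."""
    lines = ["import torch"]
    if with_nf4_import:
        lines.append(NF4_IMPORT)
    lines.append(ALLOC_STMT)
    return "\n".join(lines)


def run_snippet(source: str, timeout: float = 300.0) -> tuple[int, str]:
    """Run `source` in a fresh interpreter and return (exit code, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def compare_imports() -> dict[str, tuple[int, str]]:
    """Run the repro without and with the NF4Tensor import."""
    cases = (("without NF4Tensor import", False), ("with NF4Tensor import", True))
    return {label: run_snippet(build_snippet(flag)) for label, flag in cases}
```

Calling `compare_imports()` in each conda environment and comparing the exit codes should show whether the import alone is enough to trigger the failure; a nonzero exit code with the `RuntimeError` in stderr for only the "with" case would match what is described above.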

msaroufim commented 2 weeks ago

AFK today, but the most likely culprit is a problem in core. What I chose to do in ao for now is pin to a specific PyTorch version until we figure this out, so the AO nightlies are working with a pinned version of torch. The main fishy error we saw in our CI had to do with fpx (https://github.com/pytorch/ao/issues/792), so @jerryzh168 can confirm when he comes into work.

Ideally we should fix this before making a release. cc @andrewor14
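
For downstream nightly builds that want a guard while a pin is in place, a rough sketch is below. It parses the date out of a nightly version string such as `2.5.0.dev20240830+cu121` and flags anything newer than a chosen last-known-good date. The default of `20240830` is an assumption taken from the report in this thread; the exact version ao pinned to is not stated here, and the function names are hypothetical.

```python
import re
from importlib import metadata
from typing import Optional

# Matches the ".devYYYYMMDD" segment of a nightly version string.
_NIGHTLY_DATE = re.compile(r"\.dev(\d{8})")


def nightly_date(version: str) -> Optional[str]:
    """Return the YYYYMMDD stamp of a nightly version, or None for non-nightlies."""
    match = _NIGHTLY_DATE.search(version)
    return match.group(1) if match else None


def is_newer_than_last_good(version: str, last_good: str = "20240830") -> bool:
    """True if `version` is a nightly dated after `last_good` (assumed value)."""
    date = nightly_date(version)
    # YYYYMMDD strings of equal length compare correctly as plain strings.
    return date is not None and date > last_good


def installed_torch_version() -> Optional[str]:
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None
```

A CI step could call `is_newer_than_last_good(installed_torch_version() or "")` and fail early with a clear message instead of surfacing the opaque CUDA error. The check is only a heuristic for the window in which the regression is live, so the guard should be removed once a fixed nightly is available.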

drisspg commented 2 weeks ago

https://github.com/pytorch/pytorch/issues/135126 The offending PR has been reverted on main

ebsmothers commented 1 week ago

Just coming back to this now. After the revert, I think this should be good to close.