huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0

qint4 inference fails with RuntimeError: Cannot set version_counter for inference tensor #304

Open · BenjaminBossan opened 2 months ago

BenjaminBossan commented 2 months ago

I'm getting an unexpected error when running inference with a quanto-quantized model. I've installed optimum-quanto from main (e7011ab94ea5a002019e6aa9a0b1e2a37e8eed35). Reproducer:

```python
import torch
import peft
from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, qint4, qint8

model_id = "hf-internal-testing/tiny-random-OPTForCausalLM"
device = "cuda"  # cpu also fails
weights = qint4  # qint8 works!
model = AutoModelForCausalLM.from_pretrained(model_id)
quantize(model, weights=weights)
model = model.to(device)
inputs = torch.ones(5).view(-1, 1).long().to(device)

model(inputs)      # calling without inference_mode works
with torch.inference_mode():
    model(inputs)  # raises RuntimeError
```

This results in:

```
RuntimeError: Cannot set version_counter for inference tensor
```

Full error

```
/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
/home/name/work/forks/optimum-quanto/optimum/quanto/library/ops.py:66: UserWarning: An exception was raised while calling the optimized kernel for quanto::unpack: /home/name/anaconda3/envs/peft/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /home/name/work/forks/optimum-quanto/optimum/quanto/library/extensions/cuda/build/quanto_cuda.so) Falling back to default implementation.
  warnings.warn(message + " Falling back to default implementation.")
Traceback (most recent call last):
  File "/home/name/work/forks/peft/foo.py", line 17, in <module>
    model(inputs)  # raises RuntimeError
    ^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/opt/modeling_opt.py", line 1011, in forward
    outputs = self.model.decoder(
              ^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/opt/modeling_opt.py", line 777, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/opt/modeling_opt.py", line 418, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/transformers/models/opt/modeling_opt.py", line 140, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/optimum-quanto/optimum/quanto/nn/qlinear.py", line 50, in forward
    return torch.nn.functional.linear(input, self.qweight, bias=self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/optimum-quanto/optimum/quanto/tensor/qbits/qbits.py", line 282, in __torch_function__
    return qlinear(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/optimum-quanto/optimum/quanto/tensor/qbits/qbits.py", line 280, in qlinear
    return QuantizedLinearFunction.apply(input, other, bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/optimum-quanto/optimum/quanto/tensor/function.py", line 44, in forward
    output = torch.matmul(input, other.t())
             ^^^^^^^^^
  File "/home/name/work/forks/optimum-quanto/optimum/quanto/tensor/qbits/qbits.py", line 288, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Cannot set version_counter for inference tensor
```

This is on an NVIDIA RTX 4090, with:

```
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
```

The same error occurs on CPU as well.

dacorvo commented 2 months ago

I tried on an AWS A10 and could not reproduce the issue.

BenjaminBossan commented 2 months ago

I was afraid it could be something device- or driver-specific. Did you also try on CPU?

dacorvo commented 2 months ago

It is probably related to https://github.com/pytorch/pytorch/issues/112024. I am not sure why it does not trigger the error on my setup.
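For context, tensors created under inference_mode are "inference tensors" and carry no version counter at all, which is what the error message is about. A quick illustration in plain torch, with no quanto involved:

```python
import torch

with torch.inference_mode():
    t = torch.ones(3)  # t is an inference tensor: it has no version counter

print(t.is_inference())  # True
# Any operation that needs to touch the version counter fails, e.g. an
# in-place update outside of inference mode raises a RuntimeError:
t.add_(1)
```

quanto presumably trips over this because QuantizedLinearFunction is a custom autograd.Function, and autograd wants to manage version counters on the tensors flowing through it.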

BenjaminBossan commented 2 months ago

Could be. Feel free to close; I can work around it on my side (see the sketch below). I just wanted to bring it up in case there is something that can be done in quanto.
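For reference, the workaround I have in mind is to use torch.no_grad instead, which disables gradient tracking without creating inference tensors. A sketch, reusing model and inputs from the reproducer above:

```python
# torch.no_grad() avoids autograd overhead but, unlike inference_mode,
# produces ordinary tensors that still carry version counters, so the
# quanto custom autograd.Function can run.
with torch.no_grad():
    model(inputs)  # works where inference_mode fails
```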

dacorvo commented 2 months ago

I don't know: maybe it can be worked around, perhaps with help from @ezyang or @albanD. Let's keep it open for now.

ezyang commented 2 months ago

try a pytorch nightly if you can

BenjaminBossan commented 2 months ago

Thanks, I tried a torch 2.5.0 nightly and the error did indeed go away (both CUDA and CPU). Then I went back to the previous env I had used to check the error and could not reproduce it anymore. Using torch 2.5.0 triggered a recompilation, so I'm not sure whether that explains it or whether there was another reason.

Anyway, I'll close this for now and re-open if I run into the error again. Thanks everyone for the help.

BenjaminBossan commented 3 weeks ago

I'm running into this issue again with the released torch 2.5.0. Reproducer:

```python
import torch
from transformers import QuantoConfig, AutoModelForCausalLM

# Quantize the model weights to int2 via quanto at load time
quant_config = QuantoConfig("int2")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", quantization_config=quant_config)
inputs = torch.arange(10).view(-1, 1)
with torch.inference_mode():
    model(inputs)  # RuntimeError: Cannot set version_counter for inference tensor
```

Using a fresh environment with:

```
$ pip freeze | rg "torch|transformers|accelerate|optimum"
accelerate==1.0.1
optimum-quanto==0.2.5
torch==2.5.0
transformers==4.46.0
```
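In the meantime, the no_grad workaround from above presumably still applies, since the failure only shows up under inference_mode:

```python
# Continuing from the reproducer above: no_grad does not create
# inference tensors, so the quanto autograd function can run.
with torch.no_grad():
    model(inputs)
```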