mit-han-lab / smoothquant

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
https://arxiv.org/abs/2211.10438
MIT License

failed to run int8 opt #59

Closed jackzhou121 closed 1 year ago

jackzhou121 commented 1 year ago

I used the following code to run int8 OPT:

```python
import torch
from transformers import AutoTokenizer

from opt import Int8OPTForCausalLM

device = torch.device("cuda")

model = Int8OPTForCausalLM.from_pretrained(
    "/workspace/opt-models/opt1.3b-int8-models", device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("/workspace/opt-models/opt-1.3b")

prompt = "Hey"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs.input_ids.to(device)

generate_ids = model.generate(input_ids, max_length=64)

output = tokenizer.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output)
```

and the error message is:

```
Traceback (most recent call last):
  File "int8_inference.py", line 17, in <module>
    generate_ids = model.generate(input_ids, max_length=64)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 1602, in generate
    return self.greedy_search(
  File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 2450, in greedy_search
    outputs = self(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
    outputs = self.model.decoder(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/opt_smoothquant/opt.py", line 384, in forward
    output = self.old_forward(
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 646, in forward
    raise ValueError(
ValueError: The provided attention mask has length 20, but its length should be 32 (sum of the lengths of current and past inputs)
```

llCurious commented 1 year ago

Hey @jackzhou121, I ran into the same problem. I think it may be caused by the pad-to-16 behavior. Do you have any ideas on how to fix it?
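
For what it's worth, the numbers in the error message fit that guess: 32 is exactly 20 rounded up to the next multiple of 16. A tiny sketch of the assumed arithmetic (the alignment constant 16 is inferred from the error, not read from the repo's code):

```python
# If the int8 decoder pads the sequence length up to a multiple of 16 for its
# kernels, while generate() builds the attention mask from the unpadded length,
# the length check in modeling_opt.py fails with exactly the reported numbers.
def pad_to_multiple(n: int, multiple: int = 16) -> int:
    return ((n + multiple - 1) // multiple) * multiple

mask_len = 20                         # mask length generate() produced
expected = pad_to_multiple(mask_len)  # padded length the decoder expects

print(f"The provided attention mask has length {mask_len}, "
      f"but its length should be {expected}")  # 20 vs 32
```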

Hao-YunDeng commented 7 months ago

Hi @llCurious and @jackzhou121, I'm having the same issue. Did you manage to fix it?
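
In case it helps, here is a sketch of the first workaround I would try under the pad-to-16 guess above: left-pad the prompt and its attention mask to a multiple of 16 before calling `generate()`, so the first forward pass already sees matching lengths. The model and tokenizer paths are copied from the original post; the alignment constant and the approach itself are assumptions, and later decoding steps may still trip the same check.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer

from opt import Int8OPTForCausalLM

device = torch.device("cuda")
tokenizer = AutoTokenizer.from_pretrained("/workspace/opt-models/opt-1.3b")
model = Int8OPTForCausalLM.from_pretrained(
    "/workspace/opt-models/opt1.3b-int8-models", device_map="cuda"
)

inputs = tokenizer("Hey", return_tensors="pt")
input_ids = inputs.input_ids.to(device)
attention_mask = inputs.attention_mask.to(device)

# Round the prompt length up to the next multiple of 16 (assumed alignment).
pad = -input_ids.shape[1] % 16
if pad:
    pad_id = tokenizer.pad_token_id
    if pad_id is None:
        pad_id = tokenizer.eos_token_id
    # Left-pad so generation still continues from the real prompt tokens;
    # the padded positions are masked out with zeros in the attention mask.
    input_ids = F.pad(input_ids, (pad, 0), value=pad_id)
    attention_mask = F.pad(attention_mask, (pad, 0), value=0)

generate_ids = model.generate(input_ids, attention_mask=attention_mask, max_length=64)
print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0])
```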