Hey, @jackzhou121. I encountered the same problem. I think it may be caused by the padding-to-16 behavior. Do you have any ideas on how to fix it?
Hi @llCurious and @jackzhou121, I'm having the same issue. Did you fix it?
I used the following code to run the int8 OPT model:

```python
from transformers import AutoTokenizer, OPTForCausalLM
import torch

from opt import Int8OPTForCausalLM

device = torch.device("cuda")
model = Int8OPTForCausalLM.from_pretrained("/workspace/opt-models/opt1.3b-int8-models", device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained("/workspace/opt-models/opt-1.3b")

prompt = "Hey"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs.input_ids.to(device)

generate_ids = model.generate(input_ids, max_length=64)
output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
```

and the error message is:

```
Traceback (most recent call last):
  File "int8_inference.py", line 17, in <module>
    generate_ids = model.generate(input_ids, max_length=64)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 1602, in generate
    return self.greedy_search(
  File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 2450, in greedy_search
    outputs = self(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
    outputs = self.model.decoder(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/opt_smoothquant/opt.py", line 384, in forward
    output = self.old_forward(
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 646, in forward
    raise ValueError(
ValueError: The provided attention mask has length 20, but its length should be 32 (sum of the lengths of current and past inputs)
```
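
If the padding-to-16 behavior mentioned above really is the cause, would something like the following work around it? This is only an untested sketch on my side (the choice of 16 is just a guess based on the earlier comment): it left-pads `input_ids` and `attention_mask` to a multiple of 16 before calling `generate`, reusing `model` and `tokenizer` from the snippet above.

```python
import torch

prompt = "Hey"
inputs = tokenizer(prompt, return_tensors="pt")

# Round the sequence length up to the next multiple of 16 (assumed kernel requirement).
pad_to = 16
seq_len = inputs.input_ids.shape[1]
pad_len = (pad_to - seq_len % pad_to) % pad_to

if pad_len > 0:
    pad_ids = torch.full((1, pad_len), tokenizer.pad_token_id, dtype=inputs.input_ids.dtype)
    pad_mask = torch.zeros((1, pad_len), dtype=inputs.attention_mask.dtype)
    # Prepend the padding so the real tokens stay at the end
    # (left padding, which decoder-only generation expects).
    input_ids = torch.cat([pad_ids, inputs.input_ids], dim=1)
    attention_mask = torch.cat([pad_mask, inputs.attention_mask], dim=1)
else:
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask

generate_ids = model.generate(
    input_ids.to(device),
    attention_mask=attention_mask.to(device),
    max_length=64,
)
print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0])
```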