mit-han-lab / smoothquant

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
https://arxiv.org/abs/2211.10438
MIT License

how to use model.generate with smoothquant models #82

Hao-YunDeng opened this issue 6 months ago (status: Open)

Hao-YunDeng commented 6 months ago

I did

import torch
from transformers import GPT2Tokenizer
from smoothquant.opt import Int8OPTForCausalLM

tokenizer = GPT2Tokenizer.from_pretrained('facebook/opt-6.7b')
model_smoothquant = Int8OPTForCausalLM.from_pretrained('mit-han-lab/opt-6.7b-smoothquant', torch_dtype=torch.float16, device_map='auto').to('cuda')

text = "The quick brown fox"
input_ids = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512).input_ids.to('cuda')

generated_ids = model_smoothquant.generate(input_ids, max_length=32) 

but got

ValueError: The provided attention mask has length 21, but its length should be 32 (sum of the lengths of current and past inputs)

Does anyone know how to correctly call generate() on SmoothQuant models?
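
In the meantime, here is a minimal sketch of a possible workaround, assuming the error comes from generate()'s past-key-value / attention-mask bookkeeping: decode greedily token by token, recomputing the full sequence on each step so no cache is involved. The use_cache keyword is assumed to be accepted by Int8OPTForCausalLM's forward as in the Hugging Face OPT implementation; drop it if it is not. This is slower than cached generation and is not a confirmed fix for this repo.

import torch
from transformers import GPT2Tokenizer
from smoothquant.opt import Int8OPTForCausalLM

tokenizer = GPT2Tokenizer.from_pretrained('facebook/opt-6.7b')
model = Int8OPTForCausalLM.from_pretrained(
    'mit-han-lab/opt-6.7b-smoothquant', torch_dtype=torch.float16
).to('cuda')
model.eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids.to('cuda')

max_new_tokens = 32
with torch.no_grad():
    for _ in range(max_new_tokens):
        # Full forward pass over the whole sequence each step (no KV cache);
        # use_cache=False is an assumption based on the HF OPT forward signature.
        logits = model(input_ids=input_ids, use_cache=False).logits
        # Greedy decoding: pick the most likely next token at the last position.
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))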