microsoft / LLMLingua

To speed up LLM inference and enhance LLMs' perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

error on chatglm3-6b-32k #82

Closed — yuemengrui closed this 5 months ago

yuemengrui commented 5 months ago

```
File "/Users/yuemengrui/.cache/huggingface/modules/transformers_modules/chatglm3-6b-32k/modeling_chatglm.py", line 822, in forward
    full_attention_mask = self.get_masks(input_ids, past_key_values, padding_mask=attention_mask)
File "/Users/yuemengrui/.cache/huggingface/modules/transformers_modules/chatglm3-6b-32k/modeling_chatglm.py", line 691, in get_masks
    full_attention_mask = full_attention_mask * padding_mask.unsqueeze(1)
RuntimeError: The size of tensor a (909) must match the size of tensor b (534) at non-singleton dimension 2
```
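For context, the failure in `get_masks` is a plain broadcasting mismatch: the full attention mask is built from `input_ids` (sequence length 909 here) while the `padding_mask` passed in has a shorter sequence dimension (534). A minimal sketch of the same failure, with the shapes taken from the traceback (the tensor contents are placeholders):

```python
import torch

# Shapes taken from the traceback; ones() stands in for the real masks.
full_attention_mask = torch.ones(1, 909, 909)  # derived from input_ids
padding_mask = torch.ones(1, 534)              # shorter attention_mask

try:
    # Mirrors modeling_chatglm.py line 691: (1, 909, 909) * (1, 1, 534)
    full_attention_mask * padding_mask.unsqueeze(1)
except RuntimeError as e:
    print(e)  # size mismatch at non-singleton dimension 2
```

This suggests the `attention_mask` handed to ChatGLM3's `forward` was built against a different (shorter) token sequence than `input_ids`.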

iofu728 commented 5 months ago

Hi @yuemengrui,

I haven't closely examined this error, but from a preliminary analysis I suspect it is due to differences in how ChatGLM3 implements the attention_mask. You might try commenting out https://github.com/microsoft/LLMLingua/blob/main/llmlingua/prompt_compressor.py#L133 to see whether that resolves the issue.