liHai001 opened 7 months ago
Hi, I'm unsure whether the current implementation is compatible with GPTQ.
@liHai001 @Viol2000
Hi guys,
I've just tested LookaheadDecoding with a few lines of code, using AutoGPTQ to load a GPTQ model instead of the HF Transformers loader. A rough sketch of the setup is below, followed by the error output.
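For context, this is roughly what the glue code looked like. It is a sketch under assumptions: the model path is a placeholder, and it uses lade's documented `augment_all()`/`config_lade()` entry points rather than text-generation-webui's own integration.

```python
# Rough sketch of the failing setup (model path is a placeholder; assumes
# lade's documented augment_all()/config_lade() API from its README).
import lade
lade.augment_all()  # patch transformers' greedy_search with the lade proxy
lade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7, DEBUG=0)

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "TheBloke/CodeLlama-7B-GPTQ"  # placeholder GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir, device="cuda:0", use_safetensors=True
)

inputs = tokenizer("def fib(n):", return_tensors="pt").to("cuda:0")
# AutoGPTQ's generate() forwards to the underlying HF model, which is where
# the patched greedy_search (and the crash below) is reached.
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0]))
```

Running this yields the log below.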
```
lade augment_llama!
lade init!
/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:381: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.7` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:386: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:76: UserWarning: Calling `transformers.models.llama.modeling_llama._prepare_4d_attention_mask` is deprecated and will be removed in v4.37. Use `transformers.modeling_attn_mask_utils.AttentionMaskConverter._prepare_4d_attention_mask`
  warnings.warn(
Traceback (most recent call last):
  File "/root/CodeSpace/text-generation-webui/modules/callbacks.py", line 57, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/root/CodeSpace/text-generation-webui/modules/text_generation.py", line 361, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 443, in generate
    return self.model.generate(**kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1673, in generate
    return self.greedy_search(
  File "/root/CodeSpace/text-generation-webui/repositories/LookaheadDecoding/lade/decoding.py", line 23, in greedy_search_proxy
    return jacobi_greedy_search_multilevel(self, chat=False, *args, **kwargs)
  File "/root/CodeSpace/text-generation-webui/repositories/LookaheadDecoding/lade/decoding.py", line 278, in jacobi_greedy_search_multilevel
    outputs = self.jforward_multilevel(
  File "/root/CodeSpace/text-generation-webui/repositories/LookaheadDecoding/lade/models/llama.py", line 383, in jforward_multilevel
    outputs = self.model.LlamaModeljforward(
  File "/root/CodeSpace/text-generation-webui/repositories/LookaheadDecoding/lade/models/llama.py", line 198, in LlamaModeljforward
    attention_mask = self.j_prepare_decoder_attention_mask(
  File "/root/CodeSpace/text-generation-webui/repositories/LookaheadDecoding/lade/models/llama.py", line 119, in j_prepare_decoder_attention_mask
    expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 79, in _expand_mask
    return AttentionMaskConverter._prepare_4d_attention_mask(mask=mask, dtype=dtype, tgt_len=tgt_len)
AttributeError: type object 'AttentionMaskConverter' has no attribute '_prepare_4d_attention_mask'
```
This seems to be an issue with HF Transformers. I am using transformers 4.35.2, which does not match the version pinned in the requirements (transformers==4.34.0).
Falling back to transformers==4.34.0 (`pip install transformers==4.34.0`) resolved the issue.
Disclaimer: the serving platform I used, text-generation-webui, does not display detailed performance metrics (time to first token, inference speed, etc.), so I cannot judge whether LookaheadDecoding actually speeds up AutoGPTQ; a quick way to check outside the UI is sketched below.
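Since the UI hides these numbers, one rough way to compare is to time `generate()` directly in a script. A minimal sketch, reusing the hypothetical `model` and `tokenizer` from the setup above:

```python
# Hypothetical micro-benchmark: time generate() directly since the web UI
# does not report tokens/sec. Assumes `model`/`tokenizer` from the sketch above.
import time
import torch

inputs = tokenizer("def fib(n):", return_tensors="pt").to("cuda:0")
torch.cuda.synchronize()
t0 = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
dt = time.time() - t0
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {dt:.2f}s -> {new_tokens / dt:.1f} tok/s")
```

Running it once with lade enabled and once without should show whether LookaheadDecoding changes throughput on the GPTQ model.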
`transformers` updated its attention mask handling in v4.35.0. We do not support it yet, but we are planning to.
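Until official support lands, one possible workaround is a version dispatch at import time. This is only a sketch, not the project's actual fix: it assumes the helper's new home is `transformers.modeling_attn_mask_utils` (as the deprecation warning in the log suggests) and that older releases still export `_expand_mask` from `modeling_llama`.

```python
# Sketch of a version-dispatch shim (an assumption, not LookaheadDecoding's
# actual fix): pick whichever mask-expansion helper the installed
# transformers release provides.
try:
    # transformers >= 4.35 moved the helper into modeling_attn_mask_utils
    from transformers.modeling_attn_mask_utils import (
        _prepare_4d_attention_mask as _expand_mask,
    )
except ImportError:
    # transformers <= 4.34 still exposes _expand_mask inside modeling_llama
    from transformers.models.llama.modeling_llama import _expand_mask

# Both variants take (mask, dtype, tgt_len=None), so
# j_prepare_decoder_attention_mask could call _expand_mask(...) unchanged.
```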
It seems that there is no latency reduction on the codellama-7b-gptq model when using AutoGPTQ.