hao-ai-lab / LookaheadDecoding


Support for GPTQ? #15

Open · liHai001 opened this issue 7 months ago

liHai001 commented 7 months ago

It seems that there is no decrease in latency on a codellama-7b-gptq model loaded with AutoGPTQ.

Viol2000 commented 7 months ago

Hi, I'm unsure about the compatibility of our current implementation with GPTQ.

yhyu13 commented 7 months ago

@liHai001 @Viol2000

Hi guys,

I've just tested LookaheadDecoding with a few lines of code, and here is the error output when using AutoGPTQ to load the GPTQ model instead of HF Transformers:

lade augment_llama!
lade init!
/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:381: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.7` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:386: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:76: UserWarning: Calling `transformers.models.llama.modeling_llama._prepare_4d_attention_mask` is deprecated and will be removed in v4.37. Use `transformers.modeling_attn_mask_utils.AttentionMaskConverter._prepare_4d_attention_mask
  warnings.warn(
Traceback (most recent call last):
  File "/root/CodeSpace/text-generation-webui/modules/callbacks.py", line 57, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/root/CodeSpace/text-generation-webui/modules/text_generation.py", line 361, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 443, in generate
    return self.model.generate(**kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1673, in generate
    return self.greedy_search(
  File "/root/CodeSpace/text-generation-webui/repositories/LookaheadDecoding/lade/decoding.py", line 23, in greedy_search_proxy
    return jacobi_greedy_search_multilevel(self, chat=False, *args, **kwargs)
  File "/root/CodeSpace/text-generation-webui/repositories/LookaheadDecoding/lade/decoding.py", line 278, in jacobi_greedy_search_multilevel
    outputs = self.jforward_multilevel(
  File "/root/CodeSpace/text-generation-webui/repositories/LookaheadDecoding/lade/models/llama.py", line 383, in jforward_multilevel
    outputs = self.model.LlamaModeljforward(
  File "/root/CodeSpace/text-generation-webui/repositories/LookaheadDecoding/lade/models/llama.py", line 198, in LlamaModeljforward
    attention_mask = self.j_prepare_decoder_attention_mask(
  File "/root/CodeSpace/text-generation-webui/repositories/LookaheadDecoding/lade/models/llama.py", line 119, in j_prepare_decoder_attention_mask
    expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 79, in _expand_mask
    return AttentionMaskConverter._prepare_4d_attention_mask(mask=mask, dtype=dtype, tgt_len=tgt_len)
AttributeError: type object 'AttentionMaskConverter' has no attribute '_prepare_4d_attention_mask'

It seems to be an issue with HF Transformers? I am using

transformers              4.35.2

which is not the same version as the one pinned in the requirements:

transformers==4.34.0

yhyu13 commented 7 months ago

Falling back to transformers==4.34.0 has resolved the issue.
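
For anyone hitting the same traceback, a quick way to catch the mismatch up front. This is only a minimal sketch: the pin value comes from the repo's requirements file, and the check name is made up; packaging ships as a dependency of transformers.

from packaging import version
import transformers

REQUIRED = "4.34.0"  # version pinned in LookaheadDecoding's requirements

# Fail fast before patching/generation instead of dying inside the llama mask code.
if version.parse(transformers.__version__) != version.parse(REQUIRED):
    raise RuntimeError(
        f"LookaheadDecoding expects transformers=={REQUIRED}, "
        f"found {transformers.__version__}; run: pip install transformers=={REQUIRED}"
    )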

Disclaimer: the model-serving platform textgen-webui that I used does not display detailed performance metrics such as time to first token or inference speed, so I cannot judge whether LookaheadDecoding improves on AutoGPTQ.
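
If someone wants to measure it outside textgen-webui, a rough wall-clock throughput check could look like the sketch below. It assumes a HF-style generate() and a model that exposes .device (which should hold for both the plain and the AutoGPTQ-loaded model); the function name and parameter values are just illustrative.

import time
import torch

def measure_decode_speed(model, tokenizer, prompt, max_new_tokens=256):
    # Rough tokens/second for one greedy generation (ignores time to first token).
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed

Running this once with lade enabled and once without, on the same prompt, would give a first-order comparison.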

Viol2000 commented 7 months ago

transformers updated its attention-mask handling in v4.35.0. We do not support it yet, but we plan to.
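
For anyone who needs a newer transformers in the meantime, one possible bridge is sketched below. It is untested and not necessarily how this will be handled upstream; it assumes transformers>=4.35 exposes the module-level helper _prepare_4d_attention_mask in transformers.modeling_attn_mask_utils, which took over the role of the per-model _expand_mask used by lade/models/llama.py.

# Untested compatibility sketch for the expanded attention mask.
try:
    # transformers >= 4.35: module-level helper
    from transformers.modeling_attn_mask_utils import (
        _prepare_4d_attention_mask as _expand_mask,
    )
except ImportError:
    # transformers <= 4.34: legacy helper inside the llama modeling file
    from transformers.models.llama.modeling_llama import _expand_mask

# Both callables take (mask, dtype, tgt_len=None), so the existing call in
# j_prepare_decoder_attention_mask should not need to change:
# expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1])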