intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library
Apache License 2.0

Why Llama series int4 quantization has fp16 attention layers? #111

Open kyang-06 opened 1 month ago

kyang-06 commented 1 month ago

Describe the bug When loading TinyLlama or Llama-3-8B with dtype=int4, the model structure looks:

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (rotary_emb): LlamaRotaryEmbedding()
          (kv_proj): Linear()
          (q_proj): Linear()
          (o_proj): Linear()
        )
        (mlp): LlamaMLP(
          (act_fn): SiLU()
          (down_proj): QuantizedLinear()
          (fused_gate_proj_up_proj): QuantizedLinear()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): QuantizedLinear()
)

q_proj, kv_proj, and o_proj are all plain Linear layers in fp16, where QuantizedLinear is expected. I think this is a bug, and it has a negative impact on both speed and quantization accuracy. By contrast, Phi-3 int4 quantization works properly:

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): QuantizedLinear()
          (qkv_proj): QuantizedLinear()
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): QuantizedLinear()
          (down_proj): QuantizedLinear()
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): QuantizedLinear()
)

To Reproduce

from transformers import AutoTokenizer, TextStreamer
from intel_npu_acceleration_library import NPUModelForCausalLM
import torch
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = NPUModelForCausalLM.from_pretrained(model_id, use_cache=True, dtype=torch.int8).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
print(model)
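
A quick way to confirm which projections are left unquantized is to walk the module tree after loading. The following inspection snippet is only a sketch (not part of the original report) and groups submodules by class name:

# List every linear-style submodule: plain "Linear" entries are the fp16
# layers that should have been replaced; "QuantizedLinear" entries are the
# quantized ones.
for name, module in model.named_modules():
    cls = module.__class__.__name__
    if cls in ("Linear", "QuantizedLinear"):
        print(f"{name}: {cls}")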

Expected behavior

(self_attn): LlamaAttention(
  (o_proj): QuantizedLinear()
  (qkv_proj): QuantizedLinear()
  (rotary_emb): LlamaRotaryEmbedding()
)


alessandropalla commented 1 month ago

Thanks for pointing this out, we'll investigate

kyang-06 commented 3 weeks ago

I've solved this issue. The cause is that neural compressor fails to recognize the linear layers inside the customized Llama attention. Fixed by inheriting from nn.Linear (along with a modified __init__) instead of nn.Module at https://github.com/intel/intel-npu-acceleration-library/blob/f04d499432fec85afd65532711224df7be76d6dc/intel_npu_acceleration_library/nn/linear.py#L16C1-L18C1
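
For reference, the shape of that change would be roughly as follows. This is only a sketch: the constructor signature is an assumption for illustration, not the library's exact code.

import torch
from torch import nn

class Linear(nn.Linear):
    """NPU linear layer that subclasses nn.Linear so that quantization
    tooling (e.g. neural compressor) recognizes it as a linear module.
    Sketch only; the real class lives in intel_npu_acceleration_library/nn/linear.py."""

    def __init__(self, weight: torch.Tensor, bias: torch.Tensor = None):
        # Initialize nn.Linear with matching in/out features, then reuse the
        # existing parameters instead of the freshly initialized ones.
        super().__init__(weight.shape[1], weight.shape[0], bias=bias is not None)
        self.weight = nn.Parameter(weight)
        if bias is not None:
            self.bias = nn.Parameter(bias)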