intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Qwen1.5-7B wrong outputs with 1024 prompts #10354

Closed: Uxito-Ada closed this issue 8 months ago

Uxito-Ada commented 8 months ago

code: all-in-one benchmark, where prompt/2048.txt is replaced with the Chinese prompts below
in-out pair: 1024-128 (the 2048-token prompts are truncated to 1024)
model: Qwen1.5-7B-Chat
machine: SPR01
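
For reference, a minimal standalone sketch of roughly what this benchmark case exercises (the prompt file name, model path, and low-bit strings other than sym_int4 are my assumptions, not the benchmark's actual wiring):

# Hedged sketch: reproduce the 1024-in / 128-out case outside the all-in-one harness.
import torch
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM  # ipex-llm was named bigdl-llm at the time of this issue

model_path = "Qwen/Qwen1.5-7B-Chat"                         # placeholder; the benchmark uses a local path
prompt = open("prompt/1024.txt", encoding="utf-8").read()   # hypothetical file holding the truncated Chinese prompt

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="sym_int4",   # the INT4 case; "sym_int8" for INT8 (BF16 handling is assumed similar)
    trust_remote_code=True,
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))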

红楼梦 (Dream of the Red Chamber) prompt: INT4/INT8/BF16 all produce repetitive output like:

空空道人便骑着驴往往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往去往去,往

患者 (patient) prompt: INT4 repeats as below, while BF16 and INT8 give no answer:

临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断
Uxito-Ada commented 8 months ago

@ivy-lv11 pls take a look at this

Uxito-Ada commented 8 months ago

transformers: 4.38.1/4.37.0
torch: 2.2.0+cpu
ipex: 2.2.0

jason-dai commented 8 months ago

transformers: 4.38.1/4.37.0 torch: 2.2.0+cpu ipex: 2.2.0

If the BF16 output is wrong, you can verify with stock PyTorch first (without BigDL).
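
A minimal sketch of that stock-PyTorch BF16 baseline (no BigDL in the loop; the model path and prompt file are placeholders):

# Plain Hugging Face transformers + PyTorch BF16, no BigDL/ipex-llm involved.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Qwen/Qwen1.5-7B-Chat"
prompt = open("prompt/1024.txt", encoding="utf-8").read()   # same truncated Chinese prompt as above

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))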

ivy-lv11 commented 8 months ago

Environment:

  • transformers: 4.38.1/4.37.0;
  • torch: 2.2.0+cpu;

    Chinese prompts

    When using the 2048 prompt file (truncated to 1024 tokens) with original transformers and PyTorch (i.e., with load_in_low_bit removed), the output looks normal. prompt: 患者 (patient)

    1)中性粒细胞比例增高常见于急性感染、严重组织损伤、白血病、恶性肿瘤、类白血病反应、骨髓增殖性疾病等;2)嗜酸性粒细胞比例增高见于寄生虫感染、过敏反应、皮肤病、慢性粒细胞白血病、嗜酸粒细胞增多症等;3)淋巴细胞比例增高见于病毒感染、结缔组织病、免疫缺陷病、血液系统疾病、某些药物反应等;4)单核细胞比例增高见于某些感染、血液系统疾病、急性白血病、恶性肿瘤、类白血

    prompt: 红楼梦 (Dream of the Red Chamber)

    他们进了园门,但见异彩纷呈,楼阁参差,真是仙境。士隐跟着二仙,转过山坡,来到一座楼前,只见门额上写着“薄命司”三个字。和尚说:“这里就是咱们要办的事了。”\n士隐随着和尚进了楼,只见里面摆着许多签筒,签筒里装着各色签子。和尚说:“你抽一支签,看看你的命运如何。”士隐随手拿起一支签,签上写着:“甄士隐梦幻识通灵,贾雨村风尘怀闺秀。”
Uxito-Ada commented 8 months ago

what is the torch version? torch==2.2.0?

ivy-lv11 commented 8 months ago

Yes.

Uxito-Ada commented 8 months ago

Removing load_in_low_bit and optimize_model means the model runs in FP32. If FP32 gives normal outputs, the issue is likely related to INT4, which can be cross-checked against llama.cpp etc.; BF16 can likewise be compared against native PyTorch BF16 support.
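
To make the comparison concrete, a hedged sketch of the configurations being contrasted (the "bf16" low-bit string is my assumption about the bigdl-llm API of that time; the FP32 reference is just plain transformers):

import torch
from transformers import AutoModelForCausalLM as HFAutoModel
from bigdl.llm.transformers import AutoModelForCausalLM as LowBitAutoModel

model_path = "Qwen/Qwen1.5-7B-Chat"

# 1) FP32 reference: plain transformers, i.e. no load_in_low_bit and no optimize_model.
fp32_model = HFAutoModel.from_pretrained(model_path, trust_remote_code=True)

# 2) INT4 under suspicion: its output can be cross-checked against another INT4 stack such as llama.cpp.
int4_model = LowBitAutoModel.from_pretrained(
    model_path, load_in_low_bit="sym_int4", trust_remote_code=True)

# 3) BF16: compare the low-bit BF16 path against native PyTorch BF16 support.
bf16_native = HFAutoModel.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
bf16_lowbit = LowBitAutoModel.from_pretrained(
    model_path, load_in_low_bit="bf16", trust_remote_code=True)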

ivy-lv11 commented 8 months ago

Using stock transformers with BF16 via the pytorch_autocast_bf16 API in the all-in-one benchmark, the output also looks normal:

他们进了园门,但见异彩纷呈,楼阁参差,真是仙境。士隐随着二仙,转过山坡,来到一座楼前,只见一位仙姑端坐在楼上,旁边有一个丫鬟捧着茶盘。仙姑见了士隐,笑道:“甄士隐,你来了。”士隐忙施礼,问道:“仙姑如何认得我?”仙姑说:“你忘了,我在警幻仙子处见过你,还赠过你《好了歌》呢。”士隐这才想起,忙问仙姑:“仙姑为何赠我《好了歌
1)中性粒细胞比例增高常见于急性感染、严重组织损伤、白血病、恶性肿瘤、类白血病反应、骨髓增殖性疾病等;2)嗜酸性粒细胞比例增高见于寄生虫感染、过敏反应、皮肤病、慢性粒细胞白血病、嗜酸粒细胞增多症等;3)淋巴细胞比例增高见于病毒感染、结缔组织病、免疫缺陷病、血液系统疾病、某些化学物质或药物中毒等;4)单核细胞比例增高见于某些感染、血液系统疾病、急性炎症、慢性粒细胞白血
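
The pytorch_autocast_bf16 test above corresponds roughly to wrapping stock-transformers generation in CPU BF16 autocast; a hedged sketch (model path and prompt file are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Qwen/Qwen1.5-7B-Chat"
prompt = open("prompt/1024.txt", encoding="utf-8").read()   # hypothetical truncated Chinese prompt

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)    # plain FP32 weights

inputs = tokenizer(prompt, return_tensors="pt")
# Generate under CPU BF16 autocast, roughly what the benchmark's pytorch_autocast_bf16 test_api exercises.
with torch.inference_mode(), torch.autocast("cpu", dtype=torch.bfloat16):
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
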
Uxito-Ada commented 8 months ago

After disabling the override of the qwen2 attention forward in convert.py (Qwen1.5 uses the qwen2 model type), a normal answer can be generated on SPR:

两旁是一副对联:\n假作真时真亦假,无为有处有还无。\n二人进了里面,见是一座楼阁,楼内挂着“薄命司”的牌子。士隐抬头一看,见里面挂着许多签,签上写着名字,旁边注着诗句和判词。他见签上有个“甄英莲”的名字,就抽出来看,上面写着:\n娇嫩花朵偏遭风雨,聪明女儿薄命终身。\n原是仙家遗种,却落在草莽人家。生于富贵,却死于贫贱。这是她的命,无可奈何。士隐看了,叹了一口气,把签放下。又见一个签上写着“贾

Need to check what is wrong in qwen2_attention_forward_origin.
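
Conceptually, the override being disabled here is a forward monkey-patch along these lines (qwen2_attention_forward_origin is the name from this thread; everything else is illustrative, not the actual convert.py code):

# Illustrative sketch of the kind of attention-forward override applied during conversion;
# skipping this replacement for model_type == "qwen2" is what produced the normal SPR output above.
import transformers.models.qwen2.modeling_qwen2 as qwen2_module

def override_qwen2_attention_forward(model, new_forward):
    # Rebind the forward of every Qwen2 attention layer to the optimized implementation.
    # Note: Qwen2SdpaAttention subclasses Qwen2Attention in transformers 4.37/4.38,
    # so the same patch also lands on the SDPA variant used on CPU.
    for module in model.modules():
        if isinstance(module, qwen2_module.Qwen2Attention):
            module.forward = new_forward.__get__(module, type(module))

# In convert.py this is roughly: override_qwen2_attention_forward(model, qwen2_attention_forward_origin)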

ivy-lv11 commented 8 months ago

Tested with BigDL-LLM 2.5.0b20240311.

On Arc the output looks normal:

1)正常生理情况下,中性粒细胞比例偏高,提示有感染或炎症;2)单核细胞比例偏高,提示有慢性炎症、结核病、白血病等。

However, when running on CPU the output still looks abnormal.

临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断临床实验室实验室诊断
Uxito-Ada commented 8 months ago

It turns out that on CPU the model uses a different attention module from the GPU: Qwen2SdpaAttention, which applies scaled dot-product attention to q/k/v. If it is converted with a forward written for Qwen2Attention, it never gives the right output.

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 4096)
    (layers): ModuleList(
      (0-31): 32 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): LowBitLinear(in_features=4096, out_features=4096, bias=True)
          (k_proj): LowBitLinear(in_features=4096, out_features=4096, bias=True)
          (v_proj): LowBitLinear(in_features=4096, out_features=4096, bias=True)
          (o_proj): LowBitLinear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): LowBitLinear(in_features=4096, out_features=11008, bias=False)
          (up_proj): LowBitLinear(in_features=4096, out_features=11008, bias=False)
          (down_proj): LowBitLinear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm()
        (post_attention_layernorm): Qwen2RMSNorm()
      )
    )
    (norm): Qwen2RMSNorm()
  )
  (lm_head): LowBitLinear(in_features=4096, out_features=151936, bias=False)
)
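
One way to confirm which attention class transformers instantiated (config._attn_implementation is how transformers 4.36+ records the choice; the load call mirrors the sketches above):

from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-7B-Chat", load_in_low_bit="sym_int4", trust_remote_code=True)

print(model.config._attn_implementation)               # e.g. "sdpa" when Qwen2SdpaAttention is selected
print(type(model.model.layers[0].self_attn).__name__)  # Qwen2SdpaAttention here vs Qwen2Attention on GPU
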
ivy-lv11 commented 8 months ago

Model architecture

GPU

Uses Qwen2Attention:

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 4096)
    (layers): ModuleList(
      (0-31): 32 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=True)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=True)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=True)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm()
        (post_attention_layernorm): Qwen2RMSNorm()
      )
    )
    (norm): Qwen2RMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=151936, bias=False)
)

CPU

Uses Qwen2SdpaAttention:

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 4096)
    (layers): ModuleList(
      (0-31): 32 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): LowBitLinear(in_features=4096, out_features=4096, bias=True)
          (k_proj): LowBitLinear(in_features=4096, out_features=4096, bias=True)
          (v_proj): LowBitLinear(in_features=4096, out_features=4096, bias=True)
          (o_proj): LowBitLinear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): LowBitLinear(in_features=4096, out_features=11008, bias=False)
          (up_proj): LowBitLinear(in_features=4096, out_features=11008, bias=False)
          (down_proj): LowBitLinear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm()
        (post_attention_layernorm): Qwen2RMSNorm()
      )
    )
    (norm): Qwen2RMSNorm()
  )
  (lm_head): LowBitLinear(in_features=4096, out_features=151936, bias=False)
)
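
As a debugging-only workaround sketch, one could force the eager Qwen2Attention path so CPU and GPU go through the same class; attn_implementation is a standard transformers argument, and whether the bigdl-llm wrapper forwards it unchanged is an assumption (the real fix is in the PRs below):

from bigdl.llm.transformers import AutoModelForCausalLM

# Force the eager Qwen2Attention path purely as a debugging aid,
# avoiding Qwen2SdpaAttention being selected on CPU.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-7B-Chat",
    load_in_low_bit="sym_int4",
    trust_remote_code=True,
    attn_implementation="eager",
)
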
Uxito-Ada commented 8 months ago

Fixed in #10395 and #10409; new CPU performance data is here.