Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0
6.45k stars, 1.24k forks
2nd latency of llama3-8B-Instruct with int4 & all-in-one tool issue #10926
We have already reproduced the issue and will fix it later. In the meantime, we recommend using fp16 for the non-linear layers: please refer to the all-in-one benchmark scripts and select the transformer_int4_fp16_gpu API.
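For reference, a minimal sketch of what that recommendation amounts to outside the benchmark harness, assuming the standard ipex_llm.transformers loading API: `load_in_4bit=True` quantizes the linear layers to int4, while `.half()` keeps the remaining (non-linear) layers in fp16 on the Intel GPU, which is the combination the transformer_int4_fp16_gpu test_api exercises. The model path and prompt are placeholders.

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder

# int4 weight-only quantization for the linear layers
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,
    optimize_model=True,
    trust_remote_code=True,
    use_cache=True,
)
# fp16 for the remaining (non-linear) layers, running on the Intel GPU
model = model.half().to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```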
The 2nd-token latency of llama3-8b-instruct with int4 at bs=1 is larger than at bs=2 (ipex-llm=2.5.0b20240504).
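For context, a rough sketch of how the 1st- vs 2nd-token latencies could be compared across batch sizes, assuming the model and tokenizer from the snippet above and an available XPU backend (`torch.xpu.synchronize` is provided once IPEX/ipex-llm is loaded). The all-in-one tool times each generated token inside the generation loop, so this two-run subtraction is only an approximation.

```python
import time
import torch

def latencies(model, tokenizer, prompt, batch_size, new_tokens=32):
    # Identical prompts, so no padding is needed when batching.
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt").input_ids.to("xpu")
    with torch.inference_mode():
        # Warm-up run so kernel compilation does not skew the timing.
        model.generate(inputs, max_new_tokens=new_tokens)
        torch.xpu.synchronize()

        t0 = time.perf_counter()
        model.generate(inputs, max_new_tokens=1)
        torch.xpu.synchronize()
        first = time.perf_counter() - t0           # 1st-token (prefill) latency

        t0 = time.perf_counter()
        model.generate(inputs, max_new_tokens=new_tokens)
        torch.xpu.synchronize()
        total = time.perf_counter() - t0
        rest = (total - first) / (new_tokens - 1)  # avg 2nd+ token latency
    return first, rest

for bs in (1, 2):
    first, rest = latencies(model, tokenizer, "What is AI?", bs)
    print(f"bs={bs}: 1st token {first * 1000:.1f} ms, 2nd+ token {rest * 1000:.1f} ms")
```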