intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

qwen1.8B GPU memory usage is too high #9809

Open juan-OY opened 9 months ago

juan-OY commented 9 months ago

https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary

Running the above model on an A770 GPU, I got the following data:

in-out             | peak GPU memory
32 in / 32 out     | 3.1 GB
2048 in / 512 out  | 7.4 GB
4096 in / 1024 out | 11.6 GB
8192 in / 2048 out | OOM

This is far higher than the INT4 memory usage reported on the official model page; can it be optimized?

Environment: Linux 22.05, kernel 5.19, oneAPI 2024.0, bigdl 2.5.0b20231218, ipex 2.1.10+xpu

Reproduction steps: convert the Qwen model to low-bit INT4, then run the benchmark script with different input lengths to measure performance. Partial test code (a sketch of the omitted model-loading step follows the snippet):

    with torch.inference_mode():
        torch.xpu.synchronize()
        prompt = QWEN_PROMPT_FORMAT.format(prompt=prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        torch.xpu.synchronize()

        # ipex model needs a warmup, then inference time can be accurate
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)

        for i in range(5):
            st = time.time()
            torch.xpu.synchronize()
            input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
            output = model.generate(input_ids, do_sample=False,
                                    max_new_tokens=args.n_predict)
            output_str = tokenizer.decode(output[0], skip_special_tokens=True)
            torch.xpu.synchronize()
            end = time.time()
            print(f"cost {end - st:.4f}s")
            print(output_str)
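
For context, here is a minimal sketch of the model-loading step that the snippet above omits, based on the bigdl-llm transformers-style API current at the time; the model path and prompt template are placeholders, and exact argument names may differ between bigdl-llm releases:

    import torch
    from transformers import AutoTokenizer
    from bigdl.llm.transformers import AutoModelForCausalLM  # assumed import path for bigdl-llm 2.5.x

    model_path = "Qwen/Qwen-1_8B-Chat"              # placeholder: local path or hub id
    QWEN_PROMPT_FORMAT = "<human>{prompt} <bot>"    # placeholder prompt template

    # Load the model with 4-bit (INT4) weight quantization and move it to the Intel GPU.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 trust_remote_code=True)
    model = model.to('xpu')
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)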
hkvision commented 9 months ago

Hi, on https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary it seems the reported memory is for 1 token in and 2048/8192 tokens out.

We will reproduce this result and update our results here.

Ricky-Ting commented 8 months ago

We used https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py to reproduce the results.

Here are the results on NVIDIA's GPU: the memory usage reported by torch.cuda.max_memory_allocated matches the official report, but the memory usage reported by nvidia-smi is a little larger.

Model               | Device  | in-out | torch.cuda.max_memory_allocated | nvidia-smi
Qwen-1_8B-Chat-Int4 | RTX4090 | 1-2048 | 2.91 GB                         | 3.62 GB

Here are the results on Intel's A770 using bigdl-llm: the memory usage is a little larger than the official report, but it is reasonable. We are continuing to optimize Qwen's memory footprint.

Model          | Device  | in-out | torch.xpu.max_memory_allocated | xpu-smi
Qwen-1_8B-Chat | Arc 770 | 1-2048 | 3.34 GB                        | 4.01 GB
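
For reference, this is roughly how such peak numbers can be read out in a script; a minimal sketch, assuming an IPEX XPU build where torch.xpu mirrors the torch.cuda memory-stats API (the CUDA branch is standard PyTorch):

    import torch

    def report_peak_memory(device="xpu"):
        # Peak memory held by the PyTorch caching allocator, in GB. This is
        # lower than what nvidia-smi / xpu-smi report, because the driver,
        # context and allocator reserve extra memory on top of it.
        if device == "xpu":
            torch.xpu.synchronize()
            peak = torch.xpu.max_memory_allocated()
        else:
            torch.cuda.synchronize()
            peak = torch.cuda.max_memory_allocated()
        print(f"peak allocated on {device}: {peak / 1024**3:.2f} GB")

    # e.g. call report_peak_memory("xpu") right after model.generate(...)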
hkvision commented 8 months ago

For longer sequence inputs (1k/2k/4k, ...), bigdl-llm uses more memory than the official model. We will look into this.

juan-OY commented 8 months ago

Thanks for the update.

hkvision commented 6 months ago

Hi, sorry for the late reply. One difference is that the official INT4 model uses w4a16 (4-bit weights, fp16 activations), while previously when you ran with ipex-llm it used w4a32, so you need to add model = model.half() after loading the model and before moving it to xpu. We have also optimized our memory usage, and compared with the RTX 4090 the memory usage is now reasonable/comparable. Please check again with the latest ipex-llm :)
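
In code, the suggested change would look roughly like this; a minimal sketch, assuming the transformers-style API of recent ipex-llm (treat the exact import path and argument names as assumptions for your installed version):

    from ipex_llm.transformers import AutoModelForCausalLM  # assumed import path for recent ipex-llm

    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 trust_remote_code=True)
    model = model.half()     # cast the non-quantized parts/activations to fp16 (w4a16)
    model = model.to('xpu')  # move to the Intel GPU after the half() cast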