intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

ipex-llm inference with deepspeed of Qwen1.5-32B consumes too much memory #11635

Open Fred-cell opened 1 month ago

Fred-cell commented 1 month ago

Optimize memory utilization for DP-AP to less than 256 GB.

qiyuangong commented 1 month ago

A 32B model requires 64 GB of memory in FP16 format.

This peak memory usage is caused by multiple processes converting the model in CPU memory. With 4 GPUs, we have 4 processes reading and converting the model in CPU memory (i.e., 64 GB × 4).

We can add a short sleep before loading the LLM model (https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/Deepspeed-AutoTP/deepspeed_autotp.py#L69) to avoid the memory peak:

import time

# Use 10 seconds as an example; adjust it based on the model size
if local_rank != 0:
    time.sleep(10 * local_rank)

oldmikeyang commented 1 month ago

This doesn't fix the issue.

In the benchmark code, deepspeed.init_inference is a synchronous call. If you add a sleep before it, all processes will still be synchronized at that call. But each process releases the allocated CPU memory only after the model is moved to the XPU, and moving the model to the XPU happens after deepspeed.init_inference.
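
To restate that ordering as a simplified sketch (this is not the actual benchmark or deepspeed_autotp.py code; the model path and the exact arguments are placeholders, and argument names vary across DeepSpeed versions): each rank holds a full CPU copy of the FP16 weights until the final move to XPU, so sleeping before deepspeed.init_inference only staggers step 1.

import os
import torch
import deepspeed
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device with torch
from transformers import AutoModelForCausalLM

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

# Step 1: every rank loads the full FP16 model into CPU memory (~64 GB for a 32B model).
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/Qwen1.5-32B",            # placeholder path
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

# Step 2: synchronous call -- all ranks meet here, each still holding its full CPU copy.
engine = deepspeed.init_inference(model, mp_size=world_size, dtype=torch.float16,
                                  replace_with_kernel_inject=False)

# Step 3: only after this move is each rank's CPU copy actually released.
model = engine.module.to(f"xpu:{local_rank}")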

qiyuangong commented 1 month ago

Hi @oldmikeyang

Thank you for your feedback!

Yes, deepspeed.init_inference is a synchronous method. In our local test, the sleep reduces some memory usage (the peak caused by transformers loading the model in parallel), but the main memory usage (model size × rank_num) still exists. We will check with @plusbang about other solutions.

qiyuangong commented 1 month ago

Hi @oldmikeyang

As an alternative, please add a larger swap space in case of OOM. Swapping to an SSD is quite fast, especially for large memory blocks. This memory will be freed after the model is moved to the XPU.

oldmikeyang commented 1 month ago

The Linux kernel has some issues with swap. I am using Ubuntu 22.04 with both the 6.5 and 6.8 kernels. If we use a 512 GB swap file, the Linux kernel crashes due to a kswapd crash. I have filed a bug with the Ubuntu community: https://bugs.launchpad.net/ubuntu/+source/compiz-plugins-main/+bug/2076602

qiyuangong commented 1 month ago

Hi @oldmikeyang

Expected max memory usage in a single process is about 80 GB: the 32B FP16 model (64 GB) + the 4-bit model (16 GB). 4 Arc GPUs will launch 4 processes, so it should use at most 320 GB of memory in total. Why did your application use up 512 GB of swap?
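
For reference, the arithmetic behind those numbers (assuming 32B parameters at 2 bytes each for FP16 and roughly 0.5 byte each for the 4-bit copy):

params = 32e9                        # parameter count
fp16_gb = params * 2 / 1e9           # ~64 GB FP16 copy held per process
int4_gb = params * 0.5 / 1e9         # ~16 GB 4-bit copy per process
per_process_gb = fp16_gb + int4_gb   # ~80 GB peak per process
total_gb = per_process_gb * 4        # ~320 GB across 4 processes (one per Arc GPU)
print(per_process_gb, total_gb)      # 80.0 320.0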

oldmikeyang commented 1 month ago

After searching the DeepSpeed documentation, I found some solutions to reduce the host CPU memory usage.

https://joe-cecil.com/using-meta-tensors-to-load-models-that-dont-fit-in-memory/

Using meta tensors to load models that don't fit in memory

PyTorch recently implemented a feature called meta tensors. These are tensors without any data. As I understand it, a meta tensor is just a shape and some hooks for recording operations performed on itself. If you add, subtract, multiply, etc. two meta tensors, you get another meta tensor. Probably using a meta tensor with a real tensor also produces a meta tensor.

Meta tensors enable one nice feature, which is the ability to load (in each process) only the rank-relevant weights. This is nice because for large models, we might be able to fit the whole thing into CPU memory, and we might be able to fit it in GPU memory after partitioning across all K of our GPUs, but we can't fit K complete copies of our large model in memory. I'm in this situation right now at work. To load the relevant model weights into GPU memory, each process has to load only the relevant weights.

DeepSpeed supports loading the rank-relevant weights. You initialize the model from its config, creating the tensors as meta tensors using something like deepspeed.OnDevice(device="meta"). This gives you a model object you can pass to DeepSpeed. You then pass a checkpoint base path and a checkpoint config path to DeepSpeed. Each rank will then load only its own parameters.
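
A minimal sketch of that meta-tensor path, loosely following the DeepSpeed text-generation example; the model path, checkpoint-descriptor name, and init_inference arguments below are assumptions and may need adjusting for Qwen or ipex-llm:

import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

model_path = "/path/to/Qwen1.5-32B"       # placeholder checkpoint directory

# Build the model skeleton on the meta device: no weight memory is allocated here.
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config)

# Point DeepSpeed at the real weight files so each rank loads only its own shard.
engine = deepspeed.init_inference(
    model,
    mp_size=4,                             # tensor-parallel degree, one rank per GPU
    dtype=torch.float16,
    checkpoint="checkpoints.json",         # descriptor listing the weight files
    replace_with_kernel_inject=False,
)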

https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/README.md

DSPipeline utility class

The text-generation examples make use of the DSPipeline utility class, a class that helps with loading DeepSpeed meta tensors and is meant to mimic the Hugging Face transformer pipeline.

The BLOOM model is quite large and the way DeepSpeed loads checkpoints for this model is a little different than other HF models. Specifically, we use meta tensors to initialize the model before loading the weights:

with deepspeed.OnDevice(dtype=self.dtype, device="meta"):

This reduces the total system/GPU memory needed to load the model across multiple GPUs and makes the checkpoint loading faster. The DSPipeline class helps to load the model and run inference on it, given these differences.
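
For completeness, a hedged sketch of the checkpoint descriptor mentioned above, mirroring what the BLOOM example writes out; the "type" value and the way the weight files are listed are assumptions for any other model:

import glob
import json
import os

model_dir = "/path/to/bloom"              # placeholder checkpoint directory

data = {
    "type": "BLOOM",                       # tag used in the BLOOM example; other models may differ
    "checkpoints": sorted(glob.glob(os.path.join(model_dir, "*.bin"))),
    "version": 1.0,
}
with open("checkpoints.json", "w") as f:
    json.dump(data, f)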