intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Inference on GPU Error - [RuntimeError: Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)] #10604

Closed Mushtaq-BGA closed 7 months ago

Mushtaq-BGA commented 8 months ago

Hi, I am trying to run the Mistral sample on a GPU (Flex 140), and it throws the error below:

"RuntimeError: Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)"

Attached: screenshot and env-check log

env-check.log

qiyuangong commented 8 months ago

Hi @Mushtaq-BGA

This error log indicates that GPU memory is insufficient. Please check that the Flex-related environment variables are set correctly: https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral#31-configurations-for-linux.
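For reference, on Linux the linked configuration typically amounts to something like the following (a sketch based on the common IPEX-LLM GPU setup; check the linked doc for the exact values for your driver and oneAPI version):

```shell
# Assumed Linux environment setup for IPEX-LLM on a Flex GPU.
# First activate the oneAPI runtime, e.g.: source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF                                    # recommended on Flex/Arc
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1  # use Level Zero immediate command lists
```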

To reduce GPU memory usage, you can set an environment variable to turn on KV cache quantization:

export BIGDL_QUANTIZE_KV_CACHE=1

(Optional) Move the embedding layer to the CPU with cpu_embedding=True:

from ipex_llm.transformers import AutoModelForCausalLM  # import needed for this API

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
                                             trust_remote_code=True,
                                             cpu_embedding=True,
                                             use_cache=True)
qiyuangong commented 8 months ago

The Intel® Data Center GPU Flex 140 accelerator has two GPUs on a single card, supporting heterogeneous vGPU profiles. https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/data-center-gpu/flex-series/overview.html

Flex 140 has 2 GPUs on a single card, so its 12 GB of GPU memory is split into 2 × 6 GB per GPU. Unfortunately, 6 GB (5070 MB in practice) is a bit tight for Mistral.

On Flex 170, we found that the Mistral example requires more than 6 GB of memory for inference.
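A back-of-envelope estimate shows why 6 GB is tight. The parameter count and KV-cache dimensions below are assumed approximations for Mistral-7B, not measured figures:

```python
# Rough memory budget for Mistral-7B in INT4 (assumed figures, not measured).
PARAMS = 7.24e9  # approximate parameter count of Mistral-7B

# INT4 weights: 0.5 bytes per parameter.
weights_gb = PARAMS * 0.5 / 1024**3

def kv_cache_gb(seq_len, layers=32, kv_heads=8, head_dim=128, fp16_bytes=2):
    """fp16 KV cache size; Mistral-7B uses grouped-query attention (8 KV heads).
    The factor of 2 accounts for storing both K and V."""
    return 2 * layers * kv_heads * head_dim * fp16_bytes * seq_len / 1024**3

print(f"INT4 weights: ~{weights_gb:.2f} GB")            # -> ~3.37 GB
print(f"KV cache @ 4096 tokens: ~{kv_cache_gb(4096):.2f} GB")  # -> ~0.50 GB
```

Weights alone take roughly 3.4 GB; adding the KV cache, fp16 embeddings, activations, and runtime overhead can exceed the ~5 GB actually available on one Flex 140 GPU, which is consistent with the >6 GB observed on Flex 170. BIGDL_QUANTIZE_KV_CACHE=1 and cpu_embedding=True trim exactly the non-weight parts of this budget.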

Uxito-Ada commented 8 months ago

Hi @Mushtaq-BGA ,

Although a single FLEX 140 GPU cannot serve Mistral due to insufficient memory, you can try DeepSpeed tensor parallelism, which supports large-model inference across multiple GPUs. I have verified this on FLEX 170, and the per-GPU memory consumption suggests it should be feasible on FLEX 140 as well.

Here is a guide on how to run IPEX-LLM INT4 models with DeepSpeed TP.

As FLEX shares a similar architecture with Arc, you can port the example to FLEX by adapting the script run_vicuna_33b_arc_2_card.sh; the model path in the script can be reset as needed.

Let me know if you run into any issues.