Hi @Mushtaq-BGA
This error log indicates that GPU memory is insufficient. Please check whether the Flex-related environment variables are set correctly: https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral#31-configurations-for-linux.
To reduce GPU memory usage, you can set an environment variable to enable KV cache quantization:
export BIGDL_QUANTIZE_KV_CACHE=1
(Optional) Move the embedding layer to the CPU with cpu_embedding=True:
# AutoModelForCausalLM here is the ipex-llm drop-in replacement for the Hugging Face class
# (older bigdl-llm releases import it from bigdl.llm.transformers instead)
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
                                             trust_remote_code=True,
                                             cpu_embedding=True,  # keep the embedding layer on the CPU
                                             use_cache=True)
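For reference, here is a minimal sketch of the rest of the flow, assuming the standard ipex-llm Mistral example (model_path and the prompt are placeholders):

import torch
from transformers import AutoTokenizer

# Move the 4-bit model to the Intel GPU and run a short generation
model = model.to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))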
The Intel® Data Center GPU Flex 140 accelerator has two GPUs on a single card, supporting heterogeneous vGPU profiles: https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/data-center-gpu/flex-series/overview.html
Because the Flex 140 packs 2 GPUs on one card, its 12GB of GPU memory is split into 2 * 6GB. Unfortunately, 6GB (5070MB in practice) is a bit tight for Mistral.
On Flex 170, we found that the mistral example requires > 6GB memory for inference.
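If you want to confirm how the card is exposed, a quick check like the sketch below (assuming intel_extension_for_pytorch is installed; the exact device-property fields may vary by IPEX version) should list two devices of roughly 6GB each on a Flex 140:

import torch
import intel_extension_for_pytorch as ipex  # registers the torch.xpu backend

# A Flex 140 normally shows up as two separate XPU devices, each with ~6GB
for i in range(torch.xpu.device_count()):
    props = torch.xpu.get_device_properties(i)
    print(i, props.name, f"{props.total_memory / 1024**2:.0f} MB")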
Hi @Mushtaq-BGA ,
Although a single FLEX 140 GPU cannot serve Mistral due to insufficient memory, you can try DeepSpeed Tensor Parallelism, which supports large-model inference across multiple GPUs. I have verified it on FLEX 170, and the GPU memory consumption there suggests it should also be feasible on FLEX 140.
Here is a guide on how to run IPEX-LLM INT4 models with DeepSpeed TP.
Since FLEX shares a similar architecture with Arc, you can port the example to FLEX starting from the script run_vicuna_33b_arc_2_card.sh; reset the model path in the script as needed (see the sketch below).
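For example, the port could look roughly like this (the copied script name is just illustrative; the model path variable to edit lives inside the script itself):

# Copy the Arc 2-card launch script and adapt it for the two GPUs of a Flex 140
cp run_vicuna_33b_arc_2_card.sh run_mistral_flex_2_card.sh
# Edit the model path inside the copied script to point at your Mistral checkpoint
bash run_mistral_flex_2_card.sh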
Feel free to reach out if you run into any issues.
Hi, I am trying to run the Mistral sample on a GPU (Flex 140), and it throws the error below:
"RuntimeError: Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)"
Attached screenshot and env-check log
env-check.log