intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

2-GPU setup for Llama2-7b not working: XPU-SMI shows device 0 @ 99% and device 1 @ 0% during execution -- resolved #10538

Closed · gbertulf closed this issue 4 months ago

gbertulf commented 5 months ago

I followed the steps in this GitHub guide -- https://github.com/intel-analytics/BigDL/blob/main/python/llm/example/GPU/Deepspeed-AutoTP/README.md -- and attempted to verify 2-GPU inference runs with the following token combinations:

1) Initial run using the default script with sym_int4 and 32 tokens

[screenshot: xpu-smi output for the initial run]

Note: This run used only one GPU, as world_size is 1 here.

2) Also tried sym_int4, sym_int8, fp8, and fp16 at token sizes 2048x128 and 2048x256. The general observation using xpu-smi is shown below:

[screenshot: xpu-smi dumps for device 0 and device 1]

Notes: 1) The top portion is the xpu-smi dump for device 0 and the bottom portion is the dump for device 1. 2) Notice that device 0 shows 99% GPU utilization while device 1 shows close to 0% utilization. (A sketch of the monitoring commands follows.)
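For context, per-device dumps like the ones above can be collected with xpu-smi's `dump` subcommand. This is a minimal sketch; the metric IDs are an assumption (on the versions I have seen, 0 is GPU utilization, 1 is power, and 2 is frequency -- check `xpu-smi dump --help` for the exact mapping on your install):

```bash
# Stream utilization/power/frequency for device 0; run the second command
# in another terminal to watch device 1 side by side.
xpu-smi dump -d 0 -m 0,1,2
xpu-smi dump -d 1 -m 0,1,2
```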

Kindly advise if there is any intermediate step needed to get the expected two GPU processes running.

Please note that I am running the inference on an NF5468-M6 system with 8x Intel Flex 170 GPUs. Full system spec details are available here: https://wiki.ith.intel.com/display/MediaWiki/Flex-170x8+%28Inspur+-+ICX%29+Qualification
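For anyone reproducing this, here is a minimal sketch of a two-rank AutoTP launch, assuming the `deepspeed_autotp.py` entry point from the linked README (the exact script name, flags, and environment variables may differ between BigDL/ipex-llm versions):

```bash
# Minimal 2-GPU AutoTP launch sketch; assumes oneAPI is installed at the
# default path and deepspeed_autotp.py exists as in the README.
source /opt/intel/oneapi/setvars.sh   # oneAPI runtime environment

export MASTER_ADDR=127.0.0.1
export CCL_ZE_IPC_EXCHANGE=sockets    # oneCCL transport commonly used on XPU

# world_size is derived from the number of ranks the launcher starts:
# -np 2 -> world_size of 2 -> the model is sharded across two GPUs.
mpirun -np 2 \
  python deepspeed_autotp.py \
    --repo-id-or-model-path meta-llama/Llama-2-7b-chat-hf \
    --low-bit sym_int4
```

If the value feeding `-np` is 1 (or empty), world_size stays at 1 and only one device does any work, which matches the first run above.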

yangw1234 commented 5 months ago

Synced with @gbertulf offline; this problem was caused by a bash syntax error in the startup script. After the fix, both devices are busy, but the following error still occurs: [screenshot: error output]
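The actual script and error were not posted, so as a purely hypothetical illustration: a single-character bash mistake like the one below can leave the rank count empty, so the launcher silently falls back to one process (world_size 1, one busy GPU):

```bash
# Hypothetical example of a bash syntax pitfall -- not the actual script.
NUM_GPUS = 2    # WRONG: spaces around '=' make bash try to run "NUM_GPUS"
                # as a command; the variable stays unset and, without
                # `set -e`, the script keeps going.

# An empty NUM_GPUS falls back to the default of 1 rank -> world_size 1.
mpirun -np "${NUM_GPUS:-1}" python deepspeed_autotp.py --low-bit sym_int4

NUM_GPUS=2      # correct: no spaces around '='
```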

gbertulf commented 4 months ago

The issue is resolved. Closing this ticket. Thank you, team, for your help.