intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Run neural-chat 7b inference with Deepspeed on Flex 140. #10507

Open Vasud-ha opened 8 months ago

Vasud-ha commented 8 months ago

The Intel GPU Flex 140 has two GPUs per card, with a total memory capacity of 12 GB (6 GB per GPU). Currently, I can run inference only on one GPU device, which has limited memory. Could you please guide me on running the model inference across the two GPUs using DeepSpeed with neural-chat, as done in these samples: https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/Deepspeed-AutoTP

plusbang commented 8 months ago

Hi, to run neural-chat 7b inference using DeepSpeed AutoTP and our low-bit optimization, you could follow these steps:

1) Prepare your environment by following the installation steps. For the neural-chat-7b model in particular, you additionally need to run pip install transformers==4.34.0.

2) Currently, you need to modify https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/Deepspeed-AutoTP/deepspeed_autotp.py#L85 to model = optimize_model(model.module.to(f'cpu'), low_bit=low_bit, optimize_llm=False).to(torch.float16) (a condensed sketch of how this change fits into the script is shown after step 3). Important: PR https://github.com/intel-analytics/ipex-llm/pull/10527 adds support for the default optimize_llm=True case, so if you use a later version that includes this fix, you can skip step 2.

3) Directly use the following script to run on two GPUs:

  export MASTER_ADDR=127.0.0.1
  export FI_PROVIDER=tcp
  export CCL_ATL_TRANSPORT=ofi
  export CCL_ZE_IPC_EXCHANGE=sockets

  export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so:${LD_PRELOAD}
  basekit_root=/opt/intel/oneapi
  source $basekit_root/setvars.sh --force
  source $basekit_root/ccl/latest/env/vars.sh --force

  NUM_GPUS=2 # number of GPUs to use
  export USE_XETLA=OFF
  export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
  export TORCH_LLM_ALLREDUCE=0 # Different from PVC

  mpirun -np $NUM_GPUS --prepend-rank \
      python deepspeed_autotp.py --repo-id-or-model-path 'Intel/neural-chat-7b-v3' --low-bit 'sym_int4'
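
For reference, below is a condensed, illustrative sketch of how the step-2 change fits into the flow of deepspeed_autotp.py. It is not the full example script: argument parsing, tokenization, generation, and the distributed/oneCCL setup are omitted, and the rank/world-size environment variables and some keyword arguments are simplified assumptions that may differ between versions.

  # Condensed sketch only; see deepspeed_autotp.py in the repo for the real script.
  import os
  import torch
  import deepspeed
  import intel_extension_for_pytorch as ipex  # noqa: F401, enables the 'xpu' device
  from transformers import AutoModelForCausalLM
  from ipex_llm import optimize_model

  # Rank discovery is simplified here; the launcher (mpirun in the run script)
  # provides the actual values.
  local_rank = int(os.getenv("LOCAL_RANK", "0"))
  world_size = int(os.getenv("WORLD_SIZE", "2"))
  low_bit = "sym_int4"

  # Load the model on CPU in float16 first.
  model = AutoModelForCausalLM.from_pretrained(
      "Intel/neural-chat-7b-v3",
      torch_dtype=torch.float16,
      trust_remote_code=True,
      use_cache=True,
  )

  # Shard the model across ranks with DeepSpeed AutoTP.
  model = deepspeed.init_inference(
      model,
      mp_size=world_size,
      dtype=torch.float16,
      replace_method="auto",
  )

  # Step 2: apply the ipex-llm low-bit optimization to the sharded module.
  model = optimize_model(
      model.module.to('cpu'), low_bit=low_bit, optimize_llm=False
  ).to(torch.float16)

  # Move each rank's shard to its own Intel GPU.
  model = model.to(f"xpu:{local_rank}")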

Please give it a try and feel free to let me know if you have any other questions.

Vasud-ha commented 8 months ago

Hi @plusbang, I can see 3 GPU devices on my system (two Flex 140 GPUs with 6 GB of memory each, and one Flex 170). While running neural-chat 7b with DeepSpeed I get an out-of-resource error, even though GPU memory utilization is only about 50% on the devices (screenshots attached).

plusbang commented 8 months ago

Devices 0 and 1 are used by default in our script; please refer to the device-selection documentation for more details about how to choose which devices are used.

In my experiment on two Arc A770 GPUs, roughly 3 GB is used per GPU when running neural-chat-7B with sym_int4 and the default input prompt from the example. Your error message indicates that Python 3.10 is being used; we recommend creating a Python 3.9 environment following our installation steps and additionally running pip install transformers==4.34.0 for this model.
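
As a rough sanity check on that number (a back-of-the-envelope estimate of my own, not a measured figure), a 7B-parameter model with 4-bit weights sharded across two GPUs should fit well within 6 GB per device:

  # Back-of-the-envelope memory estimate for neural-chat-7b with sym_int4 and 2-way AutoTP.
  # All numbers below are rough assumptions, not measurements.
  params = 7.2e9                    # ~7.2B parameters (Mistral-7B-based model)
  bytes_per_weight = 0.5            # sym_int4: ~4 bits = 0.5 byte per weight
  total_weights_gb = params * bytes_per_weight / 1e9     # ~3.6 GB for all weights
  per_gpu_weights_gb = total_weights_gb / 2               # ~1.8 GB per GPU with TP=2
  runtime_overhead_gb = 1.0         # KV cache, activations, buffers (rough guess)
  print(f"~{per_gpu_weights_gb + runtime_overhead_gb:.1f} GB per GPU")  # roughly 3 GB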

Vasud-ha commented 8 months ago

Hi @plusbang, we can now successfully run neural-chat inference with DeepSpeed on Flex 140. Thanks for your support. However, the customer is also interested in the performance during deployment under concurrent usage. Could you please guide us on how to test handling multiple requests on the same instance with DeepSpeed on Flex 140?

glorysdj commented 8 months ago

We plan to add a DeepSpeed + ipex-llm inference backend to FastChat serving, and will keep you updated once it is supported. Thanks.

glorysdj commented 6 months ago

@Vasud-ha we have added IPEX-LLM serving on multiple Intel GPUs using DeepSpeed AutoTP and FastAPI. Please refer to https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/deepspeed_autotp_fastapi_quickstart.html and https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Deepspeed-AutoTP-FastAPI
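
Regarding the concurrency question above: once the DeepSpeed-AutoTP-FastAPI server from the quickstart is running, one simple way to exercise it with several simultaneous requests is a small client script like the sketch below. The endpoint path, port, and request payload used here (POST /generate/ with a "prompt" field on localhost:8000) are assumptions for illustration only; please check the linked quickstart and example for the exact API the server exposes.

  # Minimal concurrent-load sketch against the served model.
  # Assumptions to verify against the quickstart: server at localhost:8000,
  # endpoint POST /generate/ accepting {"prompt": ..., "n_predict": ...}.
  import time
  from concurrent.futures import ThreadPoolExecutor

  import requests

  URL = "http://localhost:8000/generate/"   # assumed endpoint
  PROMPT = "Once upon a time, there existed a little girl who liked to have adventures."

  def one_request(i):
      start = time.time()
      resp = requests.post(URL, json={"prompt": PROMPT, "n_predict": 32}, timeout=300)
      return i, resp.status_code, time.time() - start

  # Fire 8 requests at once and report per-request latency.
  with ThreadPoolExecutor(max_workers=8) as pool:
      for i, status, latency in pool.map(one_request, range(8)):
          print(f"request {i}: status={status} latency={latency:.2f}s")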