Vasud-ha opened this issue 8 months ago
Hi, to run neural-chat 7b inference using DeepSpeed AutoTP and our low-bit optimization, you could follow these steps:
1) Prepare your environment following the installation steps. For the neural-chat-7b model specifically, you additionally need to run `pip install transformers==4.34.0`.
2) Currently, you need to modify https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/Deepspeed-AutoTP/deepspeed_autotp.py#L85 to `model = optimize_model(model.module.to(f'cpu'), low_bit=low_bit, optimize_llm=False).to(torch.float16)` (see the sketch after the run script below for where this line sits in the example).
Important: this PR (https://github.com/intel-analytics/ipex-llm/pull/10527) adds support for the default optimize_llm=True case. If you use a later version that includes this fix, you can skip step 2.
3) Directly use the following script to run on two GPUs:
```bash
export MASTER_ADDR=127.0.0.1
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so

basekit_root=/opt/intel/oneapi
source $basekit_root/setvars.sh --force
source $basekit_root/ccl/latest/env/vars.sh --force

NUM_GPUS=2  # number of GPUs to use
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0  # different from PVC

mpirun -np $NUM_GPUS --prepend-rank \
  python deepspeed_autotp.py --repo-id-or-model-path 'Intel/neural-chat-7b-v3' --low-bit 'sym_int4'
```
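For reference, here is a minimal sketch of where the step-2 modification sits in deepspeed_autotp.py: the model is first sharded by DeepSpeed AutoTP, then the sharded module is optimized to low-bit on CPU and moved to the local XPU device. Treat this as an outline rather than the full example; the exact arguments in the real script (e.g. how the tp_size and device index are derived from the MPI environment) may differ.

```python
# Minimal outline (not the full example script); values such as tp_size=2 and
# "xpu:0" are illustrative only.
import torch
import deepspeed
import intel_extension_for_pytorch as ipex  # noqa: F401  (enables the XPU backend)
from transformers import AutoModelForCausalLM
from ipex_llm import optimize_model

# Load the model in fp16 on CPU first.
model = AutoModelForCausalLM.from_pretrained(
    "Intel/neural-chat-7b-v3",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

# Shard the model across ranks with DeepSpeed AutoTP.
model = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 2},   # in the example this comes from the MPI world size
    dtype=torch.float16,
    replace_with_kernel_inject=False,
)

# Step 2 above: apply the ipex-llm low-bit optimization to the sharded module
# on CPU, cast back to fp16, then move this rank's shard to its XPU device.
model = optimize_model(
    model.module.to("cpu"), low_bit="sym_int4", optimize_llm=False
).to(torch.float16)
model = model.to("xpu:0")             # in the example the index is the local rank
```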
Please give it a try and feel free to let me know if you have any other questions.
Hi @plusbang, I can see 3 GPUs on my system (2 Flex 140 devices with 6 GB of memory each and 1 Flex 170), but while running neural-chat 7b with DeepSpeed I get an out-of-resource error, even though GPU memory utilization is only 50% on the devices.
Devices 0 and 1 are used by default in our script. Please refer to here for more details about how to select devices.
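As one way to restrict which GPUs the script sees, you can limit the visible Level Zero devices before launching. The sketch below assumes the ZE_AFFINITY_MASK environment variable; it can equally be exported in the shell before mpirun instead of being set in Python.

```python
# Sketch: expose only Level Zero devices 0 and 1 to this process, then list
# what PyTorch's XPU backend can see. ZE_AFFINITY_MASK must be set before the
# GPU runtime is initialised, i.e. before torch / IPEX are imported.
import os
os.environ.setdefault("ZE_AFFINITY_MASK", "0,1")

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the XPU backend)

print("visible XPU devices:", torch.xpu.device_count())
for i in range(torch.xpu.device_count()):
    print(f"  xpu:{i} -> {torch.xpu.get_device_name(i)}")
```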
According to my experiment on 2 A770 GPUs, about 3 GB is used per GPU if you run neural-chat-7B with sym_int4 and the default input prompt in the example. According to your error message, python=3.10 is used. We recommend creating a python=3.9 environment following our installation steps and additionally running `pip install transformers==4.34.0` for this model.
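Once the environment is set up, a quick sanity check (an illustrative snippet, not part of the example) can confirm that the interpreter and transformers versions match the recommendation above:

```python
# Verify the runtime matches the recommended setup: Python 3.9 and
# transformers 4.34.0 for neural-chat-7b.
import sys
import transformers

print("python:", sys.version.split()[0])          # expect 3.9.x
print("transformers:", transformers.__version__)  # expect 4.34.0
```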
Hi @plusbang, we can successfully run inference with DeepSpeed for neural-chat on Flex 140. Thanks for your support. However, the customer is also interested in the performance for concurrent usage during deployment. Could you please guide us on how to test handling multiple requests on the same instance with DeepSpeed on Flex 140?
We plan to add a deepspeed+ipex-llm inference backend to FastChat serving and will keep you updated once it's supported. Thanks.
@Vasud-ha we have added IPEX-LLM serving on multiple Intel GPUs using DeepSpeed AutoTP and FastAPI. Please refer to https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/deepspeed_autotp_fastapi_quickstart.html and https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Deepspeed-AutoTP-FastAPI
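For a first look at concurrent-request behaviour, one simple option is to fire several requests at the served endpoint in parallel and record latencies. The sketch below is only an illustration: the URL, port and payload fields are assumptions, so check the quickstart above for the exact API of the deployed service and adjust accordingly.

```python
# Hedged sketch of a basic concurrency test against the FastAPI serving example.
# The URL, port and payload fields are assumed; adapt them to the actual service.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/generate/"                    # assumed endpoint
PAYLOAD = {"prompt": "Once upon a time", "n_predict": 64}  # assumed fields

def one_request(i):
    """Send one request and return its index, HTTP status and latency."""
    start = time.time()
    resp = requests.post(URL, json=PAYLOAD, timeout=300)
    return i, resp.status_code, time.time() - start

concurrency = 4  # number of simultaneous clients to simulate
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    for i, status, latency in pool.map(one_request, range(concurrency)):
        print(f"request {i}: status={status}, latency={latency:.2f}s")
```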
The Intel GPU Flex 140 has two GPUs per card, with a total memory capacity of 12 GB (6 GB per GPU). Currently, I can run inference only on one GPU device with limited memory. Could you please guide us on how to run model inference on two cards using DeepSpeed with neural-chat, as done in these samples: https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/Deepspeed-AutoTP