lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Does the FastChat framework support multi-NPU inference? I changed num_gpus to 4, but after it loads the model it does not distribute it evenly across the cards. #3459

Closed · xunmenglt closed this 3 months ago

xunmenglt commented 3 months ago

Does the FastChat framework support multi-NPU inference? I changed num_gpus to 4, but after it loads the model it does not distribute it evenly across the cards.

xunmenglt commented 3 months ago

I managed to work around this problem; posting it here for reference 😀

Server environment:

- OS: Linux Ascend910B01 4.19.90-24-4.v2101.ky10.aarch64 GNU/Linux
- NPU driver version: 24.1.rc1
- CANN version: 8.0

Background

I adapted the Hugging Face version of the Qwen2 model (Qwen2-7B-Instruct) to a Huawei 910B server and used the FastChat framework to deploy it behind an OpenAI-compatible API. However, the existing FastChat framework only supports loading the model and running inference on a single NPU card. To get around this, I modified the FastChat source files as described below (performance not guaranteed).
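For reference, once the controller, model worker, and OpenAI API server are up, the deployment can be exercised with the standard OpenAI client. A minimal sketch, assuming openai>=1.0 and FastChat's default API server port 8000; the model name must match whatever the worker registered:

```python
# Minimal client check against a FastChat OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen2-7B-Instruct",  # must match the name the worker registered
    messages=[{"role": "user", "content": "Hello, please introduce yourself."}],
)
print(resp.choices[0].message.content)
```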

Changes to the code files
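The rough shape of the change, as a sketch rather than the exact patch: replace the single-device model load with an accelerate-style `device_map` so layers are sharded across the cards. This assumes torch_npu is installed and a transformers/accelerate version that recognizes the NPU backend; the per-card memory cap is a hypothetical value to tune for your hardware.

```python
# Sketch: shard the model across several Ascend NPUs at load time instead of
# moving the whole thing to a single device.
import torch
import torch_npu  # noqa: F401  (registers the NPU backend with PyTorch)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Qwen/Qwen2-7B-Instruct"
num_npus = 4

# Capping per-device memory encourages accelerate to spread layers over all
# cards instead of packing everything onto the first one.
max_memory = {i: "20GiB" for i in range(num_npus)}

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",      # let accelerate assign layers to devices
    max_memory=max_memory,
).eval()
```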

wrennywang commented 2 months ago

@xunmenglt Hi, can you confirm that the KV cache actually uses all 4 cards during inference? On driver 23.0.3 I only see the model weights loaded onto multiple cards, but when the input is large enough to exceed one card's memory, an error is returned.
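One way to check is to poll per-card memory right after loading and again after generating with a very long prompt (or watch npu-smi info from a shell). A sketch, assuming torch_npu mirrors the CUDA memory-stats API under torch.npu:

```python
# Print allocated/reserved memory on each card; if only npu:0 grows during a
# long-prompt generation, the KV cache is not being sharded.
import torch
import torch_npu  # noqa: F401

def report_npu_memory(num_npus: int = 4) -> None:
    for i in range(num_npus):
        allocated = torch.npu.memory_allocated(i) / 2**30
        reserved = torch.npu.memory_reserved(i) / 2**30
        print(f"npu:{i}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
```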

xunmenglt commented 1 month ago

@wrennywang On my side, all the cards are used.