I tried to solve this problem; my fix is below, for reference only 😀
:::info
Linux Ascend910B01 4.19.90-24-4.v2101.ky10.aarch64 GNU/Linux
NPU driver version: 24.1.rc1
CANN version: 8.0
:::
I adapted the Hugging Face release of the Qwen2 model (Qwen2-7B-Instruct) to a Huawei 910B server and used the FastChat framework to serve it behind an OpenAI-compatible API. However, FastChat currently only supports loading and running inference on a single NPU card. To work around this, I modified the FastChat source code as follows (performance not guaranteed).

Modified file: `fastchat/model/model_adapter.py`
```python
def load_model(
    model_path: str,
    device: str = "cuda",
    num_gpus: int = 1,
    max_gpu_memory: Optional[str] = None,
    dtype: Optional[torch.dtype] = None,
    load_8bit: bool = False,
    cpu_offloading: bool = False,
    gptq_config: Optional[GptqConfig] = None,
    awq_config: Optional[AWQConfig] = None,
    exllama_config: Optional[ExllamaConfig] = None,
    xft_config: Optional[XftConfig] = None,
    revision: str = "main",
    debug: bool = False,
):
    # ......
    if device == "cpu":
        # ......
    elif device == "npu":
        kwargs = {"torch_dtype": torch.float16}
        # Importing torch_npu links Ascend NPU support into torch
        try:
            import torch_npu

            # Modification: with more than one card, let Transformers/Accelerate
            # shard the model across the visible NPUs
            if num_gpus != 1:
                kwargs["device_map"] = "auto"
        except ImportError:
            warnings.warn("Ascend Extension for PyTorch is not installed.")
    else:
        raise ValueError(f"Invalid device: {device}")

    # Load model
    model, tokenizer = adapter.load_model(model_path, kwargs)

    # Modification: only move the model to a single device when num_gpus == 1;
    # with device_map="auto" the weights are already placed across cards
    if (
        (device == "cuda" and not cpu_offloading)
        or device in ("mps", "xpu", "npu")
    ) and num_gpus == 1:
        model.to(device)

    # ......
    return model, tokenizer
```
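The heavy lifting here is done by `device_map="auto"`, which hands weight placement to Hugging Face Accelerate. A minimal standalone sketch of the same mechanism outside FastChat (the model path is a placeholder; assumes `torch_npu` and `accelerate` are installed):

```python
# Standalone sketch of the sharding the patch relies on: passing
# device_map="auto" lets Hugging Face Accelerate split the checkpoint
# across all visible NPUs. Assumes torch_npu and accelerate are installed.
import torch
import torch_npu  # registers the "npu" device type with torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/model/Qwen2-7B-Instruct"  # placeholder path
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",  # shard layers across the visible cards
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Shows which device each module was assigned to, e.g. npu:0 .. npu:3
print(model.hf_device_map)
```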
```bash
model_path="/path/to/model/Qwen2-7B-Instruct"
ASCEND_RT_VISIBLE_DEVICES=4,5,6,7 python -m fastchat.serve.cli --model-path $model_path --num-gpus 4 --device npu
```
Note: the same flags also work for the `fastchat.serve.model_worker` script.
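To confirm the shards actually landed on all four cards, you can watch `npu-smi info` in another terminal while the server is loading; with the command above, memory usage should appear on devices 4-7.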
@xunmenglt Hi, can you confirm that the KV cache actually uses all 4 cards during inference? On driver 23.0.3 I only see the model weights loaded across multiple cards, but when the input is large enough to exceed one card's memory, an error is returned.
@wrennywang It does on my side.
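One way to check this yourself (a sketch, assuming `torch_npu` mirrors the `torch.cuda` memory API under `torch.npu`, and that `model`/`tokenizer` were loaded with `device_map="auto"` as above): sample per-card allocated memory before and after a long generation; if the KV cache is sharded, allocation should grow on every card.

```python
# Sketch: per-card memory growth during generation indicates where the
# KV cache lives. Assumes model/tokenizer were loaded with
# device_map="auto", and that torch_npu exposes torch.cuda-style
# memory queries under torch.npu.
import torch
import torch_npu

def report(tag: str, num_cards: int = 4) -> None:
    for i in range(num_cards):
        mib = torch.npu.memory_allocated(i) / 1024**2
        print(f"{tag} npu:{i}: {mib:.0f} MiB allocated")

report("before")
inputs = tokenizer("Hello", return_tensors="pt").to("npu:0")
model.generate(**inputs, max_new_tokens=512)
report("after")  # growth on every card suggests the KV cache is sharded too
```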
Does the FastChat framework support multi-NPU inference? I set num_gpus to 4, but after the model loads it is not distributed evenly across the cards.