I tried to solve this problem; my fix is below, for reference only 😀
:::info
Linux Ascend910B01 4.19.90-24-4.v2101.ky10.aarch64 GNU/Linux
NPU driver version: 24.1.rc1
CANN version: 8.0
:::
I adapted the Hugging Face release of the Qwen2 model (Qwen2-7B-Instruct) to a Huawei 910B server and used the FastChat framework to serve it behind an OpenAI-compatible API. However, FastChat currently only supports loading and running inference on a single NPU card. To work around this, I modified the FastChat source code as follows (performance not guaranteed).

Modified file: `fastchat/model/model_adapter.py`
```python
def load_model(
    model_path: str,
    device: str = "cuda",
    num_gpus: int = 1,
    max_gpu_memory: Optional[str] = None,
    dtype: Optional[torch.dtype] = None,
    load_8bit: bool = False,
    cpu_offloading: bool = False,
    gptq_config: Optional[GptqConfig] = None,
    awq_config: Optional[AWQConfig] = None,
    exllama_config: Optional[ExllamaConfig] = None,
    xft_config: Optional[XftConfig] = None,
    revision: str = "main",
    debug: bool = False,
):
    # ......
    if device == "cpu":
        # ......
    elif device == "npu":
        kwargs = {"torch_dtype": torch.float16}
        # Importing torch_npu links Ascend NPU support into torch
        try:
            import torch_npu

            # Modification: with more than one card, let Transformers/Accelerate
            # shard the model across the visible NPUs
            if num_gpus != 1:
                kwargs["device_map"] = "auto"
        except ImportError:
            warnings.warn("Ascend Extension for PyTorch is not installed.")
    else:
        raise ValueError(f"Invalid device: {device}")

    # Load model
    model, tokenizer = adapter.load_model(model_path, kwargs)

    # Modification: only move the model to a single device when num_gpus == 1;
    # with device_map="auto" the weights are already placed across cards
    if (
        (device == "cuda" and not cpu_offloading)
        or device in ("mps", "xpu", "npu")
    ) and num_gpus == 1:
        model.to(device)

    # ......
    return model, tokenizer
```
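The heavy lifting here is done by `device_map="auto"`, which hands weight placement to Hugging Face Accelerate. A minimal standalone sketch of the same mechanism outside FastChat (the model path is a placeholder; assumes `torch_npu` and `accelerate` are installed):

```python
# Standalone sketch of the sharding the patch relies on: passing
# device_map="auto" lets Hugging Face Accelerate split the checkpoint
# across all visible NPUs. Assumes torch_npu and accelerate are installed.
import torch
import torch_npu  # registers the "npu" device type with torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/model/Qwen2-7B-Instruct"  # placeholder path
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",  # shard layers across the visible cards
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Shows which device each module was assigned to, e.g. npu:0 .. npu:3
print(model.hf_device_map)
```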
```bash
model_path="/path/to/model/Qwen2-7B-Instruct"
ASCEND_RT_VISIBLE_DEVICES=4,5,6,7 python -m fastchat.serve.cli --model-path $model_path --num-gpus 4 --device npu
```
Note: the same flags also work for the `fastchat.serve.model_worker` script.
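To confirm the shards actually landed on all four cards, you can watch `npu-smi info` in another terminal while the server is loading; with the command above, memory usage should appear on devices 4-7.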
@xunmenglt Hi, can you confirm that the KV cache actually uses all 4 cards during inference? On driver 23.0.3 I only see the model weights loaded across multiple cards, but when the input is large enough to exceed one card's memory, an error is returned.
@wrennywang It does on my side.
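One way to check this yourself (a sketch, assuming `torch_npu` mirrors the `torch.cuda` memory API under `torch.npu`, and that `model`/`tokenizer` were loaded with `device_map="auto"` as above): sample per-card allocated memory before and after a long generation; if the KV cache is sharded, allocation should grow on every card.

```python
# Sketch: per-card memory growth during generation indicates where the
# KV cache lives. Assumes model/tokenizer were loaded with
# device_map="auto", and that torch_npu exposes torch.cuda-style
# memory queries under torch.npu.
import torch
import torch_npu

def report(tag: str, num_cards: int = 4) -> None:
    for i in range(num_cards):
        mib = torch.npu.memory_allocated(i) / 1024**2
        print(f"{tag} npu:{i}: {mib:.0f} MiB allocated")

report("before")
inputs = tokenizer("Hello", return_tensors="pt").to("npu:0")
model.generate(**inputs, max_new_tokens=512)
report("after")  # growth on every card suggests the KV cache is sharded too
```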
Does the FastChat framework support multi-NPU inference? I set num_gpus to 4, but after the model loads it is not distributed evenly across the cards.