modelscope / ms-swift

Use PEFT or Full-parameter to finetune 350+ LLMs or 90+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

qwen2.5 vllm engine fail to load on multi-gpu cards #2131

Open ff1Zzd opened 3 days ago

ff1Zzd commented 3 days ago

Describe the bug
I am using get_vllm_engine to load Qwen2.5-72B-Instruct-GPTQ-Int4 and have specified multiple CUDA devices via os.environ, but I noticed that the model is still loading on a single GPU.

Here is my code.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2,3,4,5,6,7,8'
os.environ['USE_HF'] = 'True'

from swift.llm import (
    ModelType, get_vllm_engine, get_default_template_type,
    get_template, inference_vllm, inference_stream_vllm
)

model_type = ModelType.qwen2_5_72b_instruct_gptq_int4
model_id_or_path = None
llm_engine = get_vllm_engine(model_type, model_id_or_path=model_id_or_path)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)
# Interface similar to `transformers.GenerationConfig`
llm_engine.generation_config.max_new_tokens = 256
llm_engine.generation_config.do_sample = False
generation_info = {}

Your hardware and system info
GPU: H100
CUDA version: 12.2
vllm: 0.6.1.post2
transformers: 4.44.2
torch: 2.4.0


Jintao-Huang commented 2 days ago

https://swift.readthedocs.io/zh-cn/latest/LLM/VLLM%E6%8E%A8%E7%90%86%E5%8A%A0%E9%80%9F%E4%B8%8E%E9%83%A8%E7%BD%B2.html#python
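The linked documentation covers multi-GPU (tensor-parallel) inference with the vLLM engine. A minimal sketch of that setup, assuming get_vllm_engine accepts a tensor_parallel_size argument that is forwarded to vLLM (setting CUDA_VISIBLE_DEVICES alone does not shard the model, so a 72B GPTQ-Int4 checkpoint otherwise loads onto one card):

import os
# Environment variables must be set before swift/vllm initializes CUDA.
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2,3,4,5,6,7,8'
os.environ['USE_HF'] = 'True'

from swift.llm import ModelType, get_vllm_engine

model_type = ModelType.qwen2_5_72b_instruct_gptq_int4
# Assumption: tensor_parallel_size is passed through to vLLM, sharding the
# model across all 8 visible GPUs instead of loading it on a single card.
llm_engine = get_vllm_engine(model_type, tensor_parallel_size=8)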