lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

Model loading hangs when using vllm_worker #2833

Open wfs420100 opened 10 months ago

wfs420100 commented 10 months ago

Problem description

When launching on GPU 2 and GPU 3, the worker hangs and makes no progress (while the pairings 1,2 and 1,3 both start fine).

Environment: CUDA 12.1.0, driver 535.54.03, torch 2.1.2, fschat 0.2.34, vllm 0.2.6, ray 2.8.1

Launch command

CUDA_VISIBLE_DEVICES="2,3" python -m fastchat.serve.vllm_worker \
  --model-names="qwen-72b-chat" \
  --model-path="/Models/Qwen-72B-Chat" \
  --controller-address=${CONTROLLER_ADDRESS} \
  --worker-address=${WORKER_ADDRESS} \
  --host=${WORKER_HOST} \
  --port=${WORKER_PORT} \
  --trust-remote-code \
  --gpu-memory-utilization=0.98 \
  --dtype=bfloat16 \
  --tensor-parallel-size=2 \
  > z_server_worker.log 2>&1

Log output

2023-12-19 07:10:25,057 INFO worker.py:1673 -- Started a local Ray instance.
INFO 12-19 07:10:27 llm_engine.py:73] Initializing an LLM engine with config: model='/Models/Qwen-72B-Chat', tokenizer='/Models/Qwen-72B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)
WARNING 12-19 07:10:28 tokenizer.py:62] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.

nvidia-smi

[screenshot: nvidia-smi output]
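Since the log stops right after engine initialization, which is where the two tensor-parallel workers set up their NCCL communicators, a useful next step (not part of the original report, just a minimal debugging sketch) is to rerun with NCCL debug logging. NCCL_DEBUG and NCCL_P2P_DISABLE are standard NCCL environment variables; the flag set below is abbreviated to the ones relevant here.

# Re-run with NCCL debug logging to see where communicator setup stalls.
# NCCL_P2P_DISABLE=1 routes traffic through host memory instead of the
# peer-to-peer link; if the hang disappears, the P2P path between the
# two cards is the likely culprit.
NCCL_DEBUG=INFO NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES="2,3" \
python -m fastchat.serve.vllm_worker \
  --model-path="/Models/Qwen-72B-Chat" \
  --tensor-parallel-size=2 \
  --trust-remote-code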

slliao445 commented 3 months ago

Same problem here. Did you find the cause or a solution?

wfs420100 commented 3 months ago

> Same problem here. Did you find the cause or a solution?

It may be a communication problem between the GPU cards (not certain); switching to a different combination of GPU IDs worked around it.
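The interconnect hypothesis is easy to probe. A minimal sketch, assuming the standard NVIDIA tools are available (nvidia-smi, plus the p2pBandwidthLatencyTest binary built from NVIDIA's cuda-samples):

# Print the link type between every GPU pair (NV#, PIX, PHB, SYS, ...).
# A pair that hangs may sit on a slower or misbehaving link than a pair
# that works.
nvidia-smi topo -m

# Exercise peer-to-peer copies between all GPU pairs; a broken P2P path
# typically shows up as a stall or abnormally low bandwidth.
./p2pBandwidthLatencyTest

If GPUs 2 and 3 turn out to be connected only via SYS (through the CPU/PCIe root complex) while the working pairs share a faster link, that would be consistent with the behavior described above.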