vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Nothing happens on a single GPU (model fails to load with tensor_parallel_size=1) #7136

Open efficentdet opened 1 month ago

efficentdet commented 1 month ago

Your current environment

Problem

🐛 Describe the bug

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

import torch

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('/ProjectRoot/long_content_LLM/qwen/Qwen2-1___5B-Instruct')

texts = []
# Prepare your prompts
# Define the batch of prompts
prompts = [
    "宪法规定的公民法律义务有",
    "属于专门人民法院的是",
    "无效婚姻的种类包括",
    "刑事案件定义",
    "税收法律制度",
]
for prompt in prompts:
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    texts.append(text)

sampling_params = SamplingParams(temperature=0.1, top_p=0.5, max_tokens=4096)
path = '/ProjectRoot/long_content_LLM/qwen/Qwen2-1___5B-Instruct'
llm = LLM(model=path, trust_remote_code=True, tokenizer_mode="auto", tensor_parallel_size=2, dtype=torch.float16)
outputs = llm.generate(texts, sampling_params)

# Print the results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The code above runs fine, but when I change tensor_parallel_size from 2 to 1 in order to run offline inference on a single GPU, execution reaches

llm = LLM(model=path, trust_remote_code=True, tokenizer_mode="auto", tensor_parallel_size=1, dtype=torch.float16)

and at this step it only prints the output below, then hangs forever without raising any error:

$sudo CUDA_VISIBLE_DEVICES=0 PYTHONPATH="/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/" python vllm_test.py
WARNING 08-05 11:02:58 config.py:1425] Casting torch.bfloat16 to torch.float16.
INFO 08-05 11:02:58 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/ProjectRoot/long_content_LLM/qwen/Qwen2-1___5B-Instruct', speculative_config=None, tokenizer='/ProjectRoot/long_content_LLM/qwen/Qwen2-1___5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/ProjectRoot/long_content_LLM/qwen/Qwen2-1___5B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-05 11:02:59 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-05 11:02:59 selector.py:54] Using XFormers backend.
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:36893 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [11-88-234-70.gpu-exporter.prometheus.svc.cluster.local]:36893 (errno: 97 - Address family not supported by protocol).

The output above is the last thing displayed; after that it just stays like this and never raises an error. How can I solve this? Please help.

efficentdet commented 1 month ago

It eventually raised a timeout error. How can I fix it? This is really urgent.

$sudo CUDA_VISIBLE_DEVICES=0 PYTHONPATH="/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/" python vllm_test.py

(same initialization log and socket warnings as above, followed by:)

[E socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (11.88.234.70, 36893).
Traceback (most recent call last):
  File "vllm_test.py", line 32, in <module>
    llm = LLM(model=path, trust_remote_code=True, tensor_parallel_size=1, dtype=torch.float16)
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 155, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 441, in from_engine_args
    engine = cls(
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 251, in __init__
    self.model_executor = executor_class(
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/executor/gpu_executor.py", line 35, in _init_executor
    self.driver_worker.init_device()
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/worker/worker.py", line 132, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/worker/worker.py", line 343, in init_worker_distributed_environment
    init_distributed_environment(parallel_config.world_size, rank,
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/distributed/parallel_state.py", line 812, in init_distributed_environment
    torch.distributed.init_process_group(
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
    func_return = func(*args, **kwargs)
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv)
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
torch.distributed.DistNetworkError: The client socket has timed out after 600s while trying to connect to (11.88.234.70, 36893).

DC-Shi commented 1 month ago

It seems this is an address problem. The server socket was started on IPv6 and failed (there is no log showing it ever started on IPv4):
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:36893 (errno: 97 - Address family not supported by protocol).
But the client then tries to connect to the port over IPv4:
[E socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (11.88.234.70, 36893).

You can try changing the IP by setting the environment variable VLLM_HOST_IP.
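A minimal sketch of that suggestion, assuming a single-node, single-GPU run; the address 127.0.0.1 is only illustrative (any IPv4 address this machine can actually reach should work), and the rest of the snippet just mirrors the script from the issue:

import os

# Assumption: point vLLM's distributed init at a reachable IPv4 address.
# Set the variable before constructing LLM so it is picked up when the
# distributed environment is initialized.
os.environ["VLLM_HOST_IP"] = "127.0.0.1"

import torch
from vllm import LLM

path = '/ProjectRoot/long_content_LLM/qwen/Qwen2-1___5B-Instruct'
llm = LLM(model=path, trust_remote_code=True, tokenizer_mode="auto",
          tensor_parallel_size=1, dtype=torch.float16)

Equivalently, it can be set on the command line, e.g. VLLM_HOST_IP=127.0.0.1 python vllm_test.py.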

eyuansu62 commented 1 day ago

@DC-Shi Hello, I changed VLLM_HOST_IP to, for example, 0.0.0.0, but it still fails. May I ask how you change VLLM_HOST_IP?