vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

ConnectionResetError: [Errno 104] Connection reset by peer #3115

Open · allenhaozi opened this issue 9 months ago

allenhaozi commented 9 months ago

I occasionally encounter the following error:

+ python3 -m vllm.entrypoints.openai.api_server --host xxxxx --port 8003 --served-model-name qwen1.5-72b-chat-int4 --model /home/vllm/model/Qwen1.5-72B-Chat-GPTQ-Int4 --trust-remote-code --tokenizer-mode auto --max-num-batched-tokens 32768 --tensor-parallel-size 4
INFO 02-29 14:28:09 api_server.py:228] args: Namespace(host='xxxxx', port=8003, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='qwen1.5-72b-chat-int4', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/home/vllm/model/Qwen1.5-72B-Chat-GPTQ-Int4', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=32768, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 02-29 14:28:09 config.py:186] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-29 14:28:09 config.py:421] Custom all-reduce kernels are temporarily disabled due to stability issues. We will re-enable them once the issues are resolved.
2024-02-29 14:28:12,795 INFO worker.py:1724 -- Started a local Ray instance.
INFO 02-29 14:28:15 llm_engine.py:87] Initializing an LLM engine with config: model='/home/vllm/model/Qwen1.5-72B-Chat-GPTQ-Int4', tokenizer='/home/vllm/model/Qwen1.5-72B-Chat-GPTQ-Int4', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/vllm/entrypoints/openai/api_server.py", line 236, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/workspace/vllm/engine/async_llm_engine.py", line 625, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/workspace/vllm/engine/async_llm_engine.py", line 321, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/workspace/vllm/engine/async_llm_engine.py", line 366, in _init_engine
    return engine_class(*args, **kwargs)
  File "/workspace/vllm/engine/llm_engine.py", line 126, in __init__
    self._init_workers_ray(placement_group)
  File "/workspace/vllm/engine/llm_engine.py", line 303, in _init_workers_ray
    self._run_workers("init_model",
  File "/workspace/vllm/engine/llm_engine.py", line 1036, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/workspace/vllm/worker/worker.py", line 94, in init_model
    init_distributed_environment(self.parallel_config, self.rank,
  File "/workspace/vllm/worker/worker.py", line 275, in init_distributed_environment
    cupy_utils.init_process_group(
  File "/workspace/vllm/model_executor/parallel_utils/cupy_utils.py", line 90, in init_process_group
    _NCCL_BACKEND = NCCLBackendWithBFloat16(world_size, rank, host, port)
  File "/usr/local/lib/python3.10/dist-packages/cupyx/distributed/_nccl_comm.py", line 70, in __init__
    self._init_with_tcp_store(n_devices, rank, host, port)
  File "/usr/local/lib/python3.10/dist-packages/cupyx/distributed/_nccl_comm.py", line 94, in _init_with_tcp_store
    self._store_proxy['nccl_id'] = shifted_nccl_id
  File "/usr/local/lib/python3.10/dist-packages/cupyx/distributed/_store.py", line 148, in __setitem__
    self._send_recv(_store_actions.Set(key, value))
  File "/usr/local/lib/python3.10/dist-packages/cupyx/distributed/_store.py", line 130, in _send_recv
    result_bytes = s.recv(sizeof(
ConnectionResetError: [Errno 104] Connection reset by peer
mzz12 commented 8 months ago

Hi,

Have you solved this problem? I am encountering the same issue.

allenhaozi commented 8 months ago

Hi,

Have you solved this problem? I am encountering the same issue.

This issue hasn't been resolved; I still encounter it occasionally.

rkooo567 commented 8 months ago

I think there's an issue with the cupy backend that's used for tensor parallelism.

If you use enforce_eager=True, it will likely be resolved (though it will affect performance). Regarding the error itself, I think https://github.com/cupy/cupy is probably a better place to report it.
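
For the OpenAI-compatible server, that corresponds to adding --enforce-eager to the launch command above. Here is a minimal offline sketch of the same setting (model path and parallelism copied from the report, not verified on my side):

from vllm import LLM

# enforce_eager=True skips CUDA-graph capture, and with it the cupy NCCL
# handshake that raises the ConnectionResetError in the traceback above.
llm = LLM(
    model="/home/vllm/model/Qwen1.5-72B-Chat-GPTQ-Int4",
    quantization="gptq",
    trust_remote_code=True,
    tensor_parallel_size=4,
    enforce_eager=True,
)
print(llm.generate("Hello")[0].outputs[0].text)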

mzz12 commented 8 months ago

I think there's an issue with the cupy backend that's used for tensor parallelism.

If you use enforce_eager=True, it will likely be resolved (though it will affect performance). Regarding the error itself, I think https://github.com/cupy/cupy is probably a better place to report it.

Hi,

Thanks for your suggestion. I will report it to the cupy project. However, since vLLM must have run successfully in a multi-node environment at some point before release, while my deployment fails every time, I think there must be something environment-related hindering the deployment.

rkooo567 commented 8 months ago

I think the cupy backend was introduced recently for CUDA graph support (which is disabled by enforce_eager=True). My guess is that this backend does not work well in some environments, but it is pretty difficult to troubleshoot without reproducing the issue. If you can tell me your instance details, I can try to reproduce it.
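
If it helps to isolate the environment problem, below is a rough standalone sketch (mine, not part of vLLM) that exercises only the cupyx TCP-store handshake failing in the traceback above. It assumes the public cupyx.distributed.NCCLBackend constructor matches the internal call shown there; WORLD_SIZE, HOST and PORT are placeholders for the actual tensor-parallel size and head-node address:

import multiprocessing as mp

import cupy
from cupyx.distributed import NCCLBackend

WORLD_SIZE = 4                     # e.g. --tensor-parallel-size 4 as in the report
HOST, PORT = "127.0.0.1", 13333    # placeholders; use the real head-node host/port

def worker(rank: int) -> None:
    cupy.cuda.Device(rank).use()
    # Rank 0 hosts the TCP store; the other ranks connect to it to exchange
    # the NCCL unique id. A reset during that exchange is exactly the
    # ConnectionResetError: [Errno 104] reported above.
    comm = NCCLBackend(WORLD_SIZE, rank, host=HOST, port=PORT)
    x = cupy.ones(8, dtype=cupy.float32)
    out = cupy.empty_like(x)
    comm.all_reduce(x, out)        # expect every element to equal WORLD_SIZE
    print(f"rank {rank}: {out}")

if __name__ == "__main__":
    mp.set_start_method("spawn")
    procs = [mp.Process(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

If all ranks print the expected result, the cupy/NCCL path itself is probably fine and the problem is more likely in how the host and port for the store are chosen in your environment.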

github-actions[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!