ray-project / ray


[Ray Cluster] torch.distributed.DistBackendError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, internal error #1247 #44533

Closed: NavinKumarMNK closed this issue 5 months ago

NavinKumarMNK commented 5 months ago

What happened + What you expected to happen

example.py

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="/data/yi-34b", 
    dtype="float16", 
    tensor_parallel_size=4, 
    enforce_eager=True, 
    trust_remote_code=True, 
    load_format='safetensors',
    # quantization="AWQ",
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

When I run the script, I get the error below (the same error occurs without the first two lines of the terminal session):


root@vitccpowerai:/data# export NCCL_IB_DISABLE=1
root@vitccpowerai:/data# export NCCL_P2P_DISABLE=1
root@vitccpowerai:/data# NCCL_DEBUG=INFO python example.py 
WARNING 04-07 00:57:30 config.py:686] Casting torch.bfloat16 to torch.float16.
2024-04-07 00:57:30,511 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 172.16.0.57:6379...
2024-04-07 00:57:30,527 INFO worker.py:1612 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
INFO 04-07 00:57:30 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/data/yi-34b', tokenizer='/data/yi-34b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=safetensors, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
vitccpowerai:7392:7392 [0] NCCL INFO Bootstrap : Using enP48p1s0f0:172.16.0.57<0>
vitccpowerai:7392:7392 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
vitccpowerai:7392:7392 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
vitccpowerai:7392:7392 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.6+cuda12.2
vitccpowerai:7392:7679 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
vitccpowerai:7392:7679 [0] NCCL INFO NET/Socket : Using [0]enP48p1s0f0:172.16.0.57<0> [1]br-1cd47c6ec214:172.18.0.1<0> [2]enP5p1s0f0:fe80::a94:efff:fe80:3939%enP5p1s0f0<0> [3]vethbe44c5f:fe80::4c7c:5dff:fec7:6249%vethbe44c5f<0>
vitccpowerai:7392:7679 [0] NCCL INFO Using network Socket
vitccpowerai:7392:7679 [0] NCCL INFO comm 0x944a9811b50 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 404000 commId 0x7d2c7b51f234440f - Init START

vitccpowerai:7392:7679 [0] graph/xml.h:85 NCCL WARN Attribute busid of node nic not found
vitccpowerai:7392:7679 [0] NCCL INFO graph/xml.cc:585 -> 3
vitccpowerai:7392:7679 [0] NCCL INFO graph/xml.cc:767 -> 3
vitccpowerai:7392:7679 [0] NCCL INFO graph/topo.cc:655 -> 3
vitccpowerai:7392:7679 [0] NCCL INFO init.cc:840 -> 3
vitccpowerai:7392:7679 [0] NCCL INFO init.cc:1358 -> 3
vitccpowerai:7392:7679 [0] NCCL INFO group.cc:65 -> 3 [Async thread]
vitccpowerai:7392:7392 [0] NCCL INFO group.cc:406 -> 3
vitccpowerai:7392:7392 [0] NCCL INFO group.cc:96 -> 3
Traceback (most recent call last):
  File "/data/example.py", line 10, in <module>
    llm = LLM(
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/engine/llm_engine.py", line 146, in from_engine_args
    engine = cls(*engine_configs,
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/engine/llm_engine.py", line 103, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/executor/ray_gpu_executor.py", line 60, in __init__
    self._init_workers_ray(placement_group)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/executor/ray_gpu_executor.py", line 190, in _init_workers_ray
    self._run_workers("init_device",
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/executor/ray_gpu_executor.py", line 318, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 92, in init_device
    init_distributed_environment(self.parallel_config, self.rank,
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 284, in init_distributed_environment
    torch.distributed.all_reduce(torch.zeros(1).cuda())
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, internal error - please report this issue to the NCCL developers, NCCL version 2.18.6
ncclInternalError: Internal check failed.
Last error:
Attribute busid of node nic not found
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45] Traceback (most recent call last):
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/engine/ray_utils.py", line 38, in execute_method
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]     return executor(*args, **kwargs)
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 92, in init_device
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]     init_distributed_environment(self.parallel_config, self.rank,
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 284, in init_distributed_environment
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]     torch.distributed.all_reduce(torch.zeros(1).cuda())
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]     return func(*args, **kwargs)
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]     work = group.allreduce([tensor], opts)
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45] torch.distributed.DistBackendError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, internal error - please report this issue to the NCCL developers, NCCL version 2.18.6
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45] ncclInternalError: Internal check failed.
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45] Last error:
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45] Attribute busid of node nic not found
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45] Traceback (most recent call last):
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/engine/ray_utils.py", line 38, in execute_method
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]     return executor(*args, **kwargs)
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 92, in init_device
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]     init_distributed_environment(self.parallel_config, self.rank,
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 284, in init_distributed_environment
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]     torch.distributed.all_reduce(torch.zeros(1).cuda())
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]     return func(*args, **kwargs)
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]     work = group.allreduce([tensor], opts)
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45] torch.distributed.DistBackendError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, internal error - please report this issue to the NCCL developers, NCCL version 2.18.6
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45] ncclInternalError: Internal check failed.
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45] Last error:
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45] Attribute busid of node nic not found
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45] Traceback (most recent call last):
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/engine/ray_utils.py", line 38, in execute_method
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]     return executor(*args, **kwargs)
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 92, in init_device
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]     init_distributed_environment(self.parallel_config, self.rank,
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 284, in init_distributed_environment
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]     torch.distributed.all_reduce(torch.zeros(1).cuda())
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]     return func(*args, **kwargs)
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]     work = group.allreduce([tensor], opts)
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45] torch.distributed.DistBackendError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, internal error - please report this issue to the NCCL developers, NCCL version 2.18.6
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45] ncclInternalError: Internal check failed.
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45] Last error:
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45] Attribute busid of node nic not found

System Env:

root@vitccpowerai:/data# python3 collect_env.py 
Collecting environment information...
PyTorch version: 2.1.2
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (ppc64le)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.35

Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 16:04:32) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-101-generic-ppc64le-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.91
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_adv_infer.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_adv_train.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_cnn_infer.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_cnn_train.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_ops_infer.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False

CPU:
Architecture:                       ppc64le
Byte Order:                         Little Endian
CPU(s):                             128
On-line CPU(s) list:                0-127
Model name:                         POWER9, altivec supported
Model:                              2.2 (pvr 004e 1202)
Thread(s) per core:                 4
Core(s) per socket:                 16
Socket(s):                          2
Frequency boost:                    enabled
CPU max MHz:                        3800.0000
CPU min MHz:                        2300.0000
L1d cache:                          1 MiB (32 instances)
L1i cache:                          1 MiB (32 instances)
L2 cache:                           8 MiB (16 instances)
L3 cache:                           160 MiB (16 instances)
NUMA node(s):                       6
NUMA node0 CPU(s):                  0-63
NUMA node8 CPU(s):                  64-127
NUMA node252 CPU(s):                
NUMA node253 CPU(s):                
NUMA node254 CPU(s):                
NUMA node255 CPU(s):                
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Mitigation; RFI Flush, L1D private per thread
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Mitigation; RFI Flush, L1D private per thread
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
Vulnerability Spectre v2:           Mitigation; Indirect branch serialisation (kernel only)
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] onnx==1.16.0
[pip3] onnxruntime-gpu==1.15.1
[pip3] torch==2.1.2
[conda] cudatoolkit               11.8.0              hedcfb66_13    conda-forge
[conda] libmagma                  2.7.2                he288b6c_2    conda-forge
[conda] libmagma_sparse           2.7.2                h5b5c57a_3    conda-forge
[conda] magma                     2.7.2                h097a1ca_3    conda-forge
[conda] numpy                     1.24.3          py310h87cc683_0  
[conda] numpy-base                1.24.3          py310hac71eb6_0  
[conda] torch                     2.1.2                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.3.3
vLLM Build Flags:
CUDA Archs: 7.0; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV3     SYS     SYS     0-63    0               N/A
GPU1    NV3      X      SYS     SYS     0-63    0               N/A
GPU2    SYS     SYS      X      NV3     64-127  8               N/A
GPU3    SYS     SYS     NV3      X      64-127  8               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Note: this script runs smoothly when I manually create a single container. But when the container is created as part of a Ray cluster launch, it does not work (I logged in to the head node and ran the script there; there is actually only a head node in my config). My cluster config (ray-cluster.yaml):

# This YAML file contains the configuration for a Ray cluster.
# It specifies the cluster name, provider type, and the IP addresses of the head and worker nodes.
cluster_name: xxx

# Run ray in containers
docker: 
  image: "ml-service"
  container_name: "ml-service"
  pull_before_run: false
  run_options:
    - --runtime=nvidia
    - --gpus all
    - --ipc=host
    - --privileged
    - -v "/data/xxx":"/data"
    - -p 8000:8000
    - --shm-size=128gb

# The 'provider' section specifies the type of provider and the IP addresses of the head node and worker nodes.
provider:
  type: local
  head_ip: xxx.xxx.xxx.xxx
  worker_ips: []

auth:
  ssh_user: root  # The SSH username for authentication
  ssh_private_key: ~/.ssh/id_rsa

min_workers: 0  # Minimum number of workers in the cluster
max_workers: 0  # Maximum number of workers in the cluster
upscaling_speed: 1.0  # Speed at which the cluster scales up
idle_timeout_minutes: 3  # Timeout in minutes for idle workers to be terminated

file_mounts: {
  "/app":"."  
}

rsync_exclude:
  - "**/.git"
  - "**/.git/**"
  - "*.tar.*"

rsync_filter:
  - ".gitignore"
  - "__pycache__"
  - "=*"
  - "*.pyc"
  - "*.tar.*"
  - ".git"

file_mounts_sync_continuously: true

# The commands to start Ray on the head node.
head_start_ray_commands:
  - ray stop
  - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0  --object-manager-port=8076

# The commands to start Ray workers in the cluster.
worker_start_ray_commands:
  - ray stop
  - export RAY_HEAD_IP && echo "export RAY_HEAD_IP=$RAY_HEAD_IP" >> ~/.bashrc && ray start --address=$RAY_HEAD_IP:6379   --object-manager-port=8076
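
The cluster is brought up from this file with the standard Ray cluster launcher workflow; a minimal sketch, assuming the config above is saved as ray-cluster.yaml:

# Launch (or update) the cluster from the config, then open a shell on the head node.
ray up ray-cluster.yaml
ray attach ray-cluster.yaml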

Versions / Dependencies

ray: 2.6.3
python: 3.10.13
torch: 2.1.2
vllm: 0.3.3
os: ubuntu-22.04 (ppc64le)

Reproduction script

example.py

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="/data/yi-34b", 
    dtype="float16", 
    tensor_parallel_size=4, 
    enforce_eager=True, 
    trust_remote_code=True, 
    load_format='safetensors',
    # quantization="AWQ",
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Running this script inside the container started by the Ray cluster reproduces the error. (It works fine without any error if I run it manually in a container created from the same image.)
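
For reference, the manual run that works looks roughly like the following, assembled from the run_options in ray-cluster.yaml above. This is a sketch rather than my exact command; note that it passes no --net=host, so Docker's default bridge network is used:

# Manual container start from the same image (sketch; flags mirror run_options above).
docker run -it \
  --runtime=nvidia \
  --gpus all \
  --ipc=host \
  --privileged \
  -v /data/xxx:/data \
  -p 8000:8000 \
  --shm-size=128gb \
  --name ml-service \
  ml-service bash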

Issue Severity

High - Blocking My Project

NavinKumarMNK commented 5 months ago

In my case, when the container uses the bridge network it works fine; when it is connected through the host network, it produces exactly this problem.

The reason this occurs only on nodes started by the Ray cluster is that, by default, those containers use the host network: --net=host is passed by the Ray cluster launcher when it creates the container.
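
As a quick check (standard Docker CLI; the container name is the one from my ray-cluster.yaml), you can confirm which network mode the Ray-launched container received. The NCCL_SOCKET_IFNAME line is only a possible mitigation that I have not verified here; the interface name is taken from the NCCL log above:

# Prints "host" if the container was created with --net=host,
# otherwise "default"/"bridge" for the bridge network.
docker inspect --format '{{.HostConfig.NetworkMode}}' ml-service

# Possible mitigation when host networking cannot be avoided (untested here):
# pin NCCL to the physical NIC so topology detection skips the docker/veth interfaces.
export NCCL_SOCKET_IFNAME=enP48p1s0f0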

sleepwalker2017 commented 4 months ago

Hello, I have a question: when I create the container using the bridge network, the IP inside the container is wrong, so the Ray head can't connect to the Ray worker.

Could you share the command you use to start the container? Thank you!