[Ray Cluster] torch.distributed.DistBackendError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, internal error #1247

What happened + What you expected to happen

example.py

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="/data/yi-34b", 
    dtype="float16", 
    tensor_parallel_size=4, 
    enforce_eager=True, 
    trust_remote_code=True, 
    load_format='safetensors',
    # quantization="AWQ",
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

when i run the script, (same error without first two lines in the below terminal commands)


root@vitccpowerai:/data# export NCCL_IB_DISABLE=1
root@vitccpowerai:/data# export NCCL_P2P_DISABLE=1
root@vitccpowerai:/data# NCCL_DEBUG=INFO python example.py 
WARNING 04-07 00:57:30 config.py:686] Casting torch.bfloat16 to torch.float16.
2024-04-07 00:57:30,511 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 172.16.0.57:6379...
2024-04-07 00:57:30,527 INFO worker.py:1612 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
INFO 04-07 00:57:30 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/data/yi-34b', tokenizer='/data/yi-34b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=safetensors, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
vitccpowerai:7392:7392 [0] NCCL INFO Bootstrap : Using enP48p1s0f0:172.16.0.57<0>
vitccpowerai:7392:7392 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
vitccpowerai:7392:7392 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
vitccpowerai:7392:7392 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.6+cuda12.2
vitccpowerai:7392:7679 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
vitccpowerai:7392:7679 [0] NCCL INFO NET/Socket : Using [0]enP48p1s0f0:172.16.0.57<0> [1]br-1cd47c6ec214:172.18.0.1<0> [2]enP5p1s0f0:fe80::a94:efff:fe80:3939%enP5p1s0f0<0> [3]vethbe44c5f:fe80::4c7c:5dff:fec7:6249%vethbe44c5f<0>
vitccpowerai:7392:7679 [0] NCCL INFO Using network Socket
vitccpowerai:7392:7679 [0] NCCL INFO comm 0x944a9811b50 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 404000 commId 0x7d2c7b51f234440f - Init START

vitccpowerai:7392:7679 [0] graph/xml.h:85 NCCL WARN Attribute busid of node nic not found
vitccpowerai:7392:7679 [0] NCCL INFO graph/xml.cc:585 -> 3
vitccpowerai:7392:7679 [0] NCCL INFO graph/xml.cc:767 -> 3
vitccpowerai:7392:7679 [0] NCCL INFO graph/topo.cc:655 -> 3
vitccpowerai:7392:7679 [0] NCCL INFO init.cc:840 -> 3
vitccpowerai:7392:7679 [0] NCCL INFO init.cc:1358 -> 3
vitccpowerai:7392:7679 [0] NCCL INFO group.cc:65 -> 3 [Async thread]
vitccpowerai:7392:7392 [0] NCCL INFO group.cc:406 -> 3
vitccpowerai:7392:7392 [0] NCCL INFO group.cc:96 -> 3
Traceback (most recent call last):
  File "/data/example.py", line 10, in <module>
    llm = LLM(
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/engine/llm_engine.py", line 146, in from_engine_args
    engine = cls(*engine_configs,
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/engine/llm_engine.py", line 103, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/executor/ray_gpu_executor.py", line 60, in __init__
    self._init_workers_ray(placement_group)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/executor/ray_gpu_executor.py", line 190, in _init_workers_ray
    self._run_workers("init_device",
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/executor/ray_gpu_executor.py", line 318, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 92, in init_device
    init_distributed_environment(self.parallel_config, self.rank,
  File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 284, in init_distributed_environment
    torch.distributed.all_reduce(torch.zeros(1).cuda())
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, internal error - please report this issue to the NCCL developers, NCCL version 2.18.6
ncclInternalError: Internal check failed.
Last error:
Attribute busid of node nic not found
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45] Traceback (most recent call last):
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/engine/ray_utils.py", line 38, in execute_method
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]     return executor(*args, **kwargs)
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 92, in init_device
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]     init_distributed_environment(self.parallel_config, self.rank,
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 284, in init_distributed_environment
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]     torch.distributed.all_reduce(torch.zeros(1).cuda())
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]     return func(*args, **kwargs)
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45]     work = group.allreduce([tensor], opts)
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45] torch.distributed.DistBackendError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, internal error - please report this issue to the NCCL developers, NCCL version 2.18.6
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45] ncclInternalError: Internal check failed.
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45] Last error:
(RayWorkerVllm pid=7500) ERROR 04-07 00:57:46 ray_utils.py:45] Attribute busid of node nic not found
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45] Traceback (most recent call last):
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/engine/ray_utils.py", line 38, in execute_method
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]     return executor(*args, **kwargs)
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 92, in init_device
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]     init_distributed_environment(self.parallel_config, self.rank,
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 284, in init_distributed_environment
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]     torch.distributed.all_reduce(torch.zeros(1).cuda())
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]     return func(*args, **kwargs)
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45]     work = group.allreduce([tensor], opts)
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45] torch.distributed.DistBackendError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, internal error - please report this issue to the NCCL developers, NCCL version 2.18.6
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45] ncclInternalError: Internal check failed.
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45] Last error:
(RayWorkerVllm pid=7551) ERROR 04-07 00:57:46 ray_utils.py:45] Attribute busid of node nic not found
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45] Traceback (most recent call last):
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/engine/ray_utils.py", line 38, in execute_method
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]     return executor(*args, **kwargs)
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 92, in init_device
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]     init_distributed_environment(self.parallel_config, self.rank,
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/vllm-0.3.3+cu122-py3.10-linux-ppc64le.egg/vllm/worker/worker.py", line 284, in init_distributed_environment
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]     torch.distributed.all_reduce(torch.zeros(1).cuda())
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]     return func(*args, **kwargs)
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45]     work = group.allreduce([tensor], opts)
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45] torch.distributed.DistBackendError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, internal error - please report this issue to the NCCL developers, NCCL version 2.18.6
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45] ncclInternalError: Internal check failed.
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45] Last error:
(RayWorkerVllm pid=7602) ERROR 04-07 00:57:46 ray_utils.py:45] Attribute busid of node nic not found

System Env:

root@vitccpowerai:/data# python3 collect_env.py 
Collecting environment information...
PyTorch version: 2.1.2
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (ppc64le)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.35

Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 16:04:32) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-101-generic-ppc64le-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.91
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_adv_infer.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_adv_train.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_cnn_infer.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_cnn_train.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_ops_infer.so.8.9.5
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False

CPU:
Architecture:                       ppc64le
Byte Order:                         Little Endian
CPU(s):                             128
On-line CPU(s) list:                0-127
Model name:                         POWER9, altivec supported
Model:                              2.2 (pvr 004e 1202)
Thread(s) per core:                 4
Core(s) per socket:                 16
Socket(s):                          2
Frequency boost:                    enabled
CPU max MHz:                        3800.0000
CPU min MHz:                        2300.0000
L1d cache:                          1 MiB (32 instances)
L1i cache:                          1 MiB (32 instances)
L2 cache:                           8 MiB (16 instances)
L3 cache:                           160 MiB (16 instances)
NUMA node(s):                       6
NUMA node0 CPU(s):                  0-63
NUMA node8 CPU(s):                  64-127
NUMA node252 CPU(s):                
NUMA node253 CPU(s):                
NUMA node254 CPU(s):                
NUMA node255 CPU(s):                
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Mitigation; RFI Flush, L1D private per thread
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Mitigation; RFI Flush, L1D private per thread
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
Vulnerability Spectre v2:           Mitigation; Indirect branch serialisation (kernel only)
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] onnx==1.16.0
[pip3] onnxruntime-gpu==1.15.1
[pip3] torch==2.1.2
[conda] cudatoolkit               11.8.0              hedcfb66_13    conda-forge
[conda] libmagma                  2.7.2                he288b6c_2    conda-forge
[conda] libmagma_sparse           2.7.2                h5b5c57a_3    conda-forge
[conda] magma                     2.7.2                h097a1ca_3    conda-forge
[conda] numpy                     1.24.3          py310h87cc683_0  
[conda] numpy-base                1.24.3          py310hac71eb6_0  
[conda] torch                     2.1.2                    pypi_0    pypiROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.3.3
vLLM Build Flags:
CUDA Archs: 7.0; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV3     SYS     SYS     0-63    0               N/A
GPU1    NV3      X      SYS     SYS     0-63    0               N/A
GPU2    SYS     SYS      X      NV3     64-127  8               N/A
GPU3    SYS     SYS     NV3      X      64-127  8               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

! This script runs smoothly when i manually create a single container. But when it is created as a part ray-cluster launch, its not working (i logged in to the head node and if i run, actually there was only head node in my config) ray-cluster.yaml

# This YAML file contains the configuration for a Ray cluster.
# It specifies the cluster name, provider type, IP addresses of the head and worker nodes,
cluster_name: xxx

# Run ray in containers
docker: 
  image: "ml-service"
  container_name: "ml-service"
  pull_before_run: false
  run_options:
    - --runtime=nvidia
    - --gpus all
    - --ipc=host
    - --privileged
    - -v "/data/xxx":"/data"
    - -p 8000:8000
    - --shm-size=128gb

# The 'provider' section specifies the type of provider and the IP addresses of the head node and worker nodes.
provider:
  type: local
  head_ip: xxx.xxx.xxx.xxx
  worker_ips: []

auth:
  ssh_user: root  # The SSH username for authentication
  ssh_private_key: ~/.ssh/id_rsa

min_workers: 0  # Minimum number of workers in the cluster
max_workers: 0  # Maximum number of workers in the cluster
upscaling_speed: 1.0  # Speed at which the cluster scales up
idle_timeout_minutes: 3  # Timeout in minutes for idle workers to be terminated

file_mounts: {
  "/app":"."  
}

rsync_exclude:
  - "**/.git"
  - "**/.git/**"
  - "*.tar.*"

rsync_filter:
  - ".gitignore"
  - "__pycache__"
  - "=*"
  - "*.pyc"
  - "*.tar.*"
  - ".git"

file_mounts_sync_continuously: true

# This YAML file contains the configuration for starting and stopping Ray clusters.
head_start_ray_commands:
  - ray stop
  - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0  --object-manager-port=8076

# The commands to start Ray workers in the cluster.
worker_start_ray_commands:
  - ray stop
  - export RAY_HEAD_IP && echo "export RAY_HEAD_IP=$RAY_HEAD_IP" >> ~/.bashrc && ray start --address=$RAY_HEAD_IP:6379   --object-manager-port=8076

Versions / Dependencies

ray version : 2.6.3 python: 3.10.13 torch: 2.1.2 vllm : 0.3.3

os : ubuntu-22.04::ppc64le

Reproduction script

example.py

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="/data/yi-34b", 
    dtype="float16", 
    tensor_parallel_size=4, 
    enforce_eager=True, 
    trust_remote_code=True, 
    load_format='safetensors',
    # quantization="AWQ",
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Running this script inside the ray-cluster started container. (works find without any error if i run manually by creating a container from the same image)

Issue Severity

High - Blocking My Project

ray-project / ray