ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

[Autoscaler, Serve] SSH commands not being delivered from head node to worker node #40898

Open astron8t-voyagerx opened 1 year ago

astron8t-voyagerx commented 1 year ago

What happened + What you expected to happen

I'm serving ML models with Ray Serve on Google Cloud. After upgrading Ray from 2.6.3 to 2.7.1, there is a bug when launching new worker nodes.

"file_mounts" are not being synced, and "initialization_commands" are not being executed on the worker node. Despite this, the node launch proceeds regardless of that state. The problem surfaces when I assign a Serve actor to the node: since the necessary files and setup are absent, the actor crashes. Strangely, after several failed attempts to launch fresh nodes, the SSH commands are eventually executed successfully on an intermittent basis, and the actor launches.

Versions / Dependencies

Python == 3.10
Ray == 2.7.1

Reproduction script

The Ray cluster config is generated as follows (redacted values shown as #####):

data = {
    "cluster_name": f"stable-diffusion-{RAY_STAGE}",
    "max_workers": 20,
    "upscaling_speed": 1.0,
    "idle_timeout_minutes": 1 if RAY_STAGE != "release" else 1,
    "provider": {
        "type": "gcp",
        "region": "asia-northeast3",
        "availability_zone": DIFFUSION_GOOGLE_CLOUD_ZONE,
        "cache_stopped_nodes": False,
        "project_id": #####
    },
    "auth": {
        "ssh_user": "ubuntu",
    },
    "available_node_types": {
        "ray_head_default": {
            "resources": {"Head": 1.0},
            "node_config": {
                "machineType": "e2-standard-2"
                if RAY_STAGE != "release"
                else "e2-highcpu-16",
                "disks": [
                    {
                        "boot": True,
                        "autoDelete": True,
                        "type": "PERSISTENT",
                        "initializeParams": {
                            "diskSizeGb": 20,
                            "sourceImage": #####
                        }
                    }
                ],
            },
        },
        "ray_worker_default": {   # This value can only contain lowercase letters, numeric characters, underscores and dashes. 
            "min_workers": 0 if RAY_STAGE != "release" else 1,
            "max_workers": 1 if RAY_STAGE != "release" else 4,
            "resources": {"CPU": 12.0, "GPU": 1.0},
            "node_config": {
                "machineType": "a2-highgpu-1g",
                "disks": [
                    {
                        "boot": True,
                        "autoDelete": True,
                        "type": "PERSISTENT",
                        "initializeParams": {
                            "diskSizeGb": 40,
                            "sourceImage": #####
                        }
                    }
                ],
                "scheduling": {
                    "onHostMaintenance": "TERMINATE"
                },
                "serviceAccounts": [
                    {
                        "email": "ray-autoscaler-sa-v1@#####.iam.gserviceaccount.com",
                        "scopes": [
                            "https://www.googleapis.com/auth/cloud-platform"
                        ]
                    }
                ]
            },
        },
    },
    "head_node_type": "ray_head_default",
    "file_mounts": {
        "~/service": "./service",
    },
    "cluster_synced_files": [],
    "file_mounts_sync_continuously": False,
    "rsync_exclude": [
        "**/.git",
        "**/.git/**",
    ],
    "rsync_filter": [
        ".gitignore",
    ],
    "initialization_commands": [
    ],
    "setup_commands": [
        ],
    "head_setup_commands": [
        f'echo "export STAGE={RAY_STAGE}" >> ~/.bashrc && echo "export RAY_STAGE={RAY_STAGE}" >> ~/.bashrc && echo "export EVENT_QUEUE={os.environ["EVENT_QUEUE"]}" >> ~/.bashrc && echo "export TRACE_ROUTE={os.environ["RAY_TRACE_ROUTE"]}" >> ~/.bashrc',
        'echo "export HEAD_INSTANCE_NAME=$(curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/name")" >> ~/.bashrc',
        'echo "export INSTANCE_IP=$(curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/network-interfaces/0/access-configs/0/external-ip")" >> ~/.bashrc',
        f'echo "export RAY_GRAFANA_IFRAME_HOST=http://${{INSTANCE_IP}}:3000" >> ~/.bashrc',
        "source ~/.bashrc",
        "cp -rp ./service/* ./",
        ],
    "worker_setup_commands": [
        f'echo "export STAGE={RAY_STAGE}" >> ~/.bashrc && echo "export RAY_STAGE={RAY_STAGE}" >> ~/.bashrc && echo "export EVENT_QUEUE={os.environ["EVENT_QUEUE"]}" >> ~/.bashrc && echo "export TRACE_ROUTE={os.environ["RAY_TRACE_ROUTE"]}" >> ~/.bashrc',
        "source ~/.bashrc",
        "cp -rp ./service/* ./",
        'echo "export INSTANCE_NAME=$(curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/name")" >> ~/.bashrc',
        ],
    "head_start_ray_commands": [
        "ray stop",
        "RAY_ROTATION_MAX_BYTES=256000 RAY_ROTATION_BACKUP_COUNT=0 ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0",
        "sudo cp ray-prometheus-server.service /etc/systemd/system/ray-prometheus-server.service && sudo cp ray-grafana-server.service /etc/systemd/system/ray-grafana-server.service && sudo systemctl daemon-reload",
        "sudo systemctl stop ray-prometheus-server.service && sudo systemctl start ray-prometheus-server.service",
        "sudo systemctl stop ray-grafana-server.service && sudo systemctl start ray-grafana-server.service",
    ],
    "worker_start_ray_commands": [
        "ray stop",
        "RAY_ROTATION_MAX_BYTES=256000 RAY_ROTATION_BACKUP_COUNT=0 ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076",
        f'gcloud compute instance-groups unmanaged add-instances {DIFFUSION_TARGET_INSTANCE_GROUP} --instances=$INSTANCE_NAME --zone={DIFFUSION_GOOGLE_CLOUD_ZONE} || true'
    ],
}
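
For completeness, the dict is then turned into a YAML config for the autoscaler. A minimal sketch of that step (the file name is illustrative, and this part of our tooling is not shown above):

# Sketch: dump the dict above to a YAML file and launch the cluster with
# `ray up`. File name is illustrative.
import yaml

with open("cluster.yaml", "w") as f:
    yaml.safe_dump(data, f, default_flow_style=False)

# Then, from a machine with credentials for the GCP project:
#   ray up -y cluster.yaml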

Issue Severity

High: It blocks me from completing my task.

architkulkarni commented 1 year ago

@astron8t-voyagerx Thanks for reporting this! Do you have a minimal way to reproduce the issue? How often does it happen?

Also, are you certain that the issue goes away when you go back to Ray 2.6? The reason I ask is that there are some similar-sounding issues, https://github.com/ray-project/ray/issues/39565 and https://github.com/ray-project/ray/issues/38718, but those were reported to also be present in Ray 2.6, not just 2.7. We haven't been able to reproduce the issue so far.

astron8t-voyagerx commented 10 months ago

Yes, this issue does not happen when using Ray 2.6. It happens on every version after that: 2.7, 2.8, and 2.9. What I've found so far is that it occurs when the node provider attempts to launch two or more nodes at once; it does not reproduce when only one node is launched at a time. For debugging I've checked /session_latest/logs/monitor.out; the relevant excerpt follows the sketch below.
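
For anyone trying to reproduce: a minimal sketch of forcing the autoscaler to launch two worker nodes at once, assuming a driver already connected to the head node (illustrative, not our production code):

# Sketch: ask the autoscaler for two 1-GPU bundles so the node provider
# launches two worker nodes simultaneously (each worker node type in the
# config above provides one GPU).
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")
request_resources(bundles=[{"GPU": 1}, {"GPU": 1}])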

2024-01-19 09:42:14,235 INFO updater.py:329 -- New status: waiting-for-ssh
2024-01-19 09:42:14,235 INFO updater.py:266 -- [1/7] Waiting for SSH to become available
2024-01-19 09:42:14,235 INFO updater.py:271 -- Running `uptime` as a test.
2024-01-19 09:42:14,235 INFO command_runner.py:204 -- Fetched IP: 10.178.0.88
2024-01-19 09:42:14,235 INFO log_timer.py:25 -- NodeUpdater: ray-diffusion-beta0-worker-8fb039f1-compute: Got IP  [LogTimer=0ms]
2024-01-19 09:42:14,236 VINFO command_runner.py:371 -- Running `uptime`
2024-01-19 09:42:14,236 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/501043dfa6/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@10.178.0.88 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
2024-01-19 09:42:17,361 INFO updater.py:317 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2024-01-19 09:42:20,152 INFO updater.py:329 -- New status: waiting-for-ssh
2024-01-19 09:42:20,152 INFO updater.py:266 -- [1/7] Waiting for SSH to become available
2024-01-19 09:42:20,152 INFO updater.py:271 -- Running `uptime` as a test.
2024-01-19 09:42:20,152 INFO command_runner.py:204 -- Fetched IP: 10.178.0.61
2024-01-19 09:42:20,152 INFO log_timer.py:25 -- NodeUpdater: ray-diffusion-beta0-worker-c3457c69-compute: Got IP  [LogTimer=0ms]
2024-01-19 09:42:20,153 VINFO command_runner.py:371 -- Running `uptime`
2024-01-19 09:42:20,153 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/501043dfa6/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@10.178.0.61 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
2024-01-19 09:42:20,165 INFO updater.py:317 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2024-01-19 09:42:22,366 VINFO command_runner.py:371 -- Running `uptime`
2024-01-19 09:42:22,366 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/501043dfa6/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@10.178.0.88 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
2024-01-19 09:42:22,377 INFO updater.py:317 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2024-01-19 09:42:25,172 VINFO command_runner.py:371 -- Running `uptime`
2024-01-19 09:42:25,172 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/501043dfa6/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@10.178.0.61 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
2024-01-19 09:42:25,191 INFO updater.py:317 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2024-01-19 09:42:27,379 VINFO command_runner.py:371 -- Running `uptime`
2024-01-19 09:42:27,380 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/501043dfa6/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@10.178.0.88 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
2024-01-19 09:42:27,391 INFO updater.py:317 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2024-01-19 09:42:30,195 VINFO command_runner.py:371 -- Running `uptime`
2024-01-19 09:42:30,195 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/501043dfa6/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@10.178.0.61 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
 09:42:32 up 0 min,  1 user,  load average: 0.52, 0.13, 0.04
2024-01-19 09:42:32,395 VINFO command_runner.py:371 -- Running `uptime`
2024-01-19 09:42:32,395 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/501043dfa6/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@10.178.0.88 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
2024-01-19 09:42:32,397 SUCC updater.py:285 -- Success.
2024-01-19 09:42:32,397 INFO log_timer.py:25 -- NodeUpdater: ray-diffusion-beta0-worker-c3457c69-compute: Got remote shell  [LogTimer=12245ms]
2024-01-19 09:42:32,397 INFO updater.py:379 -- Updating cluster configuration. [hash=6403cebb59737a1dcf0d5502a51c03f20eccf17f]
 09:42:34 up 0 min,  1 user,  load average: 0.27, 0.06, 0.02
2024-01-19 09:42:34,502 SUCC updater.py:285 -- Success.
2024-01-19 09:42:34,502 INFO log_timer.py:25 -- NodeUpdater: ray-diffusion-beta0-worker-8fb039f1-compute: Got remote shell  [LogTimer=20267ms]
2024-01-19 09:42:38,116 INFO updater.py:386 -- New status: syncing-files
2024-01-19 09:42:38,116 INFO updater.py:243 -- [2/7] Processing file mounts
2024-01-19 09:42:38,116 INFO updater.py:260 -- [3/7] No worker file mounts to sync
2024-01-19 09:42:43,848 INFO updater.py:397 -- New status: setting-up
2024-01-19 09:42:43,848 INFO updater.py:438 -- [4/7] No initialization commands to run.
2024-01-19 09:42:43,848 INFO updater.py:442 -- [5/7] Initializing command runner
2024-01-19 09:42:43,848 INFO updater.py:489 -- [6/7] No setup commands to run.
2024-01-19 09:42:43,848 INFO updater.py:494 -- [7/7] Starting the Ray runtime
2024-01-19 09:42:43,848 VINFO command_runner.py:371 -- Running `export RAY_OVERRIDE_RESOURCES='{"CPU":4,"GPU":1}';export RAY_HEAD_IP=10.178.0.9; ray stop`
2024-01-19 09:42:43,848 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/501043dfa6/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@10.178.0.61 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":4,"GPU":1}'"'"';export RAY_HEAD_IP=10.178.0.9; ray stop)'`
2024-01-19 09:42:44,029 INFO updater.py:379 -- Updating cluster configuration. [hash=6403cebb59737a1dcf0d5502a51c03f20eccf17f]
Did not find any active Ray processes.
2024-01-19 09:42:46,856 VINFO command_runner.py:371 -- Running `export RAY_OVERRIDE_RESOURCES='{"CPU":4,"GPU":1}';export RAY_HEAD_IP=10.178.0.9; RAY_ROTATION_MAX_BYTES=256000 RAY_ROTATION_BACKUP_COUNT=0 ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076`
2024-01-19 09:42:46,856 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/501043dfa6/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@10.178.0.61 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":4,"GPU":1}'"'"';export RAY_HEAD_IP=10.178.0.9; RAY_ROTATION_MAX_BYTES=256000 RAY_ROTATION_BACKUP_COUNT=0 ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076)'`
Local node IP: 10.178.0.61
[2024-01-19 09:42:48,690 I 1248 1248] global_state_accessor.cc:374: This node has an IP address of 10.178.0.61, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.

--------------------
Ray runtime started.
--------------------

To terminate the Ray runtime, run
  ray stop
2024-01-19 09:42:48,883 VINFO command_runner.py:371 -- Running `export RAY_OVERRIDE_RESOURCES='{"CPU":4,"GPU":1}';export RAY_HEAD_IP=10.178.0.9; gcloud compute instance-groups unmanaged add-instances vrew-diffusion-beta --instances=$INSTANCE_NAME --zone=asia-northeast3-a || true`
2024-01-19 09:42:48,884 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/501043dfa6/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@10.178.0.61 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":4,"GPU":1}'"'"';export RAY_HEAD_IP=10.178.0.9; gcloud compute instance-groups unmanaged add-instances vrew-diffusion-beta --instances=$INSTANCE_NAME --zone=asia-northeast3-a || true)'`
2024-01-19 09:42:49,753 INFO updater.py:386 -- New status: syncing-files
2024-01-19 09:42:49,754 INFO updater.py:243 -- [2/7] Processing file mounts
2024-01-19 09:42:49,754 INFO updater.py:260 -- [3/7] No worker file mounts to sync
ERROR: (gcloud.compute.instance-groups.unmanaged.add-instances) argument --instances: not enough args
Usage: gcloud compute instance-groups unmanaged add-instances NAME --instances=INSTANCE,[INSTANCE,...] [optional flags]
  optional flags may be  --help | --zone

For detailed information on this command and its flags, run:
  gcloud compute instance-groups unmanaged add-instances --help
2024-01-19 09:42:52,903 INFO log_timer.py:25 -- NodeUpdater: ray-diffusion-beta0-worker-c3457c69-compute: Ray start commands succeeded [LogTimer=9055ms]
2024-01-19 09:42:52,904 INFO log_timer.py:25 -- NodeUpdater: ray-diffusion-beta0-worker-c3457c69-compute: Applied config 6403cebb59737a1dcf0d5502a51c03f20eccf17f  [LogTimer=38457ms]
2024-01-19 09:42:55,707 INFO updater.py:397 -- New status: setting-up
2024-01-19 09:42:55,707 INFO updater.py:438 -- [4/7] No initialization commands to run.
2024-01-19 09:42:55,707 INFO updater.py:442 -- [5/7] Initializing command runner
2024-01-19 09:42:55,708 INFO updater.py:489 -- [6/7] No setup commands to run.
2024-01-19 09:42:55,708 INFO updater.py:494 -- [7/7] Starting the Ray runtime
2024-01-19 09:42:55,708 VINFO command_runner.py:371 -- Running `export RAY_OVERRIDE_RESOURCES='{"CPU":4,"GPU":1}';export RAY_HEAD_IP=10.178.0.9; ray stop`
2024-01-19 09:42:55,708 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/501043dfa6/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@10.178.0.88 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":4,"GPU":1}'"'"';export RAY_HEAD_IP=10.178.0.9; ray stop)'`
Did not find any active Ray processes.
2024-01-19 09:42:58,841 VINFO command_runner.py:371 -- Running `export RAY_OVERRIDE_RESOURCES='{"CPU":4,"GPU":1}';export RAY_HEAD_IP=10.178.0.9; RAY_ROTATION_MAX_BYTES=256000 RAY_ROTATION_BACKUP_COUNT=0 ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076`
2024-01-19 09:42:58,841 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/501043dfa6/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@10.178.0.88 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":4,"GPU":1}'"'"';export RAY_HEAD_IP=10.178.0.9; RAY_ROTATION_MAX_BYTES=256000 RAY_ROTATION_BACKUP_COUNT=0 ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076)'`
Local node IP: 10.178.0.88
[2024-01-19 09:43:00,715 I 1266 1266] global_state_accessor.cc:374: This node has an IP address of 10.178.0.88, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.