Hi @fourfireM - thanks for the report. To help debug: can you share the output of kubectl get pods? If some pods show as failed, can you paste the output of kubectl describe pod <pod-name>?
I don't see any error messages displayed, and I can always connect to my cluster via SSH. I believe there are no configuration issues, since I have no problem running my training program on a single node.
Also, the output of kubectl get pods looks normal, showing the following:
NAME READY STATUS RESTARTS AGE
mycluster-d8f4-ray-head 1/1 Running 0 14h
mycluster-d8f4-ray-worker-pkwkl 1/1 Running 0 14h
sky-ssh-jump-d8f42d5f 1/1 Running 0 4d16h
and my provision log is attached:
2024-01-09 08:52:58,123 WARNING util.py:252 -- Dropping the empty legacy field head_node. head_node is not supported for ray>=2.0.0. It is recommended to remove head_node from the cluster config.
2024-01-09 08:52:58,124 WARNING util.py:252 -- Dropping the empty legacy field worker_nodes. worker_nodes is not supported for ray>=2.0.0. It is recommended to remove worker_nodes from the cluster config.
2024-01-09 08:52:58,124 INFO util.py:375 -- setting max workers for head node type to 0
2024-01-09 08:52:58,123 INFO commands.py:308 -- ␛[37mCluster␛[39m: ␛[1mmycluster-d8f4␛[22m
2024-01-09 08:52:58,477 INFO commands.py:385 -- Checking External environment settings
I 01-09 08:52:58 config.py:327] KubernetesNodeProvider: updating existing service "mycluster-d8f4-ray-head-ssh"
I 01-09 08:52:58 config.py:327] KubernetesNodeProvider: updating existing service "mycluster-d8f4-ray-head"
I 01-09 08:52:58 kubernetes_utils.py:886] SSH Jump ServiceAccount already exists in the cluster, using it.
I 01-09 08:52:58 kubernetes_utils.py:898] SSH Jump Role already exists in the cluster, using it.
I 01-09 08:52:58 kubernetes_utils.py:910] SSH Jump RoleBinding already exists in the cluster, using it.
I 01-09 08:52:58 kubernetes_utils.py:923] SSH Jump Host sky-ssh-jump-d8f42d5f already exists in the cluster, using it.
I 01-09 08:52:58 config.py:199] KubernetesNodeProvider: using existing autoscaler_service_account "skypilot-service-account"
I 01-09 08:52:58 config.py:226] KubernetesNodeProvider: using existing autoscaler_role "skypilot-service-account-role"
I 01-09 08:52:58 config.py:260] KubernetesNodeProvider: using existing autoscaler_role_binding "skypilot-service-account-role-binding"
Warning: Permanently added '10.42.0.36' (ECDSA) to the list of known hosts.
16:53:01 up 39 days, 15:29, 1 user, load average: 1.70, 1.86, 1.90
Shared connection to 10.42.0.36 closed.
2024-01-09 08:52:58,742 INFO commands.py:705 -- Cluster Ray runtime will not be restarted due to `␛[1m--no-restart␛[22m␛[26m`.
2024-01-09 08:52:58,742 INFO commands.py:709 -- Updating cluster configuration and running setup commands. ␛[4mConfirm [y/N]:␛[24m y ␛[2m[automatic, due to --yes]␛[22m
Skipped creating a new head node.
2024-01-09 08:52:58,742 INFO commands.py:778 -- ␛[2m<1/1>␛[22m ␛[36mSetting up head node␛[39m
2024-01-09 08:52:58,744 INFO commands.py:799 -- Prepared bootstrap config
2024-01-09 08:52:58,761 INFO updater.py:324 -- ␛[37mNew status␛[39m: ␛[1mwaiting-for-ssh␛[22m
2024-01-09 08:52:58,762 INFO updater.py:261 -- ␛[2m[1/7]␛[22m ␛[36mWaiting for SSH to become available␛[39m
2024-01-09 08:52:58,762 INFO updater.py:266 -- Running `␛[1muptime␛[22m␛[26m` as a test.
2024-01-09 08:52:58,773 INFO command_runner.py:204 -- ␛[37mFetched IP␛[39m: ␛[1m10.42.0.36␛[22m
2024-01-09 08:52:58,773 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Got IP [LogTimer=6ms]
2024-01-09 08:52:58,773 VINFO command_runner.py:371 -- Running `␛[1muptime␛[22m␛[26m`
2024-01-09 08:52:58,773 VVINFO command_runner.py:373 -- Full command is `␛[1mssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' -o Port=22 -o ConnectTimeout=10s sky@10.42.0.36 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'␛[22m␛[26m`
Shared connection to 10.42.0.36 closed.
2024-01-09 08:53:01,849 SUCC updater.py:280 -- ␛[32mSuccess.␛[39m
2024-01-09 08:53:01,849 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Got remote shell [LogTimer=3087ms]
2024-01-09 08:53:01,856 INFO updater.py:374 -- Updating cluster configuration.␛[0m␛[2m [hash=d70749231af87810bb73105e12cbf919596a5b9a]␛[22m␛[0m
2024-01-09 08:53:01,872 INFO updater.py:381 -- ␛[37mNew status␛[39m: ␛[1msyncing-files␛[22m
2024-01-09 08:53:01,873 INFO updater.py:238 -- ␛[2m[2/7]␛[22m ␛[36mProcessing file mounts␛[39m
2024-01-09 08:53:01,873 VINFO command_runner.py:371 -- Running `␛[1mmkdir -p ~/.sky/.runtime_files␛[22m␛[26m`
2024-01-09 08:53:01,873 VVINFO command_runner.py:373 -- Full command is `␛[1mssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' -o Port=22 -o ConnectTimeout=120s sky@10.42.0.36 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p ~/.sky/.runtime_files)'␛[22m␛[26m`
sending incremental file list
./
3ec3c333-9fe0-4a8f-a70a-486e2b386a00
cf929298-471e-46a2-b3a6-b6c2b6c85152
0207063b-b3d9-43d0-9b85-8a6cf9de56ee/
0207063b-b3d9-43d0-9b85-8a6cf9de56ee/skypilot-1.0.0.dev0-py3-none-any.whl
sent 795,540 bytes received 88 bytes 530,418.67 bytes/sec
total size is 821,111 speedup is 1.03
2024-01-09 08:53:02,977 VINFO command_runner.py:414 -- Running `␛[1mrsync --rsh ssh -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o "ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' " -o Port=22 -o ConnectTimeout=120s -avz /tmp/tmpp62_uhbu/ sky@10.42.0.36:~/.sky/.runtime_files/␛[22m␛[26m`
Shared connection to 10.42.0.36 closed.
2024-01-09 08:53:03,103 VINFO updater.py:536 -- `rsync`ed ␛[1m/tmp/tmpp62_uhbu/␛[22m␛[26m (local) to ␛[1m~/.sky/.runtime_files/␛[22m␛[26m (remote)
2024-01-09 08:53:03,103 INFO updater.py:233 -- ␛[1m~/.sky/.runtime_files/␛[22m␛[26m from ␛[1m/tmp/tmpp62_uhbu/␛[22m␛[26m
2024-01-09 08:53:03,103 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Synced /tmp/tmpp62_uhbu/ to ~/.sky/.runtime_files/ [LogTimer=1230ms]
2024-01-09 08:53:03,103 VINFO command_runner.py:371 -- Running `␛[1mmkdir -p ~␛[22m␛[26m`
2024-01-09 08:53:03,103 VVINFO command_runner.py:373 -- Full command is `␛[1mssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' -o Port=22 -o ConnectTimeout=120s sky@10.42.0.36 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p ~)'␛[22m␛[26m`
sending incremental file list
ray-bootstrap-eo3xo9hr
sent 823 bytes received 155 bytes 1,956.00 bytes/sec
total size is 13,893 speedup is 14.21
2024-01-09 08:53:04,221 VINFO command_runner.py:414 -- Running `␛[1mrsync --rsh ssh -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o "ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' " -o Port=22 -o ConnectTimeout=120s -avz /tmp/ray-bootstrap-eo3xo9hr sky@10.42.0.36:~/ray_bootstrap_config.yaml␛[22m␛[26m`
Shared connection to 10.42.0.36 closed.
2024-01-09 08:53:04,276 VINFO updater.py:536 -- `rsync`ed ␛[1m/tmp/ray-bootstrap-eo3xo9hr␛[22m␛[26m (local) to ␛[1m~/ray_bootstrap_config.yaml␛[22m␛[26m (remote)
2024-01-09 08:53:04,277 INFO updater.py:233 -- ␛[1m~/ray_bootstrap_config.yaml␛[22m␛[26m from ␛[1m/tmp/ray-bootstrap-eo3xo9hr␛[22m␛[26m
2024-01-09 08:53:04,277 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Synced /tmp/ray-bootstrap-eo3xo9hr to ~/ray_bootstrap_config.yaml [LogTimer=1174ms]
2024-01-09 08:53:04,277 VINFO command_runner.py:371 -- Running `␛[1mmkdir -p ~␛[22m␛[26m`
2024-01-09 08:53:04,277 VVINFO command_runner.py:373 -- Full command is `␛[1mssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' -o Port=22 -o ConnectTimeout=120s sky@10.42.0.36 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p ~)'␛[22m␛[26m`
sending incremental file list
sent 46 bytes received 12 bytes 116.00 bytes/sec
total size is 1,678 speedup is 28.93
2024-01-09 08:53:05,356 VINFO command_runner.py:414 -- Running `␛[1mrsync --rsh ssh -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o "ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' " -o Port=22 -o ConnectTimeout=120s -avz /root/.ssh/sky-key sky@10.42.0.36:~/ray_bootstrap_key.pem␛[22m␛[26m`
2024-01-09 08:53:05,410 VINFO updater.py:536 -- `rsync`ed ␛[1m/root/.ssh/sky-key␛[22m␛[26m (local) to ␛[1m~/ray_bootstrap_key.pem␛[22m␛[26m (remote)
2024-01-09 08:53:05,410 INFO updater.py:233 -- ␛[1m~/ray_bootstrap_key.pem␛[22m␛[26m from ␛[1m/root/.ssh/sky-key␛[22m␛[26m
2024-01-09 08:53:05,411 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Synced /root/.ssh/sky-key to ~/ray_bootstrap_key.pem [LogTimer=1134ms]
2024-01-09 08:53:05,411 INFO updater.py:255 -- ␛[2m[3/7]␛[22m No worker file mounts to sync
2024-01-09 08:53:05,430 INFO updater.py:392 -- ␛[37mNew status␛[39m: ␛[1msetting-up␛[22m
2024-01-09 08:53:05,430 INFO updater.py:433 -- ␛[2m[4/7]␛[22m No initialization commands to run.
2024-01-09 08:53:05,431 INFO updater.py:437 -- ␛[2m[5/7]␛[22m ␛[36mInitializing command runner␛[39m
2024-01-09 08:53:05,431 INFO updater.py:448 -- ␛[2m[6/7]␛[22m ␛[36mRunning setup commands␛[39m
2024-01-09 08:53:05,431 INFO updater.py:470 -- ␛[2m(0/1)␛[22m ␛[1m(mkdir -p ~/.sky && cp -r ~/.sky/.runtime_files/cf929298-471e-46a2-b3a6-b6c2b6c85152 ~/.sky/sky_ray.yml) && (mkdir -p ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52 && cp -r ~/.sky/.runtime_files/0207063b-b3d9-43d0-9b85-8a6cf9de56ee/* ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52) && (mkdir -p ~/.kube && cp -r ~/.sky/.runtime_files/3ec3c333-9fe0-4a8f-a70a-486e2b386a00 ~/.kube/config); mkdir -p ~/.ssh; touch ~/.ssh/config; pip3 --version > /dev/null 2>&1 || (curl -sSL https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py && echo "PATH=$HOME/.local/bin:$PATH" >> ~/.bashrc); (type -a python | grep -q python3) || echo 'alias python=python3' >> ~/.bashrc; (type -a pip | grep -q pip3) || echo 'alias pip=pip3' >> ~/.bashrc; which conda > /dev/null 2>&1 || (wget -nc https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh -O Miniconda3-Linux-x86_64.sh && bash Miniconda3-Linux-x86_64.sh -b && eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true); which conda | grep /opt/conda || conda init > /dev/null; source ~/.bashrc; mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app && touch ~/.sudo_as_admin_successful; (pip3 list | grep skypilot && [ "$(cat ~/.sky/wheels/current_sky_wheel_hash)" == "ef67dfdf09e66a21bcf1a5727a379f52" ]) || (pip3 uninstall skypilot -y; pip3 install "$(echo ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52/skypilot-1.0.0.dev0*.whl)[remote]" && echo "ef67dfdf09e66a21bcf1a5727a379f52" > ~/.sky/wheels/current_sky_wheel_hash || exit 1); sudo bash -c 'rm -rf /etc/security/limits.d; echo "* soft nofile 1048576" >> /etc/security/limits.conf; echo "* hard nofile 1048576" >> /etc/security/limits.conf'; sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload; mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n" >> ~/.ssh/config; python3 -c "from sky.skylet.ray_patches import patch; patch()" || exit 1; [ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');␛[22m␛[26m
2024-01-09 08:53:05,431 VINFO command_runner.py:371 -- Running `␛[1m(mkdir -p ~/.sky && cp -r ~/.sky/.runtime_files/cf929298-471e-46a2-b3a6-b6c2b6c85152 ~/.sky/sky_ray.yml) && (mkdir -p ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52 && cp -r ~/.sky/.runtime_files/0207063b-b3d9-43d0-9b85-8a6cf9de56ee/* ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52) && (mkdir -p ~/.kube && cp -r ~/.sky/.runtime_files/3ec3c333-9fe0-4a8f-a70a-486e2b386a00 ~/.kube/config); mkdir -p ~/.ssh; touch ~/.ssh/config; pip3 --version > /dev/null 2>&1 || (curl -sSL https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py && echo "PATH=$HOME/.local/bin:$PATH" >> ~/.bashrc); (type -a python | grep -q python3) || echo 'alias python=python3' >> ~/.bashrc; (type -a pip | grep -q pip3) || echo 'alias pip=pip3' >> ~/.bashrc; which conda > /dev/null 2>&1 || (wget -nc https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh -O Miniconda3-Linux-x86_64.sh && bash Miniconda3-Linux-x86_64.sh -b && eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true); which conda | grep /opt/conda || conda init > /dev/null; source ~/.bashrc; mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app && touch ~/.sudo_as_admin_successful; (pip3 list | grep skypilot && [ "$(cat ~/.sky/wheels/current_sky_wheel_hash)" == "ef67dfdf09e66a21bcf1a5727a379f52" ]) || (pip3 uninstall skypilot -y; pip3 install "$(echo ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52/skypilot-1.0.0.dev0*.whl)[remote]" && echo "ef67dfdf09e66a21bcf1a5727a379f52" > ~/.sky/wheels/current_sky_wheel_hash || exit 1); sudo bash -c 'rm -rf /etc/security/limits.d; echo "* soft nofile 1048576" >> /etc/security/limits.conf; echo "* hard nofile 1048576" >> /etc/security/limits.conf'; sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload; mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n" >> ~/.ssh/config; python3 -c "from sky.skylet.ray_patches import patch; patch()" || exit 1; [ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');␛[22m␛[26m`
␛[01;31m␛[Kskypilot␛[m␛[K 1.0.0.dev0
DefaultTasksMax=infinity
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/log_monitor.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/log_monitor.py-v2.4.0.orig)
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py-v2.4.0.orig)
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py-v2.4.0.orig)
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py-v2.4.0.orig)
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py-v2.4.0.orig)
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py-v2.4.0.orig)
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py-v2.4.0.orig)
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py-v2.4.0.orig)
Shared connection to 10.42.0.36 closed.
␛[0m2024-01-09 08:53:05,431 VVINFO command_runner.py:373 -- Full command is `␛[1mssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' -o Port=22 -o ConnectTimeout=120s sky@10.42.0.36 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && ((mkdir -p ~/.sky && cp -r ~/.sky/.runtime_files/cf929298-471e-46a2-b3a6-b6c2b6c85152 ~/.sky/sky_ray.yml) && (mkdir -p ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52 && cp -r ~/.sky/.runtime_files/0207063b-b3d9-43d0-9b85-8a6cf9de56ee/* ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52) && (mkdir -p ~/.kube && cp -r ~/.sky/.runtime_files/3ec3c333-9fe0-4a8f-a70a-486e2b386a00 ~/.kube/config); mkdir -p ~/.ssh; touch ~/.ssh/config; pip3 --version > /dev/null 2>&1 || (curl -sSL https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py && echo "PATH=$HOME/.local/bin:$PATH" >> ~/.bashrc); (type -a python | grep -q python3) || echo '"'"'alias python=python3'"'"' >> ~/.bashrc; (type -a pip | grep -q pip3) || echo '"'"'alias pip=pip3'"'"' >> ~/.bashrc; which conda > /dev/null 2>&1 || (wget -nc https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh -O Miniconda3-Linux-x86_64.sh && bash Miniconda3-Linux-x86_64.sh -b && eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true); which conda | grep /opt/conda || conda init > /dev/null; source ~/.bashrc; mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app && touch ~/.sudo_as_admin_successful; (pip3 list | grep skypilot && [ "$(cat ~/.sky/wheels/current_sky_wheel_hash)" == "ef67dfdf09e66a21bcf1a5727a379f52" ]) || (pip3 uninstall skypilot -y; pip3 install "$(echo ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52/skypilot-1.0.0.dev0*.whl)[remote]" && echo "ef67dfdf09e66a21bcf1a5727a379f52" > ~/.sky/wheels/current_sky_wheel_hash || exit 1); sudo bash -c '"'"'rm -rf /etc/security/limits.d; echo "* soft nofile 1048576" >> /etc/security/limits.conf; echo "* hard nofile 1048576" >> /etc/security/limits.conf'"'"'; sudo grep -e '"'"'^DefaultTasksMax'"'"' /etc/systemd/system.conf || (sudo bash -c '"'"'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'"'"'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload; mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n" >> ~/.ssh/config; python3 -c "from sky.skylet.ray_patches import patch; patch()" || exit 1; [ -f /etc/fuse.conf ] && sudo sed -i '"'"'s/#user_allow_other/user_allow_other/g'"'"' /etc/fuse.conf || (sudo sh -c '"'"'echo "user_allow_other" > /etc/fuse.conf'"'"');)'␛[22m␛[26m`
2024-01-09 08:53:09,350 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Setup commands succeeded [LogTimer=3920ms]
2024-01-09 08:53:09,350 INFO updater.py:489 -- ␛[2m[7/7]␛[22m ␛[36mStarting the Ray runtime␛[39m
2024-01-09 08:53:09,350 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Ray start commands succeeded [LogTimer=0ms]
2024-01-09 08:53:09,350 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Applied config d70749231af87810bb73105e12cbf919596a5b9a [LogTimer=10606ms]
2024-01-09 08:53:09,370 INFO updater.py:188 -- ␛[37mNew status␛[39m: ␛[1mup-to-date␛[22m
2024-01-09 08:53:09,376 INFO commands.py:868 -- ␛[36mUseful commands␛[39m
2024-01-09 08:53:09,376 INFO commands.py:870 -- Monitor autoscaling with
2024-01-09 08:53:09,376 INFO commands.py:871 -- ␛[1m ray exec /root/.sky/generated/mycluster.yml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'␛[22m
2024-01-09 08:53:09,376 INFO commands.py:878 -- Connect to a terminal on the cluster head:
2024-01-09 08:53:09,376 INFO commands.py:879 -- ␛[1m ray attach /root/.sky/generated/mycluster.yml␛[22m
2024-01-09 08:53:09,376 INFO commands.py:882 -- Get a remote shell to the cluster manually:
2024-01-09 08:53:09,376 INFO commands.py:883 -- ssh -o IdentitiesOnly=yes -i ~/.ssh/sky-key sky@10.42.0.36
Warning: Permanently added '127.0.0.1' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.42.0.36' (ECDSA) to the list of known hosts.
======== Autoscaler status: 2024-01-08 16:53:10.671846 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray_head_default
1 ray_worker_default
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/8.0 A10
0.0/140.0 CPU
0.0/8.0 GPU
0B/560.00GiB memory
0B/953.67MiB object_store_memory
Demands:
(no resource demands)
and my YAML is like this:
resources:
  # Optional; if left out, automatically pick the cheapest cloud.
  cloud: kubernetes
  # 1x NVIDIA V100 GPU
  accelerators: a10:4

num_nodes: 2

# file_mounts:
#   /datasets: /nas/data3/public/public_data/coco

# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: /root/SparseR-CNN

# Typical use: pip install -r requirements.txt
# Invoked under the workdir (i.e., can use its files).
setup: |
  echo "Running setup."
  # git clone https://github.com/PeizeSun/SparseR-CNN.git
  pip install torch==2.0.0 torchvision==0.15.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
  pip install scipy
  python setup.py build develop

# Typical use: make use of resources, such as running training.
# Invoked under the workdir (i.e., can use its files).
run: |
  echo "Hello, SkyPilot!"
  num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
  master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
  port=12355
  url=tcp://${master_addr}:${port}
  echo $master_addr
  echo $url
  export OMP_NUM_THREADS=70
  export CUDA_VISIBLE_DEVICES=0,1,2,3
  export NCCL_SOCKET_IFNAME=eth0
  export NCCL_IB_DISABLE=0
  export NCCL_DEBUG=INFO
  export GLOO_SOCKET_IFNAME=eth0
  python projects/SparseRCNN/train_net.py --num-gpus ${SKYPILOT_NUM_GPUS_PER_NODE} \
    --config-file projects/SparseRCNN/configs/sparsercnn.res50.100pro.3x.yaml \
    --machine-rank ${SKYPILOT_NODE_RANK} --num-machines $num_nodes --dist-url $url
and it is an on-prem Kubernetes deployment.
Normally, after loading the model, my training script starts training. Some of the output under normal circumstances looks like this (it works fine for both single-node and local multi-machine training):
[01/08 17:01:22 d2.checkpoint.c2_model_loading]: The checkpoint state_dict contains keys that are not used by the model:
(head, rank=0, pid=2701) stem.fc.{bias, weight}
[01/08 17:22:51 d2.engine.train_loop]: Starting training from iteration 0
and then the training information (losses, etc.) follows.
But now it stops after loading the model, like this:
[01/08 17:01:22 d2.checkpoint.c2_model_loading]: The checkpoint state_dict contains keys that are not used by the model:
(head, rank=0, pid=2701) stem.fc.{bias, weight}
(There is no further output, but GPU and CPU monitoring shows resources still in use, as if it were still running.)
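A minimal rendezvous smoke test can help tell whether the hang is in the distributed setup (rendezvous/collectives) or in the model code itself. This is only a sketch, not part of SparseR-CNN; the RANK, WORLD_SIZE, and DIST_URL variables are assumed here to be exported by the run: section (SkyPilot itself provides SKYPILOT_NODE_IPS, SKYPILOT_NODE_RANK, and SKYPILOT_NUM_GPUS_PER_NODE):

# smoke_test.py - hypothetical helper, not from the repo.
import os
import torch
import torch.distributed as dist

def main():
    # RANK/WORLD_SIZE/DIST_URL are assumed to be exported by the run: section,
    # e.g. derived from SKYPILOT_NODE_RANK and SKYPILOT_NODE_IPS.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group(backend="gloo",
                            init_method=os.environ["DIST_URL"],
                            rank=rank,
                            world_size=world_size)
    t = torch.ones(1) * rank
    dist.all_reduce(t)  # if the nodes cannot reach each other, the job hangs here
    print(f"rank {rank}/{world_size}: all_reduce result = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()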
I suspect it's a CPU limit issue. However, when I launch a cluster with 70 CPUs and export OMP_NUM_THREADS=70 in the YAML, nothing seems to change.
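(One way to check whether Kubernetes actually applied a CPU limit to the pod - a sketch using the pod name from the kubectl get pods output above, and assuming cgroup v1 inside the container:)

kubectl get pod mycluster-d8f4-ray-head -o jsonpath='{.spec.containers[*].resources}'
# Inside the pod, a cpu.cfs_quota_us of -1 means "no CPU limit" (cgroup v1):
kubectl exec mycluster-d8f4-ray-head -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us /sys/fs/cgroup/cpu/cpu.cfs_period_us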
Thanks for the details @fourfireM - I haven't come across this issue before. Can you try bumping up memory and CPU using the --memory and --cpus flags to sky launch?
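For example (the values here are only placeholders; adjust them to what your nodes can offer):

sky launch -c mycluster task.yaml --cpus 32+ --memory 128+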
Meanwhile, I will also try reproducing it. Should I use your YAML with the code from https://github.com/PeizeSun/SparseR-CNN.git and run sky launch task.yaml? Or are there other steps I should follow?
Thank you for your answer and help. I am using the code from https://github.com/PeizeSun/SparseR-CNN.git to run the training; the run: section in the YAML is also based on SparseR-CNN. I would appreciate it if you could try running the code.
In fact there is nothing wrong with the project code itself; it just seems to be far too slow when using multiple nodes. I let the training keep running and it took about 8 hours to finish the first iteration of the first epoch.
Is there anything I can do to improve the multi-node training? By the way, I've tried using the --memory and --cpus flags to sky launch as you suggested.
Hmm, that's interesting. I've previously run multi-GPU multi-node training (the NeMo example) on a GKE cluster and I recall it working fine.
@landscapepainter is investigating this now. One blind guess would be to check whether ulimits are set too low, and whether increasing them helps.
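(For reference, ulimits can be checked on both the head and the worker pod directly - pod names taken from the kubectl get pods output above:)

kubectl exec mycluster-d8f4-ray-head -- sh -c 'ulimit -a'
kubectl exec mycluster-d8f4-ray-worker-pkwkl -- sh -c 'ulimit -a'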
Thanks for the reply. For reference, this is my training output on a single node (time: 0.4544):
(worker1, rank=1, pid=1728, ip=10.42.2.30) [01/09 02:37:02 d2.utils.events]: eta: 1 day, 10:05:09 iter: 19 total_loss: 60.7 loss_ce: 2.071 loss_giou: 2.395 loss_bbox: 8.479 loss_ce_0: 2.114 loss_giou_0: 2.014 loss_bbox_0: 4.383 loss_ce_1: 2.232
loss_giou_1: 1.957 loss_bbox_1: 4.273 loss_ce_2: 2.185 loss_giou_2: 2.248 loss_bbox_2: 4.196 loss_ce_3: 2.286 loss_giou_3: 2.2 loss_bbox_3: 6.94 loss_ce_4: 2.15 loss_giou_4: 2.238 loss_bbox_4: 6.638 time: 0.4544 data_time: 0.3754 lr: 7.2025e-07 max_mem: 2873M
This is my training output on multiple nodes (time: 1386.3151):
(head, rank=0, pid=8259) [01/09 10:46:29 d2.utils.events]: eta: 4333 days, 21:08:17 iter: 19 total_loss: 60.84 loss_ce: 2.087 loss_giou: 2.371 loss_bbox: 8.264 loss_ce_0: 2.09 loss_giou_0: 2.044 loss_bbox_0: 4.391 loss_ce_1: 2.253
loss_giou_1: 2.033 loss_bbox_1: 4.141 loss_ce_2: 2.209 loss_giou_2: 2.225 loss_bbox_2: 4.027 loss_ce_3: 2.306 loss_giou_3: 2.27 loss_bbox_3: 6.824 loss_ce_4: 2.081 loss_giou_4: 2.221 loss_bbox_4: 6.377 time: 1386.3151 data_time: 0.1655 lr: 7.2025e-07 max_mem: 2907
We can see that the per-iteration time is off by a very large factor (roughly 1386.3 s vs 0.45 s per iteration, about 3000x slower).
In addition, I used SSH to enter the cluster head node and ran ulimit -a, which gave the following output,
(base) sky@mycluster-d8f4-ray-head:~$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2579663
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
so I hope this can serve as some kind of reference.
@fourfireM Thanks for reporting this issue! Do you mind sharing which versions of torch and torchvision are currently being used in your setting? I was trying to reproduce with the task YAML you provided, which installs torch and torchvision with pip install torch==2.0.0 torchvision==0.15.1 -i https://pypi.tuna.tsinghua.edu.cn/simple, but I encountered an error running this YAML due to a torchvision ImportError. I'm assuming the versions of pytorch and torchvision being used are different. They can be checked by running:
import torch
print(torch.__version__)
import torchvision
print(torchvision.__version__)
@landscapepainter In fact, -i https://pypi.tuna.tsinghua.edu.cn/simple is just the mirror I used for the download; you can simply remove it.
For reference, my versions of torch and torchvision are as follows:
>>> import torch
>>> import torchvision
>>> print(torch.__version__)
2.0.0+cu117
>>> print(torchvision.__version__)
0.15.1+cu117
@fourfireM Thanks for the confirmation! It seems we are using the same versions. I'm still trying to reproduce your error. I have one suggestion and a question:
Suggestion: the Ray job scheduler uses CUDA_VISIBLE_DEVICES to assign jobs to GPUs, so it's possible that more than one training job is running on those GPUs if you specify export CUDA_VISIBLE_DEVICES=0,1,2,3 yourself. Do you mind trying it with that line removed, to see if there's a speed boost?
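A quick way to see what each process actually gets once the scheduler assigns the devices (just a sanity-check snippet, in the same spirit as the version check above):

import os
import torch
# With the manual export removed, this should show the value set by the scheduler.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPU count    =", torch.cuda.device_count())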
Question: While trying to reproduce your result, I'm encountering the following AttributeError; it seems that torch 2.0.0 no longer has the _sync_params_and_buffers attribute on DistributedDataParallel. What was your workaround?
(head, rank=0, pid=2972) Traceback (most recent call last):
(head, rank=0, pid=2972) File "/home/sky/sky_workdir/projects/SparseRCNN/train_net.py", line 153, in <module>
(head, rank=0, pid=2972) launch(
(head, rank=0, pid=2972) File "/home/sky/sky_workdir/detectron2/engine/launch.py", line 55, in launch
(head, rank=0, pid=2972) mp.spawn(
(head, rank=0, pid=2972) File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
(head, rank=0, pid=2972) return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
(head, rank=0, pid=2972) File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
(head, rank=0, pid=2972) while not context.join():
(head, rank=0, pid=2972) File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
(head, rank=0, pid=2972) raise ProcessRaisedException(msg, error_index, failed_process.pid)
(head, rank=0, pid=2972) torch.multiprocessing.spawn.ProcessRaisedException:
(head, rank=0, pid=2972)
(head, rank=0, pid=2972) -- Process 1 terminated with the following error:
(head, rank=0, pid=2972) Traceback (most recent call last):
(head, rank=0, pid=2972) File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
(head, rank=0, pid=2972) fn(i, *args)
(head, rank=0, pid=2972) File "/home/sky/sky_workdir/detectron2/engine/launch.py", line 94, in _distributed_worker
(head, rank=0, pid=2972) main_func(*args)
(head, rank=0, pid=2972) File "/home/sky/sky_workdir/projects/SparseRCNN/train_net.py", line 146, in main
(head, rank=0, pid=2972) trainer.resume_or_load(resume=args.resume)
(head, rank=0, pid=2972) File "/home/sky/sky_workdir/detectron2/engine/defaults.py", line 334, in resume_or_load
(head, rank=0, pid=2972) self.model._sync_params_and_buffers()
(head, rank=0, pid=2972) File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
(head, rank=0, pid=2972) raise AttributeError("'{}' object has no attribute '{}'".format(
(head, rank=0, pid=2972) AttributeError: 'DistributedDataParallel' object has no attribute '_sync_params_and_buffers'
I have encountered this problem; it is caused by the pytorch version being too high. My way of dealing with it was to comment out lines 333-335 in detectron2/engine/defaults.py, as shown below; after that the code works fine. @landscapepainter
330         if isinstance(self.model, DistributedDataParallel):
331             # broadcast loaded data/model from the first rank, because other
332             # machines may not have access to the checkpoint file
333             # if TORCH_VERSION >= (1, 7):
334             #     self.model._sync_params_and_buffers()
335             self.start_iter = comm.all_gather(self.start_iter)[0]
336
337     def build_hooks(self):
338         # ... code ...
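For what it's worth, instead of dropping the synchronization entirely, a similar effect could be achieved with public torch.distributed APIs. This is only a sketch (a hypothetical helper, not detectron2 code) of broadcasting rank 0's weights to the other ranks:

import torch.distributed as dist

def broadcast_model_from_rank0(ddp_model):
    # ddp_model is a torch.nn.parallel.DistributedDataParallel instance;
    # broadcast every parameter and buffer in place so all ranks start
    # from the weights loaded on rank 0.
    for tensor in list(ddp_model.module.parameters()) + list(ddp_model.module.buffers()):
        dist.broadcast(tensor.data, src=0)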
Did you run into the same problem when you tried to run it? I'm debugging the code internally right now, and I'm finding that the multi-node run gets stuck while merging the model weights. @landscapepainter
@fourfireM Thanks for sharing what you discovered! I haven't made much progress yet since my last reply. I'll get back to this as soon as possible with what you provided in the previous comment. Meanwhile, if you happen to discover any other related problems, please share it with us! I absolutely appreciate your insights.
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This issue was closed because it has been stalled for 10 days with no activity.
I'm trying to do distributed training across multiple machines and GPUs after bringing up a multi-node cluster with Kubernetes (k8s). However, my training always stops before the training loop starts, without reporting an error or giving any further indication. What could be the cause of this? I currently suspect the CPU allocation is too low; what would need to be done to fix this? I set the OMP_NUM_THREADS environment variable following some of the tips in the replies, but nothing seems to change.