Hi @fourfireM - thanks for the report. To help debug: can you share the output of kubectl get pods? If some pods show as failed, can you paste the output of kubectl describe pod <pod-name>?
I don't see any error messages displayed, and I can always connect to my cluster via SSH. I believe there are no configuration issues, since I have no problem running my training program on a single node.
Also, the output of kubectl get pods looks normal, showing the following:
NAME READY STATUS RESTARTS AGE
mycluster-d8f4-ray-head 1/1 Running 0 14h
mycluster-d8f4-ray-worker-pkwkl 1/1 Running 0 14h
sky-ssh-jump-d8f42d5f 1/1 Running 0 4d16h
and my provision log is attached:
2024-01-09 08:52:58,123 WARNING util.py:252 -- Dropping the empty legacy field head_node. head_node is not supported for ray>=2.0.0. It is recommended to remove head_node from the cluster config.
2024-01-09 08:52:58,124 WARNING util.py:252 -- Dropping the empty legacy field worker_nodes. worker_nodes is not supported for ray>=2.0.0. It is recommended to remove worker_nodes from the cluster config.
2024-01-09 08:52:58,124 INFO util.py:375 -- setting max workers for head node type to 0
2024-01-09 08:52:58,123 INFO commands.py:308 -- ␛[37mCluster␛[39m: ␛[1mmycluster-d8f4␛[22m
2024-01-09 08:52:58,477 INFO commands.py:385 -- Checking External environment settings
I 01-09 08:52:58 config.py:327] KubernetesNodeProvider: updating existing service "mycluster-d8f4-ray-head-ssh"
I 01-09 08:52:58 config.py:327] KubernetesNodeProvider: updating existing service "mycluster-d8f4-ray-head"
I 01-09 08:52:58 kubernetes_utils.py:886] SSH Jump ServiceAccount already exists in the cluster, using it.
I 01-09 08:52:58 kubernetes_utils.py:898] SSH Jump Role already exists in the cluster, using it.
I 01-09 08:52:58 kubernetes_utils.py:910] SSH Jump RoleBinding already exists in the cluster, using it.
I 01-09 08:52:58 kubernetes_utils.py:923] SSH Jump Host sky-ssh-jump-d8f42d5f already exists in the cluster, using it.
I 01-09 08:52:58 config.py:199] KubernetesNodeProvider: using existing autoscaler_service_account "skypilot-service-account"
I 01-09 08:52:58 config.py:226] KubernetesNodeProvider: using existing autoscaler_role "skypilot-service-account-role"
I 01-09 08:52:58 config.py:260] KubernetesNodeProvider: using existing autoscaler_role_binding "skypilot-service-account-role-binding"
Warning: Permanently added '10.42.0.36' (ECDSA) to the list of known hosts.
16:53:01 up 39 days, 15:29, 1 user, load average: 1.70, 1.86, 1.90
Shared connection to 10.42.0.36 closed.
2024-01-09 08:52:58,742 INFO commands.py:705 -- Cluster Ray runtime will not be restarted due to `␛[1m--no-restart␛[22m␛[26m`.
2024-01-09 08:52:58,742 INFO commands.py:709 -- Updating cluster configuration and running setup commands. ␛[4mConfirm [y/N]:␛[24m y ␛[2m[automatic, due to --yes]␛[22m
Skipped creating a new head node.
2024-01-09 08:52:58,742 INFO commands.py:778 -- ␛[2m<1/1>␛[22m ␛[36mSetting up head node␛[39m
2024-01-09 08:52:58,744 INFO commands.py:799 -- Prepared bootstrap config
2024-01-09 08:52:58,761 INFO updater.py:324 -- ␛[37mNew status␛[39m: ␛[1mwaiting-for-ssh␛[22m
2024-01-09 08:52:58,762 INFO updater.py:261 -- ␛[2m[1/7]␛[22m ␛[36mWaiting for SSH to become available␛[39m
2024-01-09 08:52:58,762 INFO updater.py:266 -- Running `␛[1muptime␛[22m␛[26m` as a test.
2024-01-09 08:52:58,773 INFO command_runner.py:204 -- ␛[37mFetched IP␛[39m: ␛[1m10.42.0.36␛[22m
2024-01-09 08:52:58,773 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Got IP [LogTimer=6ms]
2024-01-09 08:52:58,773 VINFO command_runner.py:371 -- Running `␛[1muptime␛[22m␛[26m`
2024-01-09 08:52:58,773 VVINFO command_runner.py:373 -- Full command is `␛[1mssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' -o Port=22 -o ConnectTimeout=10s sky@10.42.0.36 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'␛[22m␛[26m`
Shared connection to 10.42.0.36 closed.
2024-01-09 08:53:01,849 SUCC updater.py:280 -- ␛[32mSuccess.␛[39m
2024-01-09 08:53:01,849 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Got remote shell [LogTimer=3087ms]
2024-01-09 08:53:01,856 INFO updater.py:374 -- Updating cluster configuration.␛[0m␛[2m [hash=d70749231af87810bb73105e12cbf919596a5b9a]␛[22m␛[0m
2024-01-09 08:53:01,872 INFO updater.py:381 -- ␛[37mNew status␛[39m: ␛[1msyncing-files␛[22m
2024-01-09 08:53:01,873 INFO updater.py:238 -- ␛[2m[2/7]␛[22m ␛[36mProcessing file mounts␛[39m
2024-01-09 08:53:01,873 VINFO command_runner.py:371 -- Running `␛[1mmkdir -p ~/.sky/.runtime_files␛[22m␛[26m`
2024-01-09 08:53:01,873 VVINFO command_runner.py:373 -- Full command is `␛[1mssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' -o Port=22 -o ConnectTimeout=120s sky@10.42.0.36 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p ~/.sky/.runtime_files)'␛[22m␛[26m`
sending incremental file list
./
3ec3c333-9fe0-4a8f-a70a-486e2b386a00
cf929298-471e-46a2-b3a6-b6c2b6c85152
0207063b-b3d9-43d0-9b85-8a6cf9de56ee/
0207063b-b3d9-43d0-9b85-8a6cf9de56ee/skypilot-1.0.0.dev0-py3-none-any.whl
sent 795,540 bytes received 88 bytes 530,418.67 bytes/sec
total size is 821,111 speedup is 1.03
2024-01-09 08:53:02,977 VINFO command_runner.py:414 -- Running `␛[1mrsync --rsh ssh -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o "ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' " -o Port=22 -o ConnectTimeout=120s -avz /tmp/tmpp62_uhbu/ sky@10.42.0.36:~/.sky/.runtime_files/␛[22m␛[26m`
Shared connection to 10.42.0.36 closed.
2024-01-09 08:53:03,103 VINFO updater.py:536 -- `rsync`ed ␛[1m/tmp/tmpp62_uhbu/␛[22m␛[26m (local) to ␛[1m~/.sky/.runtime_files/␛[22m␛[26m (remote)
2024-01-09 08:53:03,103 INFO updater.py:233 -- ␛[1m~/.sky/.runtime_files/␛[22m␛[26m from ␛[1m/tmp/tmpp62_uhbu/␛[22m␛[26m
2024-01-09 08:53:03,103 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Synced /tmp/tmpp62_uhbu/ to ~/.sky/.runtime_files/ [LogTimer=1230ms]
2024-01-09 08:53:03,103 VINFO command_runner.py:371 -- Running `␛[1mmkdir -p ~␛[22m␛[26m`
2024-01-09 08:53:03,103 VVINFO command_runner.py:373 -- Full command is `␛[1mssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' -o Port=22 -o ConnectTimeout=120s sky@10.42.0.36 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p ~)'␛[22m␛[26m`
sending incremental file list
ray-bootstrap-eo3xo9hr
sent 823 bytes received 155 bytes 1,956.00 bytes/sec
total size is 13,893 speedup is 14.21
2024-01-09 08:53:04,221 VINFO command_runner.py:414 -- Running `␛[1mrsync --rsh ssh -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o "ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' " -o Port=22 -o ConnectTimeout=120s -avz /tmp/ray-bootstrap-eo3xo9hr sky@10.42.0.36:~/ray_bootstrap_config.yaml␛[22m␛[26m`
Shared connection to 10.42.0.36 closed.
2024-01-09 08:53:04,276 VINFO updater.py:536 -- `rsync`ed ␛[1m/tmp/ray-bootstrap-eo3xo9hr␛[22m␛[26m (local) to ␛[1m~/ray_bootstrap_config.yaml␛[22m␛[26m (remote)
2024-01-09 08:53:04,277 INFO updater.py:233 -- ␛[1m~/ray_bootstrap_config.yaml␛[22m␛[26m from ␛[1m/tmp/ray-bootstrap-eo3xo9hr␛[22m␛[26m
2024-01-09 08:53:04,277 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Synced /tmp/ray-bootstrap-eo3xo9hr to ~/ray_bootstrap_config.yaml [LogTimer=1174ms]
2024-01-09 08:53:04,277 VINFO command_runner.py:371 -- Running `␛[1mmkdir -p ~␛[22m␛[26m`
2024-01-09 08:53:04,277 VVINFO command_runner.py:373 -- Full command is `␛[1mssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' -o Port=22 -o ConnectTimeout=120s sky@10.42.0.36 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p ~)'␛[22m␛[26m`
sending incremental file list
sent 46 bytes received 12 bytes 116.00 bytes/sec
total size is 1,678 speedup is 28.93
2024-01-09 08:53:05,356 VINFO command_runner.py:414 -- Running `␛[1mrsync --rsh ssh -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o "ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' " -o Port=22 -o ConnectTimeout=120s -avz /root/.ssh/sky-key sky@10.42.0.36:~/ray_bootstrap_key.pem␛[22m␛[26m`
2024-01-09 08:53:05,410 VINFO updater.py:536 -- `rsync`ed ␛[1m/root/.ssh/sky-key␛[22m␛[26m (local) to ␛[1m~/ray_bootstrap_key.pem␛[22m␛[26m (remote)
2024-01-09 08:53:05,410 INFO updater.py:233 -- ␛[1m~/ray_bootstrap_key.pem␛[22m␛[26m from ␛[1m/root/.ssh/sky-key␛[22m␛[26m
2024-01-09 08:53:05,411 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Synced /root/.ssh/sky-key to ~/ray_bootstrap_key.pem [LogTimer=1134ms]
2024-01-09 08:53:05,411 INFO updater.py:255 -- ␛[2m[3/7]␛[22m No worker file mounts to sync
2024-01-09 08:53:05,430 INFO updater.py:392 -- ␛[37mNew status␛[39m: ␛[1msetting-up␛[22m
2024-01-09 08:53:05,430 INFO updater.py:433 -- ␛[2m[4/7]␛[22m No initialization commands to run.
2024-01-09 08:53:05,431 INFO updater.py:437 -- ␛[2m[5/7]␛[22m ␛[36mInitializing command runner␛[39m
2024-01-09 08:53:05,431 INFO updater.py:448 -- ␛[2m[6/7]␛[22m ␛[36mRunning setup commands␛[39m
2024-01-09 08:53:05,431 INFO updater.py:470 -- ␛[2m(0/1)␛[22m ␛[1m(mkdir -p ~/.sky && cp -r ~/.sky/.runtime_files/cf929298-471e-46a2-b3a6-b6c2b6c85152 ~/.sky/sky_ray.yml) && (mkdir -p ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52 && cp -r ~/.sky/.runtime_files/0207063b-b3d9-43d0-9b85-8a6cf9de56ee/* ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52) && (mkdir -p ~/.kube && cp -r ~/.sky/.runtime_files/3ec3c333-9fe0-4a8f-a70a-486e2b386a00 ~/.kube/config); mkdir -p ~/.ssh; touch ~/.ssh/config; pip3 --version > /dev/null 2>&1 || (curl -sSL https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py && echo "PATH=$HOME/.local/bin:$PATH" >> ~/.bashrc); (type -a python | grep -q python3) || echo 'alias python=python3' >> ~/.bashrc; (type -a pip | grep -q pip3) || echo 'alias pip=pip3' >> ~/.bashrc; which conda > /dev/null 2>&1 || (wget -nc https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh -O Miniconda3-Linux-x86_64.sh && bash Miniconda3-Linux-x86_64.sh -b && eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true); which conda | grep /opt/conda || conda init > /dev/null; source ~/.bashrc; mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app && touch ~/.sudo_as_admin_successful; (pip3 list | grep skypilot && [ "$(cat ~/.sky/wheels/current_sky_wheel_hash)" == "ef67dfdf09e66a21bcf1a5727a379f52" ]) || (pip3 uninstall skypilot -y; pip3 install "$(echo ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52/skypilot-1.0.0.dev0*.whl)[remote]" && echo "ef67dfdf09e66a21bcf1a5727a379f52" > ~/.sky/wheels/current_sky_wheel_hash || exit 1); sudo bash -c 'rm -rf /etc/security/limits.d; echo "* soft nofile 1048576" >> /etc/security/limits.conf; echo "* hard nofile 1048576" >> /etc/security/limits.conf'; sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload; mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n" >> ~/.ssh/config; python3 -c "from sky.skylet.ray_patches import patch; patch()" || exit 1; [ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');␛[22m␛[26m
2024-01-09 08:53:05,431 VINFO command_runner.py:371 -- Running `␛[1m(mkdir -p ~/.sky && cp -r ~/.sky/.runtime_files/cf929298-471e-46a2-b3a6-b6c2b6c85152 ~/.sky/sky_ray.yml) && (mkdir -p ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52 && cp -r ~/.sky/.runtime_files/0207063b-b3d9-43d0-9b85-8a6cf9de56ee/* ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52) && (mkdir -p ~/.kube && cp -r ~/.sky/.runtime_files/3ec3c333-9fe0-4a8f-a70a-486e2b386a00 ~/.kube/config); mkdir -p ~/.ssh; touch ~/.ssh/config; pip3 --version > /dev/null 2>&1 || (curl -sSL https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py && echo "PATH=$HOME/.local/bin:$PATH" >> ~/.bashrc); (type -a python | grep -q python3) || echo 'alias python=python3' >> ~/.bashrc; (type -a pip | grep -q pip3) || echo 'alias pip=pip3' >> ~/.bashrc; which conda > /dev/null 2>&1 || (wget -nc https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh -O Miniconda3-Linux-x86_64.sh && bash Miniconda3-Linux-x86_64.sh -b && eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true); which conda | grep /opt/conda || conda init > /dev/null; source ~/.bashrc; mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app && touch ~/.sudo_as_admin_successful; (pip3 list | grep skypilot && [ "$(cat ~/.sky/wheels/current_sky_wheel_hash)" == "ef67dfdf09e66a21bcf1a5727a379f52" ]) || (pip3 uninstall skypilot -y; pip3 install "$(echo ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52/skypilot-1.0.0.dev0*.whl)[remote]" && echo "ef67dfdf09e66a21bcf1a5727a379f52" > ~/.sky/wheels/current_sky_wheel_hash || exit 1); sudo bash -c 'rm -rf /etc/security/limits.d; echo "* soft nofile 1048576" >> /etc/security/limits.conf; echo "* hard nofile 1048576" >> /etc/security/limits.conf'; sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload; mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n" >> ~/.ssh/config; python3 -c "from sky.skylet.ray_patches import patch; patch()" || exit 1; [ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');␛[22m␛[26m`
␛[01;31m␛[Kskypilot␛[m␛[K 1.0.0.dev0
DefaultTasksMax=infinity
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/log_monitor.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/log_monitor.py-v2.4.0.orig)
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py-v2.4.0.orig)
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py-v2.4.0.orig)
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py-v2.4.0.orig)
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py-v2.4.0.orig)
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py-v2.4.0.orig)
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py-v2.4.0.orig)
patching file /home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py (read from /home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py-v2.4.0.orig)
Shared connection to 10.42.0.36 closed.
␛[0m2024-01-09 08:53:05,431 VVINFO command_runner.py:373 -- Full command is `␛[1mssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/6b7365ba77/%C -o ControlPersist=10s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/port-forward-proxy-cmd.sh' -o Port=22 -o ConnectTimeout=120s sky@10.42.0.36 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && ((mkdir -p ~/.sky && cp -r ~/.sky/.runtime_files/cf929298-471e-46a2-b3a6-b6c2b6c85152 ~/.sky/sky_ray.yml) && (mkdir -p ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52 && cp -r ~/.sky/.runtime_files/0207063b-b3d9-43d0-9b85-8a6cf9de56ee/* ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52) && (mkdir -p ~/.kube && cp -r ~/.sky/.runtime_files/3ec3c333-9fe0-4a8f-a70a-486e2b386a00 ~/.kube/config); mkdir -p ~/.ssh; touch ~/.ssh/config; pip3 --version > /dev/null 2>&1 || (curl -sSL https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py && echo "PATH=$HOME/.local/bin:$PATH" >> ~/.bashrc); (type -a python | grep -q python3) || echo '"'"'alias python=python3'"'"' >> ~/.bashrc; (type -a pip | grep -q pip3) || echo '"'"'alias pip=pip3'"'"' >> ~/.bashrc; which conda > /dev/null 2>&1 || (wget -nc https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh -O Miniconda3-Linux-x86_64.sh && bash Miniconda3-Linux-x86_64.sh -b && eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true); which conda | grep /opt/conda || conda init > /dev/null; source ~/.bashrc; mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app && touch ~/.sudo_as_admin_successful; (pip3 list | grep skypilot && [ "$(cat ~/.sky/wheels/current_sky_wheel_hash)" == "ef67dfdf09e66a21bcf1a5727a379f52" ]) || (pip3 uninstall skypilot -y; pip3 install "$(echo ~/.sky/wheels/ef67dfdf09e66a21bcf1a5727a379f52/skypilot-1.0.0.dev0*.whl)[remote]" && echo "ef67dfdf09e66a21bcf1a5727a379f52" > ~/.sky/wheels/current_sky_wheel_hash || exit 1); sudo bash -c '"'"'rm -rf /etc/security/limits.d; echo "* soft nofile 1048576" >> /etc/security/limits.conf; echo "* hard nofile 1048576" >> /etc/security/limits.conf'"'"'; sudo grep -e '"'"'^DefaultTasksMax'"'"' /etc/systemd/system.conf || (sudo bash -c '"'"'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'"'"'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload; mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n" >> ~/.ssh/config; python3 -c "from sky.skylet.ray_patches import patch; patch()" || exit 1; [ -f /etc/fuse.conf ] && sudo sed -i '"'"'s/#user_allow_other/user_allow_other/g'"'"' /etc/fuse.conf || (sudo sh -c '"'"'echo "user_allow_other" > /etc/fuse.conf'"'"');)'␛[22m␛[26m`
2024-01-09 08:53:09,350 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Setup commands succeeded [LogTimer=3920ms]
2024-01-09 08:53:09,350 INFO updater.py:489 -- ␛[2m[7/7]␛[22m ␛[36mStarting the Ray runtime␛[39m
2024-01-09 08:53:09,350 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Ray start commands succeeded [LogTimer=0ms]
2024-01-09 08:53:09,350 INFO log_timer.py:25 -- NodeUpdater: mycluster-d8f4-ray-head: Applied config d70749231af87810bb73105e12cbf919596a5b9a [LogTimer=10606ms]
2024-01-09 08:53:09,370 INFO updater.py:188 -- ␛[37mNew status␛[39m: ␛[1mup-to-date␛[22m
2024-01-09 08:53:09,376 INFO commands.py:868 -- ␛[36mUseful commands␛[39m
2024-01-09 08:53:09,376 INFO commands.py:870 -- Monitor autoscaling with
2024-01-09 08:53:09,376 INFO commands.py:871 -- ␛[1m ray exec /root/.sky/generated/mycluster.yml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'␛[22m
2024-01-09 08:53:09,376 INFO commands.py:878 -- Connect to a terminal on the cluster head:
2024-01-09 08:53:09,376 INFO commands.py:879 -- ␛[1m ray attach /root/.sky/generated/mycluster.yml␛[22m
2024-01-09 08:53:09,376 INFO commands.py:882 -- Get a remote shell to the cluster manually:
2024-01-09 08:53:09,376 INFO commands.py:883 -- ssh -o IdentitiesOnly=yes -i ~/.ssh/sky-key sky@10.42.0.36
Warning: Permanently added '127.0.0.1' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.42.0.36' (ECDSA) to the list of known hosts.
======== Autoscaler status: 2024-01-08 16:53:10.671846 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray_head_default
1 ray_worker_default
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/8.0 A10
0.0/140.0 CPU
0.0/8.0 GPU
0B/560.00GiB memory
0B/953.67MiB object_store_memory
Demands:
(no resource demands)
and my YAML is like this:
resources:
  # Optional; if left out, automatically pick the cheapest cloud.
  cloud: kubernetes
  # 1x NVIDIA V100 GPU
  accelerators: a10:4

num_nodes: 2

# file_mounts:
#   /datasets: /nas/data3/public/public_data/coco

# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: /root/SparseR-CNN

# Typical use: pip install -r requirements.txt
# Invoked under the workdir (i.e., can use its files).
setup: |
  echo "Running setup."
  # git clone https://github.com/PeizeSun/SparseR-CNN.git
  pip install torch==2.0.0 torchvision==0.15.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
  pip install scipy
  python setup.py build develop

# Typical use: make use of resources, such as running training.
# Invoked under the workdir (i.e., can use its files).
run: |
  echo "Hello, SkyPilot!"
  num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
  master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
  port=12355
  url=tcp://${master_addr}:${port}
  echo $master_addr
  echo $url
  export OMP_NUM_THREADS=70
  export CUDA_VISIBLE_DEVICES=0,1,2,3
  export NCCL_SOCKET_IFNAME=eth0
  export NCCL_IB_DISABLE=0
  export NCCL_DEBUG=INFO
  export GLOO_SOCKET_IFNAME=eth0
  python projects/SparseRCNN/train_net.py --num-gpus ${SKYPILOT_NUM_GPUS_PER_NODE} \
    --config-file projects/SparseRCNN/configs/sparsercnn.res50.100pro.3x.yaml \
    --machine-rank ${SKYPILOT_NODE_RANK} --num-machines $num_nodes --dist-url $url
and it is an on-prem Kubernetes deployment.
Normally, after loading the model, my training script starts training. Some of the output under normal circumstances looks like this (it works fine for both single-node and local multi-machine training):
[01/08 17:01:22 d2.checkpoint.c2_model_loading]: The checkpoint state_dict contains keys that are not used by the model:
(head, rank=0, pid=2701) stem.fc.{bias, weight}
[01/08 17:22:51 d2.engine.train_loop]: Starting training from iteration 0
and then the training information (losses, etc.) follows.
But now it stops after loading the model, like this:
[01/08 17:01:22 d2.checkpoint.c2_model_loading]: The checkpoint state_dict contains keys that are not used by the model:
(head, rank=0, pid=2701) stem.fc.{bias, weight}
(There is no further output, but GPU and CPU monitoring shows resources still in use, as if it were still running.)
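A minimal rendezvous smoke test can help tell whether the hang is in the distributed setup (rendezvous/collectives) or in the model code itself. This is only a sketch, not part of SparseR-CNN; the RANK, WORLD_SIZE, and DIST_URL variables are assumed here to be exported by the run: section (SkyPilot itself provides SKYPILOT_NODE_IPS, SKYPILOT_NODE_RANK, and SKYPILOT_NUM_GPUS_PER_NODE):

# smoke_test.py - hypothetical helper, not from the repo.
import os
import torch
import torch.distributed as dist

def main():
    # RANK/WORLD_SIZE/DIST_URL are assumed to be exported by the run: section,
    # e.g. derived from SKYPILOT_NODE_RANK and SKYPILOT_NODE_IPS.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group(backend="gloo",
                            init_method=os.environ["DIST_URL"],
                            rank=rank,
                            world_size=world_size)
    t = torch.ones(1) * rank
    dist.all_reduce(t)  # if the nodes cannot reach each other, the job hangs here
    print(f"rank {rank}/{world_size}: all_reduce result = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()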
I suspect it's a CPU limit issue. However, when I launch a cluster with 70 CPUs and export OMP_NUM_THREADS=70 in the YAML, nothing seems to change.
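(One way to check whether Kubernetes actually applied a CPU limit to the pod - a sketch using the pod name from the kubectl get pods output above, and assuming cgroup v1 inside the container:)

kubectl get pod mycluster-d8f4-ray-head -o jsonpath='{.spec.containers[*].resources}'
# Inside the pod, a cpu.cfs_quota_us of -1 means "no CPU limit" (cgroup v1):
kubectl exec mycluster-d8f4-ray-head -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us /sys/fs/cgroup/cpu/cpu.cfs_period_us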
Thanks for the details @fourfireM - I haven't come across this issue before. Can you try bumping up memory and CPU using the --memory and --cpus flags to sky launch?
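For example (the values here are only placeholders; adjust them to what your nodes can offer):

sky launch -c mycluster task.yaml --cpus 32+ --memory 128+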
Meanwhile, I will also try reproducing it. Should I use your YAML with the code from https://github.com/PeizeSun/SparseR-CNN.git and run sky launch task.yaml? Or are there other steps I should follow?
Thank you for your answer and help. I am using the code from https://github.com/PeizeSun/SparseR-CNN.git to run the training; the run: section in the YAML is also based on SparseR-CNN. I would appreciate it if you could try running the code.
In fact there is nothing wrong with the project code itself; it just seems to be far too slow when using multiple nodes. I let the training keep running and it took about 8 hours to finish the first iteration of the first epoch.
Is there anything I can do to improve the multi-node training? By the way, I've tried using the --memory and --cpus flags to sky launch as you suggested.
Hmm, that's interesting. I've previously run multi-GPU multi-node training (the NeMo example) on a GKE cluster and I recall it working fine.
@landscapepainter is investigating this now. One blind guess would be to check whether ulimits are set too low, and whether increasing them helps.
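(For reference, ulimits can be checked on both the head and the worker pod directly - pod names taken from the kubectl get pods output above:)

kubectl exec mycluster-d8f4-ray-head -- sh -c 'ulimit -a'
kubectl exec mycluster-d8f4-ray-worker-pkwkl -- sh -c 'ulimit -a'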
Thanks for the reply. For reference, this is my training output on a single node (time: 0.4544):
(worker1, rank=1, pid=1728, ip=10.42.2.30) [01/09 02:37:02 d2.utils.events]: eta: 1 day, 10:05:09 iter: 19 total_loss: 60.7 loss_ce: 2.071 loss_giou: 2.395 loss_bbox: 8.479 loss_ce_0: 2.114 loss_giou_0: 2.014 loss_bbox_0: 4.383 loss_ce_1: 2.232
loss_giou_1: 1.957 loss_bbox_1: 4.273 loss_ce_2: 2.185 loss_giou_2: 2.248 loss_bbox_2: 4.196 loss_ce_3: 2.286 loss_giou_3: 2.2 loss_bbox_3: 6.94 loss_ce_4: 2.15 loss_giou_4: 2.238 loss_bbox_4: 6.638 time: 0.4544 data_time: 0.3754 lr: 7.2025e-07 max_mem: 2873M
This is my training output on multiple nodes (time: 1386.3151):
(head, rank=0, pid=8259) [01/09 10:46:29 d2.utils.events]: eta: 4333 days, 21:08:17 iter: 19 total_loss: 60.84 loss_ce: 2.087 loss_giou: 2.371 loss_bbox: 8.264 loss_ce_0: 2.09 loss_giou_0: 2.044 loss_bbox_0: 4.391 loss_ce_1: 2.253
loss_giou_1: 2.033 loss_bbox_1: 4.141 loss_ce_2: 2.209 loss_giou_2: 2.225 loss_bbox_2: 4.027 loss_ce_3: 2.306 loss_giou_3: 2.27 loss_bbox_3: 6.824 loss_ce_4: 2.081 loss_giou_4: 2.221 loss_bbox_4: 6.377 time: 1386.3151 data_time: 0.1655 lr: 7.2025e-07 max_mem: 2907
We can see that the per-iteration time is off by a very large factor (roughly 1386.3 s vs 0.45 s per iteration, about 3000x slower).
In addition, I used SSH to enter the cluster head node and ran ulimit -a, which gave the following output,
(base) sky@mycluster-d8f4-ray-head:~$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2579663
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
so I hope this can serve as some kind of reference.
@fourfireM Thanks for reporting this issue! Do you mind sharing which versions of torch and torchvision are currently being used in your setting? I was trying to reproduce with the task YAML you provided, which installs torch and torchvision with pip install torch==2.0.0 torchvision==0.15.1 -i https://pypi.tuna.tsinghua.edu.cn/simple, but I encountered an error running this YAML due to a torchvision ImportError. I'm assuming the versions of pytorch and torchvision being used are different. They can be checked by running:
import torch
print(torch.__version__)
import torchvision
print(torchvision.__version__)
@landscapepainter In fact, -i https://pypi.tuna.tsinghua.edu.cn/simple is just the mirror I used for the download; you can simply remove it.
For reference, my versions of torch and torchvision are as follows:
>>> import torch
>>> import torchvision
>>> print(torch.__version__)
2.0.0+cu117
>>> print(torchvision.__version__)
0.15.1+cu117
@fourfireM Thanks for the confirmation! It seems we are using the same versions. I'm still trying to reproduce your error. I have one suggestion and a question:
Suggestion: the Ray job scheduler uses CUDA_VISIBLE_DEVICES to assign jobs to GPUs, so it's possible that more than one training job is running on those GPUs if you specify export CUDA_VISIBLE_DEVICES=0,1,2,3 yourself. Do you mind trying it with that line removed, to see if there's a speed boost?
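A quick way to see what each process actually gets once the scheduler assigns the devices (just a sanity-check snippet, in the same spirit as the version check above):

import os
import torch
# With the manual export removed, this should show the value set by the scheduler.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPU count    =", torch.cuda.device_count())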
Question: While trying to reproduce your result, I'm encountering the following AttributeError; it seems that torch 2.0.0 no longer has the _sync_params_and_buffers attribute on DistributedDataParallel. What was your workaround?
(head, rank=0, pid=2972) Traceback (most recent call last):
(head, rank=0, pid=2972) File "/home/sky/sky_workdir/projects/SparseRCNN/train_net.py", line 153, in <module>
(head, rank=0, pid=2972) launch(
(head, rank=0, pid=2972) File "/home/sky/sky_workdir/detectron2/engine/launch.py", line 55, in launch
(head, rank=0, pid=2972) mp.spawn(
(head, rank=0, pid=2972) File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
(head, rank=0, pid=2972) return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
(head, rank=0, pid=2972) File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
(head, rank=0, pid=2972) while not context.join():
(head, rank=0, pid=2972) File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
(head, rank=0, pid=2972) raise ProcessRaisedException(msg, error_index, failed_process.pid)
(head, rank=0, pid=2972) torch.multiprocessing.spawn.ProcessRaisedException:
(head, rank=0, pid=2972)
(head, rank=0, pid=2972) -- Process 1 terminated with the following error:
(head, rank=0, pid=2972) Traceback (most recent call last):
(head, rank=0, pid=2972) File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
(head, rank=0, pid=2972) fn(i, *args)
(head, rank=0, pid=2972) File "/home/sky/sky_workdir/detectron2/engine/launch.py", line 94, in _distributed_worker
(head, rank=0, pid=2972) main_func(*args)
(head, rank=0, pid=2972) File "/home/sky/sky_workdir/projects/SparseRCNN/train_net.py", line 146, in main
(head, rank=0, pid=2972) trainer.resume_or_load(resume=args.resume)
(head, rank=0, pid=2972) File "/home/sky/sky_workdir/detectron2/engine/defaults.py", line 334, in resume_or_load
(head, rank=0, pid=2972) self.model._sync_params_and_buffers()
(head, rank=0, pid=2972) File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
(head, rank=0, pid=2972) raise AttributeError("'{}' object has no attribute '{}'".format(
(head, rank=0, pid=2972) AttributeError: 'DistributedDataParallel' object has no attribute '_sync_params_and_buffers'
I have encountered this problem; it is caused by the pytorch version being too high. My way of dealing with it was to comment out lines 333-335 in detectron2/engine/defaults.py, as shown below; after that the code works fine. @landscapepainter
330         if isinstance(self.model, DistributedDataParallel):
331             # broadcast loaded data/model from the first rank, because other
332             # machines may not have access to the checkpoint file
333             # if TORCH_VERSION >= (1, 7):
334             #     self.model._sync_params_and_buffers()
335             self.start_iter = comm.all_gather(self.start_iter)[0]
336
337     def build_hooks(self):
338         # ... code ...
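For what it's worth, instead of dropping the synchronization entirely, a similar effect could be achieved with public torch.distributed APIs. This is only a sketch (a hypothetical helper, not detectron2 code) of broadcasting rank 0's weights to the other ranks:

import torch.distributed as dist

def broadcast_model_from_rank0(ddp_model):
    # ddp_model is a torch.nn.parallel.DistributedDataParallel instance;
    # broadcast every parameter and buffer in place so all ranks start
    # from the weights loaded on rank 0.
    for tensor in list(ddp_model.module.parameters()) + list(ddp_model.module.buffers()):
        dist.broadcast(tensor.data, src=0)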
Did you run into the same problem when you tried to run it? I'm debugging the code internally right now, and I'm finding that the multi-node run gets stuck while merging the model weights. @landscapepainter
@fourfireM Thanks for sharing what you discovered! I haven't made much progress yet since my last reply. I'll get back to this as soon as possible with what you provided in the previous comment. Meanwhile, if you happen to discover any other related problems, please share it with us! I absolutely appreciate your insights.
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This issue was closed because it has been stalled for 10 days with no activity.
I'm trying to do distributed training across multiple machines and GPUs after bringing up a multi-node cluster with Kubernetes (k8s). However, my training always stops before the training loop starts, without reporting an error or giving any further indication. What could be the cause of this? I currently suspect the CPU allocation is too low; what would need to be done to fix this? I set the OMP_NUM_THREADS environment variable following some of the tips in the replies, but nothing seems to change.