microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

Failing to Join to Cluster during deployment #5792

Closed: brianjsl closed this 1 year ago

brianjsl commented 1 year ago

Short summary about the issue/question: I am trying to deploy OpenPAI v1.8.0 on a cluster that includes the worker node pai-worker-04. While running /bin/bash quick-start-kubespray.sh, I hit the following error at the join-to-cluster step.

Error message received: fatal: [pai-worker-04]: FAILED! => {"changed": true, "cmd": ["timeout", "-k", "120s", "120s", "/usr/local/bin/kubeadm", "join", "--config", "/etc/kubernetes/kubeadm-client.conf", "--ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests"], "delta": "0:02:00.008026", "end": "2022-07-14 17:04:35.817974", "msg": "non-zero return code", "rc": 124, "start": "2022-07-14 17:02:35.809948", "stderr": "\t[WARNING DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty", "stderr_lines": ["\t[WARNING DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty"], "stdout": "[preflight] Running pre-flight checks", "stdout_lines": ["[preflight] Running pre-flight checks"]}

Any help?

brianjsl commented 1 year ago

Issue closed. The problem was with the firewall on the master node.
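
For anyone hitting the same symptom: rc 124 is the exit status GNU timeout reports when the time limit is reached, so kubeadm join hung for the full 120s instead of failing outright. That fits a worker that cannot reach the master. Below is a minimal sketch of a reachability check to run on the worker before retrying. It assumes the API server listens on the kubeadm default port 6443, and the hostname pai-master is hypothetical; neither is stated in this issue, so substitute your own values.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (packets dropped by a firewall),
        # and name-resolution failures.
        return False


# Example usage (hypothetical master hostname, assumed default API server port):
#   print(can_connect("pai-master", 6443))
```

If this returns False on the worker but True on the master itself, a firewall rule between the two nodes is a likely cause. A True result only shows the port is reachable; it does not prove the join will succeed.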