Did the cluster deploy successfully in the end? If it did, this is probably just node registration being slow: the nodes had not finished registering by the time the script reached that step. I'll add a check to handle this later.
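Something along these lines, just a sketch using retries/until (it assumes kubectl is usable on the first master and the hostname-IP node-name format seen in the logs below; the task and variable names here are illustrative, not the playbook's actual task):

- name: Wait for worker nodes to register with the apiserver
  # Sketch only: poll from the first master until the node object appears
  shell: "kubectl get node {{ hostvars[item].hostname }}-{{ item }}"
  loop: "{{ groups['worker'] }}"
  delegate_to: "{{ groups['master'][0] }}"
  register: node_check
  until: node_check.rc == 0
  retries: 30
  delay: 10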
Hi, I got held up for a few days. After deploying the nodes, kubectl get nodes shows no nodes joining the cluster.
I just created a test cluster on local VMs, a completely clean CentOS 7 environment, and ran the latest playbook end to end. This is my inventory:
# List the etcd servers and their hostnames in this group (hostnames are checked to be hyphen-separated); the cluster node name is taken from the last two hyphen-separated segments of the hostname
[etcd]
192.168.137.201 hostname=etcd-01 ansible_ssh_pass="123456"
[haproxy]
192.168.137.101 hostname=haproxy-01 type=BACKUP priority=90 ansible_ssh_pass="123456"
192.168.137.102 hostname=haproxy-02 type=MASTER priority=100 ansible_ssh_pass="123456"
# List the master servers and their hostnames in this group (hostnames are checked to be hyphen-separated); the cluster node name is taken from the last two hyphen-separated segments of the hostname
[master]
192.168.137.10 hostname=master-01 ansible_ssh_pass="123456"
# List the worker servers and their hostnames in this group (hostnames are checked to be hyphen-separated); the cluster node name is taken from the last two hyphen-separated segments of the hostname
# Append gpu=true to mark a node as a GPU node; the runtime will then be configured for GPU use and the nvidia.com/gpu=true label will be added
# For non-GPU nodes, the gpu option can be omitted
# Before enabling GPU support, configure the package repository on the node as described at https://nvidia.github.io/libnvidia-container/ or mirror the required packages to a private repository
[worker]
192.168.137.11 hostname=worker-01 gpu=false ansible_ssh_pass="123456"
The kubelet log shows:
Feb 15 13:36:26 worker-01 kubelet[15536]: E0215 13:36:26.179937 15536 certificate_manager.go:471] kubernetes.io/kube-apiserver-client-kubelet: Failed while requesting a signed certificate from the control plane: cannot create certificate signing request: Post "https://172.16.90.100:6443/apis/certificates.k8s.io/v1/certificatesigningrequests": dial tcp 172.16.90.100:6443: connect: connection refused
I noticed from ip a that no address in the 172.16.90.x range exists on the node:
# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:0c:29:aa:a3:91 brd ff:ff:ff:ff:ff:ff
inet 192.168.137.11/24 brd 192.168.137.255 scope global noprefixroute ens33
valid_lft forever preferred_lft forever
inet6 fe80::876f:be9e:31dd:d657/64 scope link noprefixroute
valid_lft forever preferred_lft forever
The containerd log shows: cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config
# systemctl status containerd -l
● containerd.service - containerd container runtime
Loaded: loaded (/usr/lib/systemd/system/containerd.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2023-02-15 13:29:10 CST; 11min ago
Docs: https://containerd.io
Main PID: 18078 (containerd)
Tasks: 9
Memory: 22.3M
CGroup: /system.slice/containerd.service
└─18078 /usr/local/bin/containerd
Feb 15 13:29:10 master-01 containerd[18078]: time="2023-02-15T13:29:10.934210761+08:00" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
Feb 15 13:29:10 master-01 containerd[18078]: time="2023-02-15T13:29:10.935368901+08:00" level=info msg=serving... address=/run/containerd/containerd.sock.ttrpc
Feb 15 13:29:10 master-01 containerd[18078]: time="2023-02-15T13:29:10.935427990+08:00" level=info msg=serving... address=/run/containerd/containerd.sock
Feb 15 13:29:10 master-01 containerd[18078]: time="2023-02-15T13:29:10.936674204+08:00" level=info msg="containerd successfully booted in 0.044534s"
Feb 15 13:29:10 master-01 containerd[18078]: time="2023-02-15T13:29:10.936722241+08:00" level=info msg="Start subscribing containerd event"
Feb 15 13:29:10 master-01 containerd[18078]: time="2023-02-15T13:29:10.936762161+08:00" level=info msg="Start recovering state"
Feb 15 13:29:10 master-01 containerd[18078]: time="2023-02-15T13:29:10.936882597+08:00" level=info msg="Start event monitor"
Here is the playbook run summary:
PLAY RECAP *********************************************************************************************************************************************************************************************************************************************************************************************
192.168.137.10 : ok=92 changed=73 unreachable=0 failed=1 skipped=13 rescued=0 ignored=4
192.168.137.101 : ok=39 changed=27 unreachable=0 failed=0 skipped=6 rescued=0 ignored=0
192.168.137.102 : ok=35 changed=27 unreachable=0 failed=0 skipped=6 rescued=0 ignored=0
192.168.137.11 : ok=62 changed=49 unreachable=0 failed=1 skipped=13 rescued=0 ignored=0
192.168.137.201 : ok=39 changed=29 unreachable=0 failed=0 skipped=5 rescued=0 ignored=0
localhost : ok=60 changed=47 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
The failed tasks on 192.168.137.10 and 192.168.137.11:
TASK [worker : Check whether the worker is ready?] *****************************************************************************************************************************************************************************************************************************************************
FAILED - RETRYING: [192.168.137.10]: Check whether the worker is ready? (5 retries left).
FAILED - RETRYING: [192.168.137.11 -> 192.168.137.10]: Check whether the worker is ready? (5 retries left).
FAILED - RETRYING: [192.168.137.10]: Check whether the worker is ready? (4 retries left).
FAILED - RETRYING: [192.168.137.11 -> 192.168.137.10]: Check whether the worker is ready? (4 retries left).
FAILED - RETRYING: [192.168.137.10]: Check whether the worker is ready? (3 retries left).
FAILED - RETRYING: [192.168.137.11 -> 192.168.137.10]: Check whether the worker is ready? (3 retries left).
FAILED - RETRYING: [192.168.137.10]: Check whether the worker is ready? (2 retries left).
FAILED - RETRYING: [192.168.137.11 -> 192.168.137.10]: Check whether the worker is ready? (2 retries left).
FAILED - RETRYING: [192.168.137.10]: Check whether the worker is ready? (1 retries left).
FAILED - RETRYING: [192.168.137.11 -> 192.168.137.10]: Check whether the worker is ready? (1 retries left).
failed: [192.168.137.10] (item=192.168.137.10) => {"ansible_loop_var": "item", "attempts": 5, "changed": true, "cmd": "kubectl get node | grep master-01-192.168.137.10", "delta": "0:00:00.122645", "end": "2023-02-15 13:34:14.207115", "item": "192.168.137.10", "msg": "non-zero return code", "rc": 1, "start": "2023-02-15 13:34:14.084470", "stderr": "No resources found", "stderr_lines": ["No resources found"], "stdout": "", "stdout_lines": []}
failed: [192.168.137.11 -> 192.168.137.10] (item=192.168.137.10) => {"ansible_loop_var": "item", "attempts": 5, "changed": true, "cmd": "kubectl get node | grep worker-01-192.168.137.11", "delta": "0:00:00.084675", "end": "2023-02-15 13:34:14.201377", "item": "192.168.137.10", "msg": "non-zero return code", "rc": 1, "start": "2023-02-15 13:34:14.116702", "stderr": "No resources found", "stderr_lines": ["No resources found"], "stdout": "", "stdout_lines": []}
The ignored failures on 192.168.137.10:
TASK [worker : Check if bootstrap-token exists] ********************************************************************************************************************************************************************************************************************************************************
fatal: [192.168.137.10]: FAILED! => {"changed": true, "cmd": "kubectl -n kube-system get secret bootstrap-token-f24e1b", "delta": "0:00:00.130208", "end": "2023-02-15 13:31:02.274777", "msg": "non-zero return code", "rc": 1, "start": "2023-02-15 13:31:02.144569", "stderr": "Error from server (NotFound): secrets \"bootstrap-token-f24e1b\" not found", "stderr_lines": ["Error from server (NotFound): secrets \"bootstrap-token-f24e1b\" not found"], "stdout": "", "stdout_lines": []}
...ignoring
TASK [worker : Create bootstrap-token secret] **********************************************************************************************************************************************************************************************************************************************************
changed: [192.168.137.10]
TASK [worker : Check if clusterrolebinding kubelet-bootstrap exists] ***********************************************************************************************************************************************************************************************************************************
fatal: [192.168.137.10]: FAILED! => {"changed": true, "cmd": "kubectl get clusterrolebinding kubelet-bootstrap", "delta": "0:00:00.062390", "end": "2023-02-15 13:31:03.230850", "msg": "non-zero return code", "rc": 1, "start": "2023-02-15 13:31:03.168460", "stderr": "Error from server (NotFound): clusterrolebindings.rbac.authorization.k8s.io \"kubelet-bootstrap\" not found", "stderr_lines": ["Error from server (NotFound): clusterrolebindings.rbac.authorization.k8s.io \"kubelet-bootstrap\" not found"], "stdout": "", "stdout_lines": []}
...ignoring
TASK [worker : Create clusterrolebinding kubelet-bootstrap] ********************************************************************************************************************************************************************************************************************************************
changed: [192.168.137.10]
TASK [worker : Check if node-autoapprove-bootstrap exists] *********************************************************************************************************************************************************************************************************************************************
fatal: [192.168.137.10]: FAILED! => {"changed": true, "cmd": "kubectl get clusterrolebinding node-autoapprove-bootstrap", "delta": "0:00:00.063945", "end": "2023-02-15 13:31:04.234837", "msg": "non-zero return code", "rc": 1, "start": "2023-02-15 13:31:04.170892", "stderr": "Error from server (NotFound): clusterrolebindings.rbac.authorization.k8s.io \"node-autoapprove-bootstrap\" not found", "stderr_lines": ["Error from server (NotFound): clusterrolebindings.rbac.authorization.k8s.io \"node-autoapprove-bootstrap\" not found"], "stdout": "", "stdout_lines": []}
...ignoring
TASK [worker : Create clusterrolebinding node-autoapprove-bootstrap] ***********************************************************************************************************************************************************************************************************************************
changed: [192.168.137.10]
TASK [worker : Check if clusterrolebinding node-autoapprove-certificate-rotation exists] ***************************************************************************************************************************************************************************************************************
fatal: [192.168.137.10]: FAILED! => {"changed": true, "cmd": "kubectl get clusterrolebinding node-autoapprove-certificate-rotation", "delta": "0:00:00.111519", "end": "2023-02-15 13:31:05.209951", "msg": "non-zero return code", "rc": 1, "start": "2023-02-15 13:31:05.098432", "stderr": "Error from server (NotFound): clusterrolebindings.rbac.authorization.k8s.io \"node-autoapprove-certificate-rotation\" not found", "stderr_lines": ["Error from server (NotFound): clusterrolebindings.rbac.authorization.k8s.io \"node-autoapprove-certificate-rotation\" not found"], "stdout": "", "stdout_lines": []}
...ignoring
The load balancer VIP and port need to be specified in group_vars/all.yml.
If you only have one master node, you can skip installing haproxy and keepalived: in group_vars/all.yml, set the lb IP and port to that master node's IP and the apiserver port, then run the following command.
ansible-playbook cluster.yml -i inventory --skip-tags=haproxy,keepalived
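For a single master that part of group_vars/all.yml would look roughly like this (a sketch only; the loadbalance.ip / loadbalance.port key names are assumed here, so check the file for the exact keys your version uses):

# group_vars/all.yml, single-master sketch (key names assumed)
loadbalance:
  ip: 192.168.137.10    # the single master's IP instead of a VIP
  port: 6443            # the kube-apiserver port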
Thanks, that was it. I cleaned up the environment, changed loadbalance.ip, re-ran the playbook, and it is working now. My oversight :)
(The output above was too long, so I only pasted part of it.)