Closed 123toorc closed 4 years ago
I think it is a problem with kubespray.
Could you please paste all logs from quick-start-kubespray.sh?
Hi, thanks for the help. Here is the log; I pasted the kubernetes play part. The earlier part, including the etcd and docker plays, looked correct to me, but if you still need it I will be happy to paste it too. Thanks!!!
TASK [kubespray-defaults : Configure defaults] **********************************************************
Thursday 17 September 2020 10:50:11 +0800 (0:00:01.023) 0:04:39.798 ****
ok: [gpu3.test.bbdops.com] => {
"msg": "Check roles/kubespray-defaults/defaults/main.yml"
}
ok: [gpu2] => {
"msg": "Check roles/kubespray-defaults/defaults/main.yml"
}
TASK [kubernetes/node : look up docker cgroup driver] ***************************************************
Thursday 17 September 2020 10:50:11 +0800 (0:00:00.277) 0:04:40.075 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : set standalone_kubelet fact] ****************************************************
Thursday 17 September 2020 10:50:12 +0800 (0:00:00.573) 0:04:40.649 ****
ok: [gpu3.test.bbdops.com]
ok: [gpu2]
TASK [kubernetes/node : set kubelet_cgroup_driver_detected fact for containerd] *************************
Thursday 17 September 2020 10:50:12 +0800 (0:00:00.264) 0:04:40.914 ****
TASK [kubernetes/node : set kubelet_cgroup_driver_detected fact for other engines] **********************
Thursday 17 September 2020 10:50:12 +0800 (0:00:00.161) 0:04:41.075 ****
ok: [gpu3.test.bbdops.com]
ok: [gpu2]
TASK [kubernetes/node : os specific vars] ***************************************************************
Thursday 17 September 2020 10:50:12 +0800 (0:00:00.267) 0:04:41.342 ****
TASK [kubernetes/node : Pre-upgrade | check if kubelet container exists] ********************************
Thursday 17 September 2020 10:50:13 +0800 (0:00:00.180) 0:04:41.523 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : Pre-upgrade | copy /var/lib/cni from kubelet] ***********************************
Thursday 17 September 2020 10:50:13 +0800 (0:00:00.540) 0:04:42.064 ****
TASK [kubernetes/node : Pre-upgrade | ensure kubelet container service is stopped if using host deployment] ***
Thursday 17 September 2020 10:50:13 +0800 (0:00:00.144) 0:04:42.209 ****
TASK [kubernetes/node : Pre-upgrade | ensure kubelet container is removed if using host deployment] *****
Thursday 17 September 2020 10:50:13 +0800 (0:00:00.153) 0:04:42.363 ****
TASK [kubernetes/node : Ensure /var/lib/cni exists] *****************************************************
Thursday 17 September 2020 10:50:14 +0800 (0:00:00.166) 0:04:42.529 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : install | Copy kubeadm binary from download dir] ********************************
Thursday 17 September 2020 10:50:14 +0800 (0:00:00.491) 0:04:43.021 ****
ok: [gpu2 -> 10.28.100.162]
TASK [kubernetes/node : install | Set kubeadm binary permissions] ***************************************
Thursday 17 September 2020 10:50:15 +0800 (0:00:00.581) 0:04:43.603 ****
ok: [gpu2]
TASK [kubernetes/node : install | Copy kubelet binary from download dir] ********************************
Thursday 17 September 2020 10:50:15 +0800 (0:00:00.483) 0:04:44.086 ****
ok: [gpu3.test.bbdops.com -> 10.28.100.165]
ok: [gpu2 -> 10.28.100.162]
TASK [kubernetes/node : install | Set kubelet binary permissions] ***************************************
Thursday 17 September 2020 10:50:16 +0800 (0:00:00.559) 0:04:44.646 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : install | Copy hyperkube binary from download dir] ******************************
Thursday 17 September 2020 10:50:16 +0800 (0:00:00.484) 0:04:45.131 ****
ok: [gpu3.test.bbdops.com -> 10.28.100.165]
ok: [gpu2 -> 10.28.100.162]
TASK [kubernetes/node : install | Set hyperkube binary permissions] *************************************
Thursday 17 September 2020 10:50:17 +0800 (0:00:00.547) 0:04:45.679 ****
ok: [gpu3.test.bbdops.com]
ok: [gpu2]
TASK [kubernetes/node : install | Copy socat wrapper for Container Linux] *******************************
Thursday 17 September 2020 10:50:17 +0800 (0:00:00.442) 0:04:46.121 ****
TASK [kubernetes/node : haproxy | Cleanup potentially deployed haproxy] *********************************
Thursday 17 September 2020 10:50:17 +0800 (0:00:00.149) 0:04:46.271 ****
ok: [gpu2]
TASK [kubernetes/node : nginx-proxy | Make nginx directory] *********************************************
Thursday 17 September 2020 10:50:18 +0800 (0:00:00.420) 0:04:46.692 ****
ok: [gpu2]
TASK [kubernetes/node : nginx-proxy | Write nginx-proxy configuration] **********************************
Thursday 17 September 2020 10:50:18 +0800 (0:00:00.432) 0:04:47.124 ****
ok: [gpu2]
TASK [kubernetes/node : nginx-proxy | Get checksum from config] *****************************************
Thursday 17 September 2020 10:50:19 +0800 (0:00:00.909) 0:04:48.034 ****
ok: [gpu2]
TASK [kubernetes/node : nginx-proxy | Write static pod] *************************************************
Thursday 17 September 2020 10:50:19 +0800 (0:00:00.451) 0:04:48.486 ****
changed: [gpu2]
TASK [kubernetes/node : haproxy | Cleanup potentially deployed nginx-proxy] *****************************
Thursday 17 September 2020 10:50:20 +0800 (0:00:00.680) 0:04:49.166 ****
TASK [kubernetes/node : haproxy | Make haproxy directory] ***********************************************
Thursday 17 September 2020 10:50:20 +0800 (0:00:00.158) 0:04:49.325 ****
TASK [kubernetes/node : haproxy | Write haproxy configuration] ******************************************
Thursday 17 September 2020 10:50:20 +0800 (0:00:00.156) 0:04:49.481 ****
TASK [kubernetes/node : haproxy | Get checksum from config] *********************************************
Thursday 17 September 2020 10:50:21 +0800 (0:00:00.158) 0:04:49.639 ****
TASK [kubernetes/node : haproxy | Write static pod] *****************************************************
Thursday 17 September 2020 10:50:21 +0800 (0:00:00.156) 0:04:49.796 ****
TASK [kubernetes/node : Ensure nodePort range is reserved] **********************************************
Thursday 17 September 2020 10:50:21 +0800 (0:00:00.151) 0:04:49.948 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : Verify if br_netfilter module exists] *******************************************
Thursday 17 September 2020 10:50:22 +0800 (0:00:00.552) 0:04:50.501 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : Verify br_netfilter module path exists] *****************************************
Thursday 17 September 2020 10:50:22 +0800 (0:00:00.454) 0:04:50.956 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : Enable br_netfilter module] *****************************************************
Thursday 17 September 2020 10:50:22 +0800 (0:00:00.445) 0:04:51.402 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : Persist br_netfilter module] ****************************************************
Thursday 17 September 2020 10:50:23 +0800 (0:00:00.656) 0:04:52.058 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : Check if bridge-nf-call-iptables key exists] ************************************
Thursday 17 September 2020 10:50:24 +0800 (0:00:00.720) 0:04:52.779 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : Enable bridge-nf-call tables] ***************************************************
Thursday 17 September 2020 10:50:24 +0800 (0:00:00.430) 0:04:53.210 ****
ok: [gpu2] => (item=net.bridge.bridge-nf-call-iptables)
ok: [gpu3.test.bbdops.com] => (item=net.bridge.bridge-nf-call-iptables)
ok: [gpu2] => (item=net.bridge.bridge-nf-call-arptables)
ok: [gpu3.test.bbdops.com] => (item=net.bridge.bridge-nf-call-arptables)
ok: [gpu2] => (item=net.bridge.bridge-nf-call-ip6tables)
ok: [gpu3.test.bbdops.com] => (item=net.bridge.bridge-nf-call-ip6tables)
TASK [kubernetes/node : Modprode Kernel Module for IPVS] ************************************************
Thursday 17 September 2020 10:50:27 +0800 (0:00:03.070) 0:04:56.280 ****
ok: [gpu2] => (item=ip_vs)
ok: [gpu3.test.bbdops.com] => (item=ip_vs)
ok: [gpu2] => (item=ip_vs_rr)
ok: [gpu3.test.bbdops.com] => (item=ip_vs_rr)
ok: [gpu2] => (item=ip_vs_wrr)
ok: [gpu2] => (item=ip_vs_sh)
ok: [gpu3.test.bbdops.com] => (item=ip_vs_wrr)
ok: [gpu3.test.bbdops.com] => (item=ip_vs_sh)
TASK [kubernetes/node : Modprobe nf_conntrack_ipv4 for kernels < 4.19] **********************************
Thursday 17 September 2020 10:50:29 +0800 (0:00:01.249) 0:04:57.529 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : Modprobe nf_conntrack for kernels >= 4.19] **************************************
Thursday 17 September 2020 10:50:29 +0800 (0:00:00.567) 0:04:58.096 ****
TASK [kubernetes/node : Persist ip_vs modules] **********************************************************
Thursday 17 September 2020 10:50:29 +0800 (0:00:00.152) 0:04:58.249 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : Purge proxy manifest for kubeadm or if proxy services being provided by other means, e.g. network_plugin] ***
Thursday 17 September 2020 10:50:30 +0800 (0:00:00.856) 0:04:59.105 ****
TASK [kubernetes/node : Cleanup kube-proxy leftovers from node] *****************************************
Thursday 17 September 2020 10:50:30 +0800 (0:00:00.151) 0:04:59.257 ****
TASK [kubernetes/node : include_tasks] ******************************************************************
Thursday 17 September 2020 10:50:30 +0800 (0:00:00.141) 0:04:59.398 ****
TASK [kubernetes/node : Write cacert file] **************************************************************
Thursday 17 September 2020 10:50:31 +0800 (0:00:00.151) 0:04:59.550 ****
TASK [kubernetes/node : Write cloud-config] *************************************************************
Thursday 17 September 2020 10:50:31 +0800 (0:00:00.151) 0:04:59.701 ****
TASK [kubernetes/node : Make sure dynamic kubelet configuration directory is writeable] *****************
Thursday 17 September 2020 10:50:31 +0800 (0:00:00.151) 0:04:59.852 ****
TASK [kubernetes/node : gets the kubeadm version] *******************************************************
Thursday 17 September 2020 10:50:31 +0800 (0:00:00.174) 0:05:00.027 ****
changed: [gpu2]
changed: [gpu3.test.bbdops.com]
TASK [kubernetes/node : sets kubeadm api version to v1beta1] ********************************************
Thursday 17 September 2020 10:50:32 +0800 (0:00:00.566) 0:05:00.594 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : Write kubelet environment config file (kubeadm)] ********************************
Thursday 17 September 2020 10:50:32 +0800 (0:00:00.359) 0:05:00.954 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : Write kubelet config file] ******************************************************
Thursday 17 September 2020 10:50:33 +0800 (0:00:01.025) 0:05:01.979 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : Write kubelet systemd init file] ************************************************
Thursday 17 September 2020 10:50:34 +0800 (0:00:00.845) 0:05:02.825 ****
ok: [gpu2]
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/node : Enable kubelet] *****************************************************************
Thursday 17 September 2020 10:50:35 +0800 (0:00:00.814) 0:05:03.640 ****
changed: [gpu2]
changed: [gpu3.test.bbdops.com]
PLAY [kube-master] **************************************************************************************
TASK [download : prep_download | Set a few facts] *******************************************************
Thursday 17 September 2020 10:50:35 +0800 (0:00:00.675) 0:05:04.315 ****
TASK [download : Set image info command for containerd] *************************************************
Thursday 17 September 2020 10:50:35 +0800 (0:00:00.094) 0:05:04.410 ****
TASK [download : Register docker images info] ***********************************************************
Thursday 17 September 2020 10:50:36 +0800 (0:00:00.091) 0:05:04.502 ****
TASK [download : prep_download | Create staging directory on remote node] *******************************
Thursday 17 September 2020 10:50:36 +0800 (0:00:00.087) 0:05:04.589 ****
TASK [download : prep_download | Create local cache for files and images] *******************************
Thursday 17 September 2020 10:50:36 +0800 (0:00:00.088) 0:05:04.678 ****
TASK [download : prep_download | On localhost, check if passwordless root is possible] ******************
Thursday 17 September 2020 10:50:36 +0800 (0:00:00.093) 0:05:04.771 ****
TASK [download : prep_download | On localhost, check if user has access to docker without using sudo] ***
Thursday 17 September 2020 10:50:36 +0800 (0:00:00.107) 0:05:04.879 ****
TASK [download : prep_download | Parse the outputs of the previous commands] ****************************
Thursday 17 September 2020 10:50:36 +0800 (0:00:00.096) 0:05:04.976 ****
TASK [download : prep_download | Check that local user is in group or can become root] ******************
Thursday 17 September 2020 10:50:36 +0800 (0:00:00.093) 0:05:05.069 ****
TASK [download : Use cri-o for cri connection] **********************************************************
Thursday 17 September 2020 10:50:36 +0800 (0:00:00.091) 0:05:05.161 ****
TASK [download : Use containerd for cri connetion] ******************************************************
Thursday 17 September 2020 10:50:36 +0800 (0:00:00.083) 0:05:05.244 ****
TASK [download : Use docker for cri connetion] **********************************************************
Thursday 17 September 2020 10:50:36 +0800 (0:00:00.090) 0:05:05.335 ****
ok: [gpu3.test.bbdops.com]
TASK [download : include_tasks] *************************************************************************
Thursday 17 September 2020 10:50:37 +0800 (0:00:00.215) 0:05:05.550 ****
TASK [download : download | Get kubeadm binary and list of required images] *****************************
Thursday 17 September 2020 10:50:37 +0800 (0:00:00.113) 0:05:05.664 ****
TASK [download : download | Download files / images] ****************************************************
Thursday 17 September 2020 10:50:37 +0800 (0:00:00.096) 0:05:05.760 ****
TASK [download : download | Sync files / images from ansible host to nodes] *****************************
Thursday 17 September 2020 10:50:38 +0800 (0:00:00.963) 0:05:06.723 ****
TASK [kubespray-defaults : Configure defaults] **********************************************************
Thursday 17 September 2020 10:50:39 +0800 (0:00:00.964) 0:05:07.687 ****
ok: [gpu3.test.bbdops.com] => {
"msg": "Check roles/kubespray-defaults/defaults/main.yml"
}
TASK [kubernetes/tokens : Check_tokens | check if the tokens have already been generated on first master] ***
Thursday 17 September 2020 10:50:39 +0800 (0:00:00.257) 0:05:07.945 ****
TASK [kubernetes/tokens : Check_tokens | Set default value for 'sync_tokens' and 'gen_tokens' to false] ***
Thursday 17 September 2020 10:50:39 +0800 (0:00:00.088) 0:05:08.033 ****
TASK [kubernetes/tokens : Check_tokens | Set 'sync_tokens' and 'gen_tokens' to true] ********************
Thursday 17 September 2020 10:50:39 +0800 (0:00:00.097) 0:05:08.131 ****
TASK [kubernetes/tokens : Check tokens | check if a cert already exists] ********************************
Thursday 17 September 2020 10:50:39 +0800 (0:00:00.088) 0:05:08.219 ****
TASK [kubernetes/tokens : Check_tokens | Set 'sync_tokens' to true] *************************************
Thursday 17 September 2020 10:50:39 +0800 (0:00:00.087) 0:05:08.307 ****
TASK [kubernetes/tokens : Make sure the tokens directory exits] *****************************************
Thursday 17 September 2020 10:50:39 +0800 (0:00:00.087) 0:05:08.395 ****
TASK [kubernetes/tokens : Gen_tokens | copy tokens generation script] ***********************************
Thursday 17 September 2020 10:50:40 +0800 (0:00:00.137) 0:05:08.533 ****
TASK [kubernetes/tokens : Gen_tokens | generate tokens for master components] ***************************
Thursday 17 September 2020 10:50:40 +0800 (0:00:00.169) 0:05:08.702 ****
TASK [kubernetes/tokens : Gen_tokens | generate tokens for node components] *****************************
Thursday 17 September 2020 10:50:40 +0800 (0:00:00.152) 0:05:08.855 ****
TASK [kubernetes/tokens : Gen_tokens | Get list of tokens from first master] ****************************
Thursday 17 September 2020 10:50:40 +0800 (0:00:00.165) 0:05:09.020 ****
TASK [kubernetes/tokens : Gen_tokens | Gather tokens] ***************************************************
Thursday 17 September 2020 10:50:40 +0800 (0:00:00.143) 0:05:09.163 ****
TASK [kubernetes/tokens : Gen_tokens | Copy tokens on masters] ******************************************
Thursday 17 September 2020 10:50:40 +0800 (0:00:00.102) 0:05:09.266 ****
TASK [kubernetes/master : Pre-upgrade | Delete master manifests if etcd secrets changed] ****************
Thursday 17 September 2020 10:50:40 +0800 (0:00:00.088) 0:05:09.355 ****
TASK [kubernetes/master : Pre-upgrade | Delete master containers forcefully] ****************************
Thursday 17 September 2020 10:50:40 +0800 (0:00:00.129) 0:05:09.484 ****
TASK [kubernetes/master : Make sure the users directory exits] ******************************************
Thursday 17 September 2020 10:50:41 +0800 (0:00:00.129) 0:05:09.613 ****
TASK [kubernetes/master : Populate users for basic auth in API] *****************************************
Thursday 17 September 2020 10:50:41 +0800 (0:00:00.090) 0:05:09.704 ****
TASK [kubernetes/master : Create webhook token auth config] *********************************************
Thursday 17 September 2020 10:50:41 +0800 (0:00:00.097) 0:05:09.801 ****
TASK [kubernetes/master : Check if secret for encrypting data at rest already exist] ********************
Thursday 17 September 2020 10:50:41 +0800 (0:00:00.092) 0:05:09.894 ****
TASK [kubernetes/master : Slurp secrets_encryption file if it exists] ***********************************
Thursday 17 September 2020 10:50:41 +0800 (0:00:00.087) 0:05:09.982 ****
TASK [kubernetes/master : Base 64 Decode slurped secrets_encryption.yaml file] **************************
Thursday 17 September 2020 10:50:41 +0800 (0:00:00.092) 0:05:10.074 ****
TASK [kubernetes/master : Extract secret value from secrets_encryption.yaml] ****************************
Thursday 17 September 2020 10:50:41 +0800 (0:00:00.082) 0:05:10.157 ****
TASK [kubernetes/master : Set kube_encrypt_token across master nodes] ***********************************
Thursday 17 September 2020 10:50:41 +0800 (0:00:00.145) 0:05:10.303 ****
TASK [kubernetes/master : Write secrets for encrypting secret data at rest] *****************************
Thursday 17 September 2020 10:50:41 +0800 (0:00:00.090) 0:05:10.393 ****
TASK [kubernetes/master : Install | Copy kubectl binary from download dir] ******************************
Thursday 17 September 2020 10:50:42 +0800 (0:00:00.146) 0:05:10.540 ****
ok: [gpu3.test.bbdops.com -> 10.28.100.165]
TASK [kubernetes/master : install | Set kubectl binary permissions] *************************************
Thursday 17 September 2020 10:50:42 +0800 (0:00:00.545) 0:05:11.086 ****
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/master : Install kubectl bash completion] **********************************************
Thursday 17 September 2020 10:50:43 +0800 (0:00:00.418) 0:05:11.505 ****
changed: [gpu3.test.bbdops.com]
TASK [kubernetes/master : Set kubectl bash completion file permissions] *********************************
Thursday 17 September 2020 10:50:43 +0800 (0:00:00.627) 0:05:12.132 ****
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/master : Disable SecurityContextDeny admission-controller and enable PodSecurityPolicy] ***
Thursday 17 September 2020 10:50:44 +0800 (0:00:00.427) 0:05:12.559 ****
TASK [kubernetes/master : kubeadm | Check if old apiserver cert exists on host] *************************
Thursday 17 September 2020 10:50:44 +0800 (0:00:00.146) 0:05:12.706 ****
ok: [gpu3.test.bbdops.com -> 10.28.100.165]
TASK [kubernetes/master : Copy old certs to the kubeadm expected path] **********************************
Thursday 17 September 2020 10:50:44 +0800 (0:00:00.416) 0:05:13.122 ****
TASK [kubernetes/master : Install OIDC certificate] *****************************************************
Thursday 17 September 2020 10:50:44 +0800 (0:00:00.250) 0:05:13.373 ****
TASK [kubernetes/master : kubeadm | Check serviceaccount key] *******************************************
Thursday 17 September 2020 10:50:44 +0800 (0:00:00.096) 0:05:13.469 ****
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/master : kubeadm | Check if kubeadm has already run] ***********************************
Thursday 17 September 2020 10:50:45 +0800 (0:00:00.416) 0:05:13.885 ****
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/master : kubeadm | Delete old admin.conf] **********************************************
Thursday 17 September 2020 10:50:45 +0800 (0:00:00.416) 0:05:14.301 ****
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/master : kubeadm | Delete old static pods] *********************************************
Thursday 17 September 2020 10:50:46 +0800 (0:00:00.421) 0:05:14.723 ****
TASK [kubernetes/master : kubeadm | Forcefully delete old static pods] **********************************
Thursday 17 September 2020 10:50:46 +0800 (0:00:00.139) 0:05:14.863 ****
TASK [kubernetes/master : kubeadm | aggregate all SANs] *************************************************
Thursday 17 September 2020 10:50:46 +0800 (0:00:00.116) 0:05:14.979 ****
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/master : Create audit-policy directory] ************************************************
Thursday 17 September 2020 10:50:47 +0800 (0:00:00.589) 0:05:15.569 ****
TASK [kubernetes/master : Write api audit policy yaml] **************************************************
Thursday 17 September 2020 10:50:47 +0800 (0:00:00.090) 0:05:15.660 ****
TASK [kubernetes/master : set kubeadm_config_api_fqdn define] *******************************************
Thursday 17 September 2020 10:50:47 +0800 (0:00:00.087) 0:05:15.748 ****
TASK [kubernetes/master : gets the kubeadm version] *****************************************************
Thursday 17 September 2020 10:50:47 +0800 (0:00:00.090) 0:05:15.839 ****
changed: [gpu3.test.bbdops.com]
TASK [kubernetes/master : sets kubeadm api version to v1beta1] ******************************************
Thursday 17 September 2020 10:50:47 +0800 (0:00:00.472) 0:05:16.312 ****
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/master : sets kubeadm api version to v1beta2] ******************************************
Thursday 17 September 2020 10:50:48 +0800 (0:00:00.204) 0:05:16.516 ****
ok: [gpu3.test.bbdops.com]
TASK [kubernetes/master : kubeadm | Create kubeadm config] **********************************************
Thursday 17 September 2020 10:50:48 +0800 (0:00:00.205) 0:05:16.721 ****
changed: [gpu3.test.bbdops.com]
TASK [kubernetes/master : Backup old certs and keys] ****************************************************
Thursday 17 September 2020 10:50:49 +0800 (0:00:01.395) 0:05:18.117 ****
TASK [kubernetes/master : kubeadm | Initialize first master] ********************************************
Thursday 17 September 2020 10:50:49 +0800 (0:00:00.186) 0:05:18.303 ****
FAILED - RETRYING: kubeadm | Initialize first master (3 retries left).
Hi Zhiyuan,
From the kubelet service journal log below, it looks like kube-apiserver is not running. I do have its docker image, but the container never runs. Thanks.
Sep 17 11:26:56 gpu3.test.bbdops.com kubelet[32353]: I0917 11:26:56.746230 32353 kubelet.go:307] Watching apiserver
Sep 17 11:26:56 gpu3.test.bbdops.com kubelet[32353]: E0917 11:26:56.747305 32353 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:454: Failed to list *v1.Node: Get https://10.28.100.165:6443/api/v1/nodes?fieldSelector=metadata.name%3Dgpu3.test.bbdops.com&limit=500&resourceVersion=0: dial tcp 10.28.100.165:6443: connect: connection refused
Sep 17 11:26:56 gpu3.test.bbdops.com kubelet[32353]: E0917 11:26:56.747306 32353 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:445: Failed to list *v1.Service: Get https://10.28.100.165:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.28.100.165:6443: connect: connection refused
Sep 17 11:26:56 gpu3.test.bbdops.com kubelet[32353]: E0917 11:26:56.747305 32353 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.28.100.165:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dgpu3.test.bbdops.com&limit=500&resourceVersion=0: dial tcp 10.28.100.165:6443: connect: connection refused
Sep 17 11:26:56 gpu3.test.bbdops.com kubelet[32353]: I0917 11:26:56.748726 32353 client.go:75] Connecting to docker on unix:///var/run/docker.sock
Sep 17 11:26:56 gpu3.test.bbdops.com kubelet[32353]: I0917 11:26:56.748750 32353 client.go:104] Start docker client with request timeout=2m0s
I found exactly the same problem here: https://github.com/kubernetes/kubernetes/issues/79046
Kubernetes 1.14.x adds a pid limit, which requires kernel support. I'm afraid the Linux kernel on your master node doesn't support pid control in cgroups. To confirm, could you please paste the result of cat /proc/cgroups on your master node?
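As a sketch, that check can also be scripted. Nothing here is OpenPAI-specific; it only reads /proc/cgroups, which any Linux kernel exposes:

```shell
# kubelet >= 1.14 sets per-pod pid limits, which needs the "pids" cgroup
# controller. /proc/cgroups lists one controller per line, so checking
# support is a one-line grep. The file argument is optional, to allow
# testing against a sample file.
has_pids_cgroup() {
    grep -qw '^pids' "${1:-/proc/cgroups}"
}

if has_pids_cgroup; then
    echo "pids cgroup controller present"
else
    echo "pids cgroup controller missing: kubelet 1.14+ cannot start pods"
fi
```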
Amazing, if it is a kernel issue... that will be super easy to fix, in my opinion.
$ cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 6 39 1
cpu 2 136 1
cpuacct 2 136 1
blkio 10 136 1
memory 4 427 1
devices 8 136 1
freezer 3 39 1
net_cls 5 39 1
perf_event 9 39 1
net_prio 5 39 1
hugetlb 7 39 1
You can see there is no pids cgroup entry in that output (on a supported kernel, /proc/cgroups includes a line for pids).
I suggest:
1. On the dev box machine, go to ~/pai-deploy/kubespray and run ansible-playbook -i inventory/pai/hosts.yml reset.yml --become --become-user=root -e "@inventory/pai/openpai.yml" to remove the current broken k8s completely.
2. Fix this pids cgroup issue on all your master and worker nodes.
3. Retry the installation.
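The steps above could be run roughly as follows. This is only a sketch: the paths (~/pai-deploy/kubespray, inventory/pai) and the quick-start-kubespray.sh entry point are the ones already used in this thread, and the inventory must match your own deployment. It is guarded so it is a no-op on a machine without the checkout:

```shell
# Sketch of the suggested recovery, assuming the default OpenPAI layout.
KUBESPRAY_DIR="$HOME/pai-deploy/kubespray"
if [ -d "$KUBESPRAY_DIR" ]; then
    cd "$KUBESPRAY_DIR" || exit 1
    # 1. Tear down the broken cluster completely.
    ansible-playbook -i inventory/pai/hosts.yml reset.yml \
        --become --become-user=root -e "@inventory/pai/openpai.yml"
    # 2. Verify every node exposes the pids cgroup controller; fix the
    #    kernel on any node where this task prints nothing.
    ansible all -i inventory/pai/hosts.yml --become \
        -m shell -a 'grep -w pids /proc/cgroups'
    # 3. Retry the installation once all nodes pass.
    /bin/bash quick-start-kubespray.sh
else
    echo "no kubespray checkout at $KUBESPRAY_DIR; nothing to do"
fi
```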
Thanks! Will do. This is the root cause, right? I have a question, just out of curiosity: from my point of view, it looks like kubelet first can't connect to the apiserver and only then hits the cgroup issue. Could you tell me some details about this? Thank you very much!
In my opinion, the API server connection problem is not the root cause. Sometimes the api server is down and kubelet will just try to reconnect periodically; in other words, an api server connection problem won't cause a fatal error in kubelet. Of course, maybe I'm wrong, and you can dive into Kubernetes' source code to find out more :-)
Thank you! Super appreciated! Plus I'm not good at swimming in the code sea :D
@123toorc hello, do you know how to fix the pids cgroup issue? I met the same situation...
Yes, kindly update your kernel, or enable the feature if your kernel is recent. The pids feature was introduced in vanilla kernel >= 4.3.
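To decide between "upgrade" and "enable", two quick checks help: the running kernel version, and whether CONFIG_CGROUP_PIDS was compiled in. A sketch, assuming the distro ships its kernel config under /boot (typical for Ubuntu/CentOS; some distros expose /proc/config.gz instead):

```shell
# Check whether the pids cgroup controller was compiled into the kernel.
pids_config_enabled() {
    grep -q '^CONFIG_CGROUP_PIDS=y' "$1"
}

uname -r
if pids_config_enabled "/boot/config-$(uname -r)" 2>/dev/null; then
    echo "CONFIG_CGROUP_PIDS=y: feature compiled in"
else
    echo "CONFIG_CGROUP_PIDS not set: a kernel upgrade or rebuild is needed"
fi
```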
@hzy46 @123toorc Hello both. I found it doesn't seem to be kernel-related: checking /proc/cgroups on CentOS 7.2, 7.6, and 7.8, most machines whose kernels were never upgraded do have pids. So far only one CentOS 7.8 machine lacks pids, and I haven't found the reason yet.
You need to look at the Linux kernel config, /boot/config-xxxxx or /proc/config.gz, and check CONFIG_CGROUP_PIDS; it may not be enabled.
If the /proc interface is missing, the kernel most likely doesn't support this feature.
# grep PIDS /boot/config-3.10.0-1062.el7.x86_64
CONFIG_CGROUP_PIDS=y
Can I enable it by adding this line manually?
Eh, your kernel is 3.10? No, wait, CentOS 7.8 is quite new. I'm not very familiar with CentOS; you need to check the kernel version. If the option is missing from the config, the cgroup feature may not have been enabled at build time; otherwise it would show as "# CONFIG_CGROUP_PIDS is not set". Adding it manually requires recompiling the kernel.
CentOS backports patches: its kernel is maintained by Red Hat itself and doesn't match the upstream kernel tags on GitHub. CentOS 7 always ships a 3.10.x kernel, with a config file like the one above. Now I see: the kernel on my machine had been upgraded by someone, and that build lacks CONFIG_CGROUP_PIDS.
root@fxkj:/boot# cat config-5.4.0-42-generic | grep CONFIG_CGROUP_PIDS
CONFIG_CGROUP_PIDS=y
root@fxkj:/boot# cat /proc/cgroups
cpuset 5 14 1
cpu 4 147 1
cpuacct 4 147 1
blkio 8 147 1
memory 7 286 1
devices 9 147 1
freezer 6 14 1
net_cls 10 14 1
perf_event 3 14 1
net_prio 10 14 1
hugetlb 11 14 1
pids 2 192 1
rdma 12 1 1
@123toorc My pids cgroup is enabled, but I still hit the same problem. How should I solve it? Thanks!
/bin/bash quick-start-kubespray.sh reports:
TASK [kubernetes/master : kubeadm | Initialize first master] ***
Saturday 09 October 2021 14:54:46 +0800 (0:00:00.106) 0:03:29.453 **
FAILED - RETRYING: kubeadm | Initialize first master (3 retries left).
FAILED - RETRYING: kubeadm | Initialize first master (2 retries left).
FAILED - RETRYING: kubeadm | Initialize first master (1 retries left).
fatal: [pai-master]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["timeout", "-k", "300s", "300s", "/usr/local/bin/kubeadm", "init", "--config=/etc/kubernetes/kubeadm-config.yaml", "--ignore-preflight-errors=all", "--skip-phases=addon/coredns", "--upload-certs"], "delta": "0:05:00.008053", "end": "2021-10-09 15:15:02.734350", "failed_when_result": true, "msg": "non-zero return code", "rc": 124, "start": "2021-10-09 15:10:02.726297", "stderr"
Try running that command manually and check its output.
Try running that command manually and check its output.
I already found the problem: some of the images in the registry configuration were never pulled, so some containers did not start in order. The error message is completely misleading; it cost me a whole day.
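For anyone hitting the same symptom, one way to catch this "image never pulled" failure mode before kubeadm times out is to compare the images kubeadm expects against the local docker cache. A sketch: kubeadm config images list is a standard kubeadm subcommand, and the config path is the one from the error output above; the command must run on the master node:

```shell
# List the images kubeadm needs for this cluster config and report any
# that are not already present in the local docker image cache.
if command -v kubeadm >/dev/null 2>&1 && command -v docker >/dev/null 2>&1; then
    kubeadm config images list --config /etc/kubernetes/kubeadm-config.yaml |
    while read -r image; do
        docker image inspect "$image" >/dev/null 2>&1 \
            || echo "missing locally: $image"
    done
else
    echo "kubeadm/docker not on PATH; run this on the master node"
fi
```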
Organization Name: BBD
Short summary about the issue/question: The quick-start-kubespray.sh script failed at initializing the first master.
How to reproduce it: I have checked the kubelet service and etcd service. The Kubernetes-related Docker images were not all downloaded by the script; instead, I downloaded some of them manually.
OpenPAI Environment:
Kernel (uname -a): 4.2.0-42-generic
Anything else we need to know: I pasted the log of the kubelet service above.
This is the output of docker images. The apiserver image is present but never runs during kubeadm init of the first master; I guess this is the issue.
This is the config file overview. Please help me, thanks!
PLEASE HELP ME! THANK YOU VERY MUCH!