microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

[Kubeadm] Initialize master failed #4909

Closed 123toorc closed 4 years ago

123toorc commented 4 years ago

Organization Name: BBD

Short summary about the issue/question: The quick-start-kubernetes script failed while initializing the first master.

How to reproduce it: I have checked the kubelet and etcd services. The Kubernetes-related Docker images were not downloaded by the script, so I downloaded some of them manually.
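
For reference, the service checks were roughly the following (a sketch; exact commands assumed):

$ systemctl status kubelet                      # kubelet runs as a systemd service
$ sudo journalctl -u kubelet --no-pager -n 100  # recent kubelet log, as pasted below
$ docker ps --filter name=etcd                  # etcd runs as a Docker container here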

OpenPAI Environment:

Anything else we need to know: here is the log of the kubelet service:

Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.358902   19129 server.go:1081] Started kubelet
Sep 15 16:28:35  kubelet[19129]: E0915 16:28:35.358942   19129 kubelet.go:1294] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.358978   19129 server.go:145] Starting to listen on 10.28.100.165:10250
Sep 15 16:28:35  kubelet[19129]: E0915 16:28:35.359345   19129 event.go:249] Unable to write event: 'Post https://10.28.100.165:6443/api/v1/namespaces/default/events: dial tcp 10.28.100.165:6443: connect: connection refused' (may retry after sleeping)
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.360106   19129 fs_resource_analyzer.go:64] Starting FS ResourceAnalyzer
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.360147   19129 status_manager.go:152] Starting to sync pod status with apiserver
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.360166   19129 kubelet.go:1809] Starting kubelet main sync loop.
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.360220   19129 volume_manager.go:241] The desired_state_of_world populator starts
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.360233   19129 volume_manager.go:243] Starting Kubelet Volume Manager
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.360287   19129 kubelet.go:1826] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.360622   19129 desired_state_of_world_populator.go:130] Desired state populator starts to run
Sep 15 16:28:35  kubelet[19129]: E0915 16:28:35.360834   19129 controller.go:125] failed to ensure node lease exists, will retry in 200ms, error: Get https://10.28.100.165:6443/apis/coordination.k8s.io/v1beta1/namespaces/kube-node-lease/leases/?timeout=10s: dial tcp 10.28.100.165:6443: connect: connection refused
Sep 15 16:28:35  kubelet[19129]: E0915 16:28:35.361263   19129 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://10.28.100.165:6443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 10.28.100.165:6443: connect: connection refused
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.361523   19129 server.go:381] Adding debug handlers to kubelet server.
Sep 15 16:28:35  kubelet[19129]: E0915 16:28:35.361754   19129 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Get https://10.28.100.165:6443/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 10.28.100.165:6443: connect: connection refused
Sep 15 16:28:35  kubelet[19129]: E0915 16:28:35.362098   19129 kubelet.go:2173] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.380271   19129 factory.go:356] Registering Docker factory
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.380298   19129 factory.go:54] Registering systemd factory
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.380500   19129 factory.go:100] Registering Raw factory
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.380697   19129 manager.go:1159] Started watching for new ooms in manager
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.386648   19129 nvidia.go:100] NVML initialized. Number of nvidia devices: 1
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.397275   19129 manager.go:272] Starting recovery of all containers
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.427282   19129 manager.go:277] Recovery completed
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.460440   19129 kubelet.go:1826] skipping pod synchronization - container runtime status check may not have completed yet
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.460686   19129 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Sep 15 16:28:35  kubelet[19129]: E0915 16:28:35.460699   19129 kubelet.go:2252] node "" not found
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.461357   19129 setters.go:73] Using node IP: "10.28.100.165"
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.463688   19129 kubelet_node_status.go:471] Recording NodeHasSufficientMemory event message for node 
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.463733   19129 kubelet_node_status.go:471] Recording NodeHasNoDiskPressure event message for node 
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.463749   19129 kubelet_node_status.go:471] Recording NodeHasSufficientPID event message for node 
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.463786   19129 kubelet_node_status.go:72] Attempting to register node 
Sep 15 16:28:35  kubelet[19129]: E0915 16:28:35.464285   19129 kubelet_node_status.go:94] Unable to register node "" with API server: Post https://10.28.100.165:6443/api/v1/nodes: dial tcp 10.28.100.165:6443: connect: connection refused
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.510588   19129 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.510984   19129 setters.go:73] Using node IP: "10.28.100.165"
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.512745   19129 kubelet_node_status.go:471] Recording NodeHasSufficientMemory event message for node 
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.512768   19129 kubelet_node_status.go:471] Recording NodeHasNoDiskPressure event message for node 
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.512777   19129 kubelet_node_status.go:471] Recording NodeHasSufficientPID event message for node 
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.512806   19129 cpu_manager.go:155] [cpumanager] starting with none policy
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.512812   19129 cpu_manager.go:156] [cpumanager] reconciling every 10s
Sep 15 16:28:35  kubelet[19129]: I0915 16:28:35.512828   19129 policy_none.go:42] [cpumanager] none policy: Start
Sep 15 16:28:35  kubelet[19129]: F0915 16:28:35.513490   19129 kubelet.go:1372] Failed to start ContainerManager failed to initialize top level QOS containers: failed to update top level Burstable QOS cgroup : failed to set supported cgroup subsystems for cgroup [kubepods burstable]: Failed to find subsystem mount for required subsystem: pids
Sep 15 16:28:35  systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Sep 15 16:28:35  systemd[1]: kubelet.service: Unit entered failed state.
Sep 15 16:28:35  systemd[1]: kubelet.service: Failed with result 'exit-code'.
Sep 15 16:28:45  systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Sep 15 16:28:45  systemd[1]: Stopped Kubernetes Kubelet Server.

This is the output of docker images. These images were never run during kubeadm init on the first master; I guess this is the issue.

$ docker images
REPOSITORY                                                        TAG                 IMAGE ID            CREATED             SIZE
registry.aliyuncs.com/google_containers/kube-apiserver            v1.15.11            0eaa5e1d871a        6 months ago        207MB
registry.aliyuncs.com/google_containers/kube-proxy                v1.15.11            7cd3972af624        6 months ago        82.5MB
registry.aliyuncs.com/google_containers/kube-controller-manager   v1.15.11            4d53b9ec5d96        6 months ago        159MB
registry.aliyuncs.com/google_containers/kube-scheduler            v1.15.11            e671c2a84bb9        6 months ago        81.2MB
coredns/coredns                                                   1.6.0               680bc53e5985        13 months ago       42.2MB
calico/node                                                       v3.7.3              bf4ff15c9db0        15 months ago       156MB
calico/cni                                                        v3.7.3              1a6ade52d471        15 months ago       135MB
calico/kube-controllers                                           v3.7.3              283860d96794        15 months ago       46.8MB
nginx                                                             1.15                53f3fd8007f7        16 months ago       109MB
mirrorgooglecontainers/cluster-proportional-autoscaler-amd64      1.6.0               dfe4432cd2e2        16 months ago       47.7MB
registry.aliyuncs.com/google_containers/coredns                   1.3.1               eb516548c180        20 months ago       40.3MB
quay.io/coreos/etcd                                               v3.3.10             643c21638c1c        23 months ago       39.5MB
mirrorgooglecontainers/pause-amd64                                3.1                 da86e6ba6ca1        2 years ago         742kB
registry.aliyuncs.com/google_containers/pause                     3.1                 da86e6ba6ca1        2 years ago         742kB
$ docker ps -a
CONTAINER ID        IMAGE                         COMMAND                 CREATED             STATUS              PORTS               NAMES
7016ec2a8a4e        quay.io/coreos/etcd:v3.3.10   "/usr/local/bin/etcd"   24 minutes ago      Up 24 minutes                           etcd1
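
For what it's worth, one way to check whether kubeadm ever got far enough to start the control plane (a sketch; standard kubeadm paths assumed):

$ ls /etc/kubernetes/manifests/                   # kube-apiserver.yaml etc. would be written here
$ docker ps -a --filter name=k8s_kube-apiserver   # kubelet-started control-plane containers are named k8s_...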

Here is an overview of the config file. Please help me, thanks!

gcr_image_repo: "registry.aliyuncs.com/google_containers"
kube_image_repo: "registry.aliyuncs.com/google_containers"
kubeadm_download_url: "https://shaiictestblob01.blob.core.chinacloudapi.cn/share-all/kubeadm"
hyperkube_download_url: "https://shaiictestblob01.blob.core.chinacloudapi.cn/share-all/hyperkube"

openpai_kubespray_extra_var:
  pod_infra_image_repo: "registry.aliyuncs.com/google_containers/pause-{{ image_arch }}"
  dnsautoscaler_image_repo: "mirrorgooglecontainers/cluster-proportional-autoscaler-{{ image_arch }}"
  tiller_image_repo: "mirrorgooglecontainers/kubernetes-helm/tiller"
  registry_proxy_image_repo: "registry.aliyuncs.com/google_containers/kube-registry-proxy"
  metrics_server_image_repo: "mirrorgooglecontainers/metrics-server-amd64"
  addon_resizer_image_repo: "mirrorgooglecontainers/addon-resizer"
  dashboard_image_repo: "registry.aliyuncs.com/google_containers/kubernetes-dashboard-{{ image_arch }}"

Please help me! Thank you very much!

hzy46 commented 4 years ago

I think it is a problem with kubespray.

Could you please paste all logs from quick-start-kubespray.sh?

123toorc commented 4 years ago

Hi, thanks for the help. Here is the log; I've pasted the Kubernetes play part. The earlier part, covering etcd and Docker, looked correct to me, but if you still need it I'll be happy to provide it. Thanks!

TASK [kubespray-defaults : Configure defaults] **********************************************************
Thursday 17 September 2020  10:50:11 +0800 (0:00:01.023)       0:04:39.798 **** 
ok: [gpu3.test.bbdops.com] => {
    "msg": "Check roles/kubespray-defaults/defaults/main.yml"
}
ok: [gpu2] => {
    "msg": "Check roles/kubespray-defaults/defaults/main.yml"
}

TASK [kubernetes/node : look up docker cgroup driver] ***************************************************
Thursday 17 September 2020  10:50:11 +0800 (0:00:00.277)       0:04:40.075 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : set standalone_kubelet fact] ****************************************************
Thursday 17 September 2020  10:50:12 +0800 (0:00:00.573)       0:04:40.649 **** 
ok: [gpu3.test.bbdops.com]
ok: [gpu2]

TASK [kubernetes/node : set kubelet_cgroup_driver_detected fact for containerd] *************************
Thursday 17 September 2020  10:50:12 +0800 (0:00:00.264)       0:04:40.914 **** 

TASK [kubernetes/node : set kubelet_cgroup_driver_detected fact for other engines] **********************
Thursday 17 September 2020  10:50:12 +0800 (0:00:00.161)       0:04:41.075 **** 
ok: [gpu3.test.bbdops.com]
ok: [gpu2]

TASK [kubernetes/node : os specific vars] ***************************************************************
Thursday 17 September 2020  10:50:12 +0800 (0:00:00.267)       0:04:41.342 **** 

TASK [kubernetes/node : Pre-upgrade | check if kubelet container exists] ********************************
Thursday 17 September 2020  10:50:13 +0800 (0:00:00.180)       0:04:41.523 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : Pre-upgrade | copy /var/lib/cni from kubelet] ***********************************
Thursday 17 September 2020  10:50:13 +0800 (0:00:00.540)       0:04:42.064 **** 

TASK [kubernetes/node : Pre-upgrade | ensure kubelet container service is stopped if using host deployment] ***
Thursday 17 September 2020  10:50:13 +0800 (0:00:00.144)       0:04:42.209 **** 

TASK [kubernetes/node : Pre-upgrade | ensure kubelet container is removed if using host deployment] *****
Thursday 17 September 2020  10:50:13 +0800 (0:00:00.153)       0:04:42.363 **** 

TASK [kubernetes/node : Ensure /var/lib/cni exists] *****************************************************
Thursday 17 September 2020  10:50:14 +0800 (0:00:00.166)       0:04:42.529 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : install | Copy kubeadm binary from download dir] ********************************
Thursday 17 September 2020  10:50:14 +0800 (0:00:00.491)       0:04:43.021 **** 
ok: [gpu2 -> 10.28.100.162]

TASK [kubernetes/node : install | Set kubeadm binary permissions] ***************************************
Thursday 17 September 2020  10:50:15 +0800 (0:00:00.581)       0:04:43.603 **** 
ok: [gpu2]

TASK [kubernetes/node : install | Copy kubelet binary from download dir] ********************************
Thursday 17 September 2020  10:50:15 +0800 (0:00:00.483)       0:04:44.086 **** 
ok: [gpu3.test.bbdops.com -> 10.28.100.165]
ok: [gpu2 -> 10.28.100.162]

TASK [kubernetes/node : install | Set kubelet binary permissions] ***************************************
Thursday 17 September 2020  10:50:16 +0800 (0:00:00.559)       0:04:44.646 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : install | Copy hyperkube binary from download dir] ******************************
Thursday 17 September 2020  10:50:16 +0800 (0:00:00.484)       0:04:45.131 **** 
ok: [gpu3.test.bbdops.com -> 10.28.100.165]
ok: [gpu2 -> 10.28.100.162]

TASK [kubernetes/node : install | Set hyperkube binary permissions] *************************************
Thursday 17 September 2020  10:50:17 +0800 (0:00:00.547)       0:04:45.679 **** 
ok: [gpu3.test.bbdops.com]
ok: [gpu2]

TASK [kubernetes/node : install | Copy socat wrapper for Container Linux] *******************************
Thursday 17 September 2020  10:50:17 +0800 (0:00:00.442)       0:04:46.121 **** 

TASK [kubernetes/node : haproxy | Cleanup potentially deployed haproxy] *********************************
Thursday 17 September 2020  10:50:17 +0800 (0:00:00.149)       0:04:46.271 **** 
ok: [gpu2]

TASK [kubernetes/node : nginx-proxy | Make nginx directory] *********************************************
Thursday 17 September 2020  10:50:18 +0800 (0:00:00.420)       0:04:46.692 **** 
ok: [gpu2]

TASK [kubernetes/node : nginx-proxy | Write nginx-proxy configuration] **********************************
Thursday 17 September 2020  10:50:18 +0800 (0:00:00.432)       0:04:47.124 **** 
ok: [gpu2]

TASK [kubernetes/node : nginx-proxy | Get checksum from config] *****************************************
Thursday 17 September 2020  10:50:19 +0800 (0:00:00.909)       0:04:48.034 **** 
ok: [gpu2]

TASK [kubernetes/node : nginx-proxy | Write static pod] *************************************************
Thursday 17 September 2020  10:50:19 +0800 (0:00:00.451)       0:04:48.486 **** 
changed: [gpu2]

TASK [kubernetes/node : haproxy | Cleanup potentially deployed nginx-proxy] *****************************
Thursday 17 September 2020  10:50:20 +0800 (0:00:00.680)       0:04:49.166 **** 

TASK [kubernetes/node : haproxy | Make haproxy directory] ***********************************************
Thursday 17 September 2020  10:50:20 +0800 (0:00:00.158)       0:04:49.325 **** 

TASK [kubernetes/node : haproxy | Write haproxy configuration] ******************************************
Thursday 17 September 2020  10:50:20 +0800 (0:00:00.156)       0:04:49.481 **** 

TASK [kubernetes/node : haproxy | Get checksum from config] *********************************************
Thursday 17 September 2020  10:50:21 +0800 (0:00:00.158)       0:04:49.639 **** 

TASK [kubernetes/node : haproxy | Write static pod] *****************************************************
Thursday 17 September 2020  10:50:21 +0800 (0:00:00.156)       0:04:49.796 **** 

TASK [kubernetes/node : Ensure nodePort range is reserved] **********************************************
Thursday 17 September 2020  10:50:21 +0800 (0:00:00.151)       0:04:49.948 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : Verify if br_netfilter module exists] *******************************************
Thursday 17 September 2020  10:50:22 +0800 (0:00:00.552)       0:04:50.501 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : Verify br_netfilter module path exists] *****************************************
Thursday 17 September 2020  10:50:22 +0800 (0:00:00.454)       0:04:50.956 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : Enable br_netfilter module] *****************************************************
Thursday 17 September 2020  10:50:22 +0800 (0:00:00.445)       0:04:51.402 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : Persist br_netfilter module] ****************************************************
Thursday 17 September 2020  10:50:23 +0800 (0:00:00.656)       0:04:52.058 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : Check if bridge-nf-call-iptables key exists] ************************************
Thursday 17 September 2020  10:50:24 +0800 (0:00:00.720)       0:04:52.779 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : Enable bridge-nf-call tables] ***************************************************
Thursday 17 September 2020  10:50:24 +0800 (0:00:00.430)       0:04:53.210 **** 
ok: [gpu2] => (item=net.bridge.bridge-nf-call-iptables)
ok: [gpu3.test.bbdops.com] => (item=net.bridge.bridge-nf-call-iptables)
ok: [gpu2] => (item=net.bridge.bridge-nf-call-arptables)
ok: [gpu3.test.bbdops.com] => (item=net.bridge.bridge-nf-call-arptables)
ok: [gpu2] => (item=net.bridge.bridge-nf-call-ip6tables)
ok: [gpu3.test.bbdops.com] => (item=net.bridge.bridge-nf-call-ip6tables)

TASK [kubernetes/node : Modprode Kernel Module for IPVS] ************************************************
Thursday 17 September 2020  10:50:27 +0800 (0:00:03.070)       0:04:56.280 **** 
ok: [gpu2] => (item=ip_vs)
ok: [gpu3.test.bbdops.com] => (item=ip_vs)
ok: [gpu2] => (item=ip_vs_rr)
ok: [gpu3.test.bbdops.com] => (item=ip_vs_rr)
ok: [gpu2] => (item=ip_vs_wrr)
ok: [gpu2] => (item=ip_vs_sh)
ok: [gpu3.test.bbdops.com] => (item=ip_vs_wrr)
ok: [gpu3.test.bbdops.com] => (item=ip_vs_sh)

TASK [kubernetes/node : Modprobe nf_conntrack_ipv4 for kernels < 4.19] **********************************
Thursday 17 September 2020  10:50:29 +0800 (0:00:01.249)       0:04:57.529 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : Modprobe nf_conntrack for kernels >= 4.19] **************************************
Thursday 17 September 2020  10:50:29 +0800 (0:00:00.567)       0:04:58.096 **** 

TASK [kubernetes/node : Persist ip_vs modules] **********************************************************
Thursday 17 September 2020  10:50:29 +0800 (0:00:00.152)       0:04:58.249 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : Purge proxy manifest for kubeadm or if proxy services being provided by other means, e.g. network_plugin] ***
Thursday 17 September 2020  10:50:30 +0800 (0:00:00.856)       0:04:59.105 **** 

TASK [kubernetes/node : Cleanup kube-proxy leftovers from node] *****************************************
Thursday 17 September 2020  10:50:30 +0800 (0:00:00.151)       0:04:59.257 **** 

TASK [kubernetes/node : include_tasks] ******************************************************************
Thursday 17 September 2020  10:50:30 +0800 (0:00:00.141)       0:04:59.398 **** 

TASK [kubernetes/node : Write cacert file] **************************************************************
Thursday 17 September 2020  10:50:31 +0800 (0:00:00.151)       0:04:59.550 **** 

TASK [kubernetes/node : Write cloud-config] *************************************************************
Thursday 17 September 2020  10:50:31 +0800 (0:00:00.151)       0:04:59.701 **** 

TASK [kubernetes/node : Make sure dynamic kubelet configuration directory is writeable] *****************
Thursday 17 September 2020  10:50:31 +0800 (0:00:00.151)       0:04:59.852 **** 

TASK [kubernetes/node : gets the kubeadm version] *******************************************************
Thursday 17 September 2020  10:50:31 +0800 (0:00:00.174)       0:05:00.027 **** 
changed: [gpu2]
changed: [gpu3.test.bbdops.com]

TASK [kubernetes/node : sets kubeadm api version to v1beta1] ********************************************
Thursday 17 September 2020  10:50:32 +0800 (0:00:00.566)       0:05:00.594 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : Write kubelet environment config file (kubeadm)] ********************************
Thursday 17 September 2020  10:50:32 +0800 (0:00:00.359)       0:05:00.954 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : Write kubelet config file] ******************************************************
Thursday 17 September 2020  10:50:33 +0800 (0:00:01.025)       0:05:01.979 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : Write kubelet systemd init file] ************************************************
Thursday 17 September 2020  10:50:34 +0800 (0:00:00.845)       0:05:02.825 **** 
ok: [gpu2]
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/node : Enable kubelet] *****************************************************************
Thursday 17 September 2020  10:50:35 +0800 (0:00:00.814)       0:05:03.640 **** 
changed: [gpu2]
changed: [gpu3.test.bbdops.com]

PLAY [kube-master] **************************************************************************************

TASK [download : prep_download | Set a few facts] *******************************************************
Thursday 17 September 2020  10:50:35 +0800 (0:00:00.675)       0:05:04.315 **** 

TASK [download : Set image info command for containerd] *************************************************
Thursday 17 September 2020  10:50:35 +0800 (0:00:00.094)       0:05:04.410 **** 

TASK [download : Register docker images info] ***********************************************************
Thursday 17 September 2020  10:50:36 +0800 (0:00:00.091)       0:05:04.502 **** 

TASK [download : prep_download | Create staging directory on remote node] *******************************
Thursday 17 September 2020  10:50:36 +0800 (0:00:00.087)       0:05:04.589 **** 

TASK [download : prep_download | Create local cache for files and images] *******************************
Thursday 17 September 2020  10:50:36 +0800 (0:00:00.088)       0:05:04.678 **** 

TASK [download : prep_download | On localhost, check if passwordless root is possible] ******************
Thursday 17 September 2020  10:50:36 +0800 (0:00:00.093)       0:05:04.771 **** 

TASK [download : prep_download | On localhost, check if user has access to docker without using sudo] ***
Thursday 17 September 2020  10:50:36 +0800 (0:00:00.107)       0:05:04.879 **** 

TASK [download : prep_download | Parse the outputs of the previous commands] ****************************
Thursday 17 September 2020  10:50:36 +0800 (0:00:00.096)       0:05:04.976 **** 

TASK [download : prep_download | Check that local user is in group or can become root] ******************
Thursday 17 September 2020  10:50:36 +0800 (0:00:00.093)       0:05:05.069 **** 

TASK [download : Use cri-o for cri connection] **********************************************************
Thursday 17 September 2020  10:50:36 +0800 (0:00:00.091)       0:05:05.161 **** 

TASK [download : Use containerd for cri connetion] ******************************************************
Thursday 17 September 2020  10:50:36 +0800 (0:00:00.083)       0:05:05.244 **** 

TASK [download : Use docker for cri connetion] **********************************************************
Thursday 17 September 2020  10:50:36 +0800 (0:00:00.090)       0:05:05.335 **** 
ok: [gpu3.test.bbdops.com]

TASK [download : include_tasks] *************************************************************************
Thursday 17 September 2020  10:50:37 +0800 (0:00:00.215)       0:05:05.550 **** 

TASK [download : download | Get kubeadm binary and list of required images] *****************************
Thursday 17 September 2020  10:50:37 +0800 (0:00:00.113)       0:05:05.664 **** 

TASK [download : download | Download files / images] ****************************************************
Thursday 17 September 2020  10:50:37 +0800 (0:00:00.096)       0:05:05.760 **** 

TASK [download : download | Sync files / images from ansible host to nodes] *****************************
Thursday 17 September 2020  10:50:38 +0800 (0:00:00.963)       0:05:06.723 **** 

TASK [kubespray-defaults : Configure defaults] **********************************************************
Thursday 17 September 2020  10:50:39 +0800 (0:00:00.964)       0:05:07.687 **** 
ok: [gpu3.test.bbdops.com] => {
    "msg": "Check roles/kubespray-defaults/defaults/main.yml"
}

TASK [kubernetes/tokens : Check_tokens | check if the tokens have already been generated on first master] ***
Thursday 17 September 2020  10:50:39 +0800 (0:00:00.257)       0:05:07.945 **** 

TASK [kubernetes/tokens : Check_tokens | Set default value for 'sync_tokens' and 'gen_tokens' to false] ***
Thursday 17 September 2020  10:50:39 +0800 (0:00:00.088)       0:05:08.033 **** 

TASK [kubernetes/tokens : Check_tokens | Set 'sync_tokens' and 'gen_tokens' to true] ********************
Thursday 17 September 2020  10:50:39 +0800 (0:00:00.097)       0:05:08.131 **** 

TASK [kubernetes/tokens : Check tokens | check if a cert already exists] ********************************
Thursday 17 September 2020  10:50:39 +0800 (0:00:00.088)       0:05:08.219 **** 

TASK [kubernetes/tokens : Check_tokens | Set 'sync_tokens' to true] *************************************
Thursday 17 September 2020  10:50:39 +0800 (0:00:00.087)       0:05:08.307 **** 

TASK [kubernetes/tokens : Make sure the tokens directory exits] *****************************************
Thursday 17 September 2020  10:50:39 +0800 (0:00:00.087)       0:05:08.395 **** 

TASK [kubernetes/tokens : Gen_tokens | copy tokens generation script] ***********************************
Thursday 17 September 2020  10:50:40 +0800 (0:00:00.137)       0:05:08.533 **** 

TASK [kubernetes/tokens : Gen_tokens | generate tokens for master components] ***************************
Thursday 17 September 2020  10:50:40 +0800 (0:00:00.169)       0:05:08.702 **** 

TASK [kubernetes/tokens : Gen_tokens | generate tokens for node components] *****************************
Thursday 17 September 2020  10:50:40 +0800 (0:00:00.152)       0:05:08.855 **** 

TASK [kubernetes/tokens : Gen_tokens | Get list of tokens from first master] ****************************
Thursday 17 September 2020  10:50:40 +0800 (0:00:00.165)       0:05:09.020 **** 

TASK [kubernetes/tokens : Gen_tokens | Gather tokens] ***************************************************
Thursday 17 September 2020  10:50:40 +0800 (0:00:00.143)       0:05:09.163 **** 

TASK [kubernetes/tokens : Gen_tokens | Copy tokens on masters] ******************************************
Thursday 17 September 2020  10:50:40 +0800 (0:00:00.102)       0:05:09.266 **** 

TASK [kubernetes/master : Pre-upgrade | Delete master manifests if etcd secrets changed] ****************
Thursday 17 September 2020  10:50:40 +0800 (0:00:00.088)       0:05:09.355 **** 

TASK [kubernetes/master : Pre-upgrade | Delete master containers forcefully] ****************************
Thursday 17 September 2020  10:50:40 +0800 (0:00:00.129)       0:05:09.484 **** 

TASK [kubernetes/master : Make sure the users directory exits] ******************************************
Thursday 17 September 2020  10:50:41 +0800 (0:00:00.129)       0:05:09.613 **** 

TASK [kubernetes/master : Populate users for basic auth in API] *****************************************
Thursday 17 September 2020  10:50:41 +0800 (0:00:00.090)       0:05:09.704 **** 

TASK [kubernetes/master : Create webhook token auth config] *********************************************
Thursday 17 September 2020  10:50:41 +0800 (0:00:00.097)       0:05:09.801 **** 

TASK [kubernetes/master : Check if secret for encrypting data at rest already exist] ********************
Thursday 17 September 2020  10:50:41 +0800 (0:00:00.092)       0:05:09.894 **** 

TASK [kubernetes/master : Slurp secrets_encryption file if it exists] ***********************************
Thursday 17 September 2020  10:50:41 +0800 (0:00:00.087)       0:05:09.982 **** 

TASK [kubernetes/master : Base 64 Decode slurped secrets_encryption.yaml file] **************************
Thursday 17 September 2020  10:50:41 +0800 (0:00:00.092)       0:05:10.074 **** 

TASK [kubernetes/master : Extract secret value from secrets_encryption.yaml] ****************************
Thursday 17 September 2020  10:50:41 +0800 (0:00:00.082)       0:05:10.157 **** 

TASK [kubernetes/master : Set kube_encrypt_token across master nodes] ***********************************
Thursday 17 September 2020  10:50:41 +0800 (0:00:00.145)       0:05:10.303 **** 

TASK [kubernetes/master : Write secrets for encrypting secret data at rest] *****************************
Thursday 17 September 2020  10:50:41 +0800 (0:00:00.090)       0:05:10.393 **** 

TASK [kubernetes/master : Install | Copy kubectl binary from download dir] ******************************
Thursday 17 September 2020  10:50:42 +0800 (0:00:00.146)       0:05:10.540 **** 
ok: [gpu3.test.bbdops.com -> 10.28.100.165]

TASK [kubernetes/master : install | Set kubectl binary permissions] *************************************
Thursday 17 September 2020  10:50:42 +0800 (0:00:00.545)       0:05:11.086 **** 
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/master : Install kubectl bash completion] **********************************************
Thursday 17 September 2020  10:50:43 +0800 (0:00:00.418)       0:05:11.505 **** 
changed: [gpu3.test.bbdops.com]

TASK [kubernetes/master : Set kubectl bash completion file permissions] *********************************
Thursday 17 September 2020  10:50:43 +0800 (0:00:00.627)       0:05:12.132 **** 
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/master : Disable SecurityContextDeny admission-controller and enable PodSecurityPolicy] ***
Thursday 17 September 2020  10:50:44 +0800 (0:00:00.427)       0:05:12.559 **** 

TASK [kubernetes/master : kubeadm | Check if old apiserver cert exists on host] *************************
Thursday 17 September 2020  10:50:44 +0800 (0:00:00.146)       0:05:12.706 **** 
ok: [gpu3.test.bbdops.com -> 10.28.100.165]

TASK [kubernetes/master : Copy old certs to the kubeadm expected path] **********************************
Thursday 17 September 2020  10:50:44 +0800 (0:00:00.416)       0:05:13.122 **** 

TASK [kubernetes/master : Install OIDC certificate] *****************************************************
Thursday 17 September 2020  10:50:44 +0800 (0:00:00.250)       0:05:13.373 **** 

TASK [kubernetes/master : kubeadm | Check serviceaccount key] *******************************************
Thursday 17 September 2020  10:50:44 +0800 (0:00:00.096)       0:05:13.469 **** 
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/master : kubeadm | Check if kubeadm has already run] ***********************************
Thursday 17 September 2020  10:50:45 +0800 (0:00:00.416)       0:05:13.885 **** 
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/master : kubeadm | Delete old admin.conf] **********************************************
Thursday 17 September 2020  10:50:45 +0800 (0:00:00.416)       0:05:14.301 **** 
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/master : kubeadm | Delete old static pods] *********************************************
Thursday 17 September 2020  10:50:46 +0800 (0:00:00.421)       0:05:14.723 **** 

TASK [kubernetes/master : kubeadm | Forcefully delete old static pods] **********************************
Thursday 17 September 2020  10:50:46 +0800 (0:00:00.139)       0:05:14.863 **** 

TASK [kubernetes/master : kubeadm | aggregate all SANs] *************************************************
Thursday 17 September 2020  10:50:46 +0800 (0:00:00.116)       0:05:14.979 **** 
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/master : Create audit-policy directory] ************************************************
Thursday 17 September 2020  10:50:47 +0800 (0:00:00.589)       0:05:15.569 **** 

TASK [kubernetes/master : Write api audit policy yaml] **************************************************
Thursday 17 September 2020  10:50:47 +0800 (0:00:00.090)       0:05:15.660 **** 

TASK [kubernetes/master : set kubeadm_config_api_fqdn define] *******************************************
Thursday 17 September 2020  10:50:47 +0800 (0:00:00.087)       0:05:15.748 **** 

TASK [kubernetes/master : gets the kubeadm version] *****************************************************
Thursday 17 September 2020  10:50:47 +0800 (0:00:00.090)       0:05:15.839 **** 
changed: [gpu3.test.bbdops.com]

TASK [kubernetes/master : sets kubeadm api version to v1beta1] ******************************************
Thursday 17 September 2020  10:50:47 +0800 (0:00:00.472)       0:05:16.312 **** 
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/master : sets kubeadm api version to v1beta2] ******************************************
Thursday 17 September 2020  10:50:48 +0800 (0:00:00.204)       0:05:16.516 **** 
ok: [gpu3.test.bbdops.com]

TASK [kubernetes/master : kubeadm | Create kubeadm config] **********************************************
Thursday 17 September 2020  10:50:48 +0800 (0:00:00.205)       0:05:16.721 **** 
changed: [gpu3.test.bbdops.com]

TASK [kubernetes/master : Backup old certs and keys] ****************************************************
Thursday 17 September 2020  10:50:49 +0800 (0:00:01.395)       0:05:18.117 **** 

TASK [kubernetes/master : kubeadm | Initialize first master] ********************************************
Thursday 17 September 2020  10:50:49 +0800 (0:00:00.186)       0:05:18.303 **** 
FAILED - RETRYING: kubeadm | Initialize first master (3 retries left).

123toorc commented 4 years ago

Hi Zhiyuan,

From the kubelet service journal log below, it looks like kube-apiserver is not running. I do have its Docker image, but it never ran. Thanks.

Sep 17 11:26:56 gpu3.test.bbdops.com kubelet[32353]: I0917 11:26:56.746230   32353 kubelet.go:307] Watching apiserver
Sep 17 11:26:56 gpu3.test.bbdops.com kubelet[32353]: E0917 11:26:56.747305   32353 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:454: Failed to list *v1.Node: Get https://10.28.100.165:6443/api/v1/nodes?fieldSelector=metadata.name%3Dgpu3.test.bbdops.com&limit=500&resourceVersion=0: dial tcp 10.28.100.165:6443: connect: connection refused
Sep 17 11:26:56 gpu3.test.bbdops.com kubelet[32353]: E0917 11:26:56.747306   32353 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:445: Failed to list *v1.Service: Get https://10.28.100.165:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.28.100.165:6443: connect: connection refused
Sep 17 11:26:56 gpu3.test.bbdops.com kubelet[32353]: E0917 11:26:56.747305   32353 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.28.100.165:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dgpu3.test.bbdops.com&limit=500&resourceVersion=0: dial tcp 10.28.100.165:6443: connect: connection refused
Sep 17 11:26:56 gpu3.test.bbdops.com kubelet[32353]: I0917 11:26:56.748726   32353 client.go:75] Connecting to docker on unix:///var/run/docker.sock
Sep 17 11:26:56 gpu3.test.bbdops.com kubelet[32353]: I0917 11:26:56.748750   32353 client.go:104] Start docker client with request timeout=2m0s
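
A quick way to confirm that, for example (commands assumed):

$ docker ps -a --filter name=kube-apiserver   # no apiserver container at all
$ ss -lntp | grep 6443                        # nothing listening on the apiserver port
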
hzy46 commented 4 years ago

I found exactly the same problem here: https://github.com/kubernetes/kubernetes/issues/79046

Kubernetes 1.14.x adds a pid limit feature, which requires kernel support. I'm afraid the Linux kernel on your master node doesn't support pid control in cgroups. To confirm, could you please paste the result of cat /proc/cgroups on your master node?

123toorc commented 4 years ago

Amazing. If it is a kernel issue, that will be super easy to fix, in my opinion...

$ cat /proc/cgroups 
#subsys_name    hierarchy   num_cgroups enabled
cpuset  6   39  1
cpu 2   136 1
cpuacct 2   136 1
blkio   10  136 1
memory  4   427 1
devices 8   136 1
freezer 3   39  1
net_cls 5   39  1
perf_event  9   39  1
net_prio    5   39  1
hugetlb 7   39  1
hzy46 commented 4 years ago

You can see you don't have a pids cgroup entry like the one in the following example:

(screenshot: /proc/cgroups output that includes a pids subsystem line)

I suggest:

  1. On the dev box machine, go to ~/pai-deploy/kubespray and run ansible-playbook -i inventory/pai/hosts.yml reset.yml --become --become-user=root -e "@inventory/pai/openpai.yml" to completely remove the current broken k8s.

  2. Fix this pids cgroup issue on all your worker and master nodes (a verification sketch follows this list).

  3. Have a retry.
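
For step 2, one way to verify the pids controller on every node before retrying is an ansible ad-hoc command run from ~/pai-deploy/kubespray against the same inventory as step 1 (a sketch):

$ ansible -i inventory/pai/hosts.yml all --become -m shell -a "grep -w pids /proc/cgroups"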

123toorc commented 4 years ago

Thanks! Will do. This is the root cause, right? One thing I'm curious about: from my point of view, it seems kubelet first fails to connect to the apiserver and only then reports the cgroup issue. Could you share some detail about this? Thank you very much!

hzy46 commented 4 years ago

In my opinion, the API server connection problem is not the root cause. Sometimes the API server is down, and kubelet will just retry the connection periodically; in other words, an API server connection problem won't cause a fatal error in kubelet. Of course, maybe I'm wrong, and you can dive into Kubernetes' source code to find out more :-)

123toorc commented 4 years ago

Thank you! Super appreciated! Plus I'm not good at swimming in the code sea :D

oldthreefeng commented 4 years ago

@123toorc Hello, do you know how to fix the pids cgroup issue? I met the same situation...

123toorc commented 4 years ago

> @123toorc Hello, do you know how to fix the pids cgroup issue? I met the same situation...

Yes. Kindly update your kernel, or enable the feature if your kernel is already recent. The pids controller was introduced in the vanilla kernel >= 4.3.
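
A quick per-node check (a sketch):

$ uname -r                                        # vanilla kernels need >= 4.3
$ grep -w pids /proc/cgroups || echo "pids controller missing"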

zhangguanzhang commented 3 years ago

@hzy46 @123toorc Hi both. I've found that this doesn't seem to be purely kernel-related: checking /proc/cgroups on CentOS 7.2, 7.6, and 7.8, most machines whose kernels were never upgraded do have pids. So far only one CentOS 7.8 machine is missing pids, and I haven't found the reason yet.

123toorc commented 3 years ago

Check the Linux kernel config (/boot/config-<kernel version>, or /proc/config.gz) for CONFIG_CGROUP_PIDS; it may not be enabled.

If it doesn't show up under /proc, the kernel most likely doesn't support this feature.

zhangguanzhang commented 3 years ago
# grep PIDS /boot/config-3.10.0-1062.el7.x86_64 
CONFIG_CGROUP_PIDS=y

Can this be enabled by adding the option manually?

123toorc commented 3 years ago

Hmm, your kernel is 3.10? That doesn't seem right; CentOS 7.8 is quite new. I'm not very familiar with CentOS, but check your kernel version. If the config doesn't have the option, the cgroup feature may simply not be enabled; otherwise it would appear as "config is not set". Adding it manually means recompiling the kernel.
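
Illustratively, the two forms the option takes in a kernel config file:

CONFIG_CGROUP_PIDS=y              # controller compiled in
# CONFIG_CGROUP_PIDS is not set   # controller disabled; changing this means rebuilding the kernel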

zhangguanzhang commented 3 years ago

CentOS backports patches, so its kernels don't match the upstream GitHub kernel tags; Red Hat maintains them itself. CentOS 7 always ships a 3.10.x kernel, and the config file looks like the one above. I've figured it out: the kernel on this machine had been upgraded by someone, and that kernel doesn't have the option.

poetryben88 commented 3 years ago

root@fxkj:/boot# cat config-5.4.0-42-generic | grep CONFIG_CGROUP_PIDS
CONFIG_CGROUP_PIDS=y

root@fxkj:/boot# cat /proc/cgroups
#subsys_name    hierarchy   num_cgroups enabled
cpuset      5   14   1
cpu         4   147  1
cpuacct     4   147  1
blkio       8   147  1
memory      7   286  1
devices     9   147  1
freezer     6   14   1
net_cls     10  14   1
perf_event  3   14   1
net_prio    10  14   1
hugetlb     11  14   1
pids        2   192  1
rdma        12  1    1

@123toorc My pids controller is enabled, but I still hit the same problem. How can this be fixed? Thanks!

/bin/bash quick-start-kubespray.sh errors out with:

TASK [kubernetes/master : kubeadm | Initialize first master] ***
Saturday 09 October 2021 14:54:46 +0800 (0:00:00.106)       0:03:29.453 **
FAILED - RETRYING: kubeadm | Initialize first master (3 retries left).
FAILED - RETRYING: kubeadm | Initialize first master (2 retries left).
FAILED - RETRYING: kubeadm | Initialize first master (1 retries left).
fatal: [pai-master]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["timeout", "-k", "300s", "300s", "/usr/local/bin/kubeadm", "init", "--config=/etc/kubernetes/kubeadm-config.yaml", "--ignore-preflight-errors=all", "--skip-phases=addon/coredns", "--upload-certs"], "delta": "0:05:00.008053", "end": "2021-10-09 15:15:02.734350", "failed_when_result": true, "msg": "non-zero return code", "rc": 124, "start": "2021-10-09 15:10:02.726297", "stderr"

zhangguanzhang commented 3 years ago

Run that command manually and look at the output.
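
That is, roughly (reconstructed from the cmd field in the failed task above):

$ sudo timeout -k 300s 300s /usr/local/bin/kubeadm init --config=/etc/kubernetes/kubeadm-config.yaml --ignore-preflight-errors=all --skip-phases=addon/coredns --upload-certs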

poetryben88 commented 3 years ago

> Run that command manually and look at the output.

I've found the problem: some of the images in the registry configuration were never pulled, so some containers didn't start in order. The error message has nothing to do with the actual cause; it cost me a whole day.
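
If missing images are the failure mode, one way to pre-pull everything before retrying is kubeadm's own image helper, using the same config file as the failed task (a sketch):

$ kubeadm config images list --config=/etc/kubernetes/kubeadm-config.yaml
$ kubeadm config images pull --config=/etc/kubernetes/kubeadm-config.yaml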