microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License
2.64k stars 548 forks

Kubernetes cluster bootup failed - kubectl ready test failed #4629

Closed victorming closed 4 years ago

victorming commented 4 years ago

Organization Name: EduCloud

Short summary about the issue/question: Kubernetes cluster bootup failed with the response "kubectl ready test failed. Exit paictl."

Brief what process you are following: This error occurred when I booted up the Kubernetes cluster with: python paictl.py cluster k8s-bootup -p ./pai-config

I checked pai/deployment/k8sPaiLibrary/maintainlib/kubectl_install.py and found that it executes the script 'kubectl_install.sh'. The curl -OL download of kubectl from googleapis failed due to the GFW, so I downloaded kubectl myself and put it on the dev-box. After that the kubectl check reported that kubectl exists and skipped the installation. However, it still failed when executing 'kubectl get nodes'.
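For reference, a minimal sketch of the manual workaround described above; the install path is an assumption, and behind the GFW the binary has to come from a reachable mirror or be copied onto the dev-box by other means:

# after obtaining a kubectl binary that matches the cluster version (v1.9.9 here) and copying it onto the dev-box
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
# confirm the client works before re-running paictl
kubectl version --client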

OpenPAI Environment: Ubuntu 16.04

Anything else we need to know:

victorming commented 4 years ago

The log from dev-box:

2020-06-17 08:57:36,102 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Execute the script to install kubectl on your host!
kubectl has been installed. Skip this precess
2020-06-17 08:57:36,117 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Successfully install kubectl on the dev-box.
2020-06-17 08:57:36,118 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Generate the configuation file of kubectl.
2020-06-17 08:57:36,118 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Cluster configuration is detected.
2020-06-17 08:57:36,118 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Generate the KUBECONIFG based on the cluster configuration.
2020-06-17 08:57:36,120 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Successfully configure kubeconfig in the dev-box.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 08:58:36,486 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 08:58:36,487 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 08:59:41,560 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 08:59:41,561 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:00:46,637 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:00:46,638 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:01:51,707 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:01:51,708 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:02:56,779 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:02:56,780 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:04:01,886 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:04:01,887 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:05:06,958 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:05:06,960 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:06:12,045 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:06:12,046 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:07:17,110 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:07:17,111 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:08:22,202 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:08:22,203 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:09:27,611 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:09:27,613 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
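The repeated timeouts suggest the dev-box can reach the API server but never gets an answer in time. A quick way to probe it directly (a sketch; the verbose flag is standard kubectl, while the address and insecure port 8080 are assumptions based on the load-balance-ip posted later in this thread, not confirmed here):

# show client-side request/response details to see where the call stalls
kubectl get nodes -v=6
# optionally hit the API server health endpoint directly, with a 10-second timeout
curl -m 10 http://192.168.101.241:8080/healthz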

ydye commented 4 years ago

1: Paste your kubernetes-configuration.yaml and services-configuration.yaml.
2: Paste the output of these commands on your master node: sudo docker ps -a and sudo docker logs kubelet.
3: Please paste with markdown. Raw text is hard to debug.

victorming commented 4 years ago

kubernetes-configuration.yaml:

kubernetes:
  # Find the nameserver in /etc/resolv.conf
  cluster-dns: 127.0.1.1
  # To support k8s ha, you should set an lb address here.
  # If deploy k8s with single master node, please set master IP address here
  load-balance-ip: 192.168.101.241

  # specify an IP range not in the same network segment with the host machine.
  service-cluster-ip-range: 10.254.0.0/16
  # According to the etcd version, you should fill in a corresponding backend name.
  # If you are not familiar with etcd, please don't change it.
  storage-backend: etcd3
  # The docker registry used in the k8s deployment. If you can access gcr, we suggest using gcr.
  #docker-registry: gcr.io/google_containers
  docker-registry: mirrorgooglecontainers
  # http://gcr.io/google_containers/hyperkube. Or the tag in your registry.
  hyperkube-version: v1.9.9
  # http://gcr.io/google_containers/etcd. Or the tag in your registry.
  # If you are not familiar with etcd, please don't change it.
  etcd-version: 3.2.17
  # http://gcr.io/google_containers/kube-apiserver. Or the tag in your registry.
  apiserver-version: v1.9.9
  # http://gcr.io/google_containers/kube-scheduler. Or the tag in your registry.
  kube-scheduler-version: v1.9.9
  # http://gcr.io/google_containers/kube-controller-manager
  kube-controller-manager-version:  v1.9.9
  # http://gcr.io/google_containers/kubernetes-dashboard-amd64
  dashboard-version: v1.8.3
  # The path to storage etcd data.
  etcd-data-path: "/var/etcd"

  # #Enable QoS feature for k8s or not. Default value is "true"
  # qos-switch: "true"
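As an aside, the cluster-dns value above is normally the host's resolver; a minimal way to check it, assuming a stock Ubuntu 16.04 setup:

# 127.0.1.1 is the local dnsmasq stub resolver on a default Ubuntu 16.04 install
grep nameserver /etc/resolv.conf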

services-configuration.yaml:

cluster:
  #common:
  #  cluster-id: pai-example
  #
  #  # HDFS, zookeeper data path on your cluster machine.
  #  data-path: "/datastorage"

  # the docker registry to store docker images that contain system services like frameworklauncher, hadoop, etc.
  docker-registry:

   # The namespace in your registry. If the registry is docker.io, the namespace will be your user account.
   namespace: openpai

   # E.g., gcr.io.
   # if the registry is hub.docker, please fill this value with docker.io
   #domain: docker.io
   domain: docker.io
   # If the docker registry doesn't require authentication, please comment username and password
   # username: 
   # password: 

   tag: v0.14.0

   # The name of the secret in kubernetes will be created in your cluster
   # Must be lower case, e.g., regsecret.
   secret-name: zs123456
#Uncomment following lines if you want to customize yarn
#hadoop-resource-manager:
#  # job log retain time
#  yarn_log_retain_seconds: 2592000
#  # port for yarn exporter
#  yarn_exporter_port: 9459

#Uncomment following lines if you want to customize hdfs
#hadoop-data-node:
#  # storage path for hdfs, support comma-delimited list of directories, eg. /path/to/folder1,/path/to/folder2 ...
#  # if left empty, will use cluster.common.data-path/hdfs/data
#  storage_path:

# uncomment following if you want to customize yarn-frameworklauncher
#yarn-frameworklauncher:
#  frameworklauncher-port: 9086

rest-server:
   # database admin username
   default-pai-admin-username: admin
   # database admin password
   default-pai-admin-password: adminok

# uncomment following section if you want to customize the port of web portal
# webportal:
#   server-port: 9286

# uncomment following if you want to customize grafana
# grafana:
#   port: 3000

# uncomment following if you want to customize drivers
drivers:
    set-nvidia-runtime: false 
    #  # You can set the drivers version here. If this value is missing, the default value will be 384.111
    #  # Current supported version list
    #  # 384.111
    #  # 390.25
    #  # 410.73
    #  # 418.56
    version: "410.73"
    pre-installed-nvidia-path: /var/drivers/nvidia/410.73

  # uncomment following if you want node-exporter listen to different port
  # node-exporter:
  #   port: 9100

  # uncomment following if you want to customize job-exporter
  # job-exporter:
  #   port: 9102
  #   logging-level: INFO
  #   interface: eth0,eno2

  # if you want to enable alert manager to send alert email, uncomment following lines and fill
  # the right values.
  # alert-manager:
     #   receiver: your_addr@example.com
     #   smtp_url: smtp.office365.com:587
     #   smtp_from: alert_sender@example.com
     #   smtp_auth_username: alert_sender@example.com
     #   smtp_auth_password: password_for_alert_sender
     #   port: 9093 # this is optional, you should not write this if you do not want to change the port alert-manager is listening on

   # uncomment following if you want to customize prometheus
   # prometheus:
     #   port: 9091
     #   # How frequently to scrape targets
     #   scrape_interval: 30
     #
     #   # if you want to use key file to login nodes
     #   

    # uncomment following section if you want to customize the port of pylon
    # pylon:
    #  port: 80

   # uncomment following section if you want to customize the threshold of cleaner
   # cleaner:
   #  threshold: 94
   #  interval: 60

docker ps -a

CONTAINER ID        IMAGE                                            COMMAND                  CREATED             STATUS              PORTS               NAMES
277b5e19d670        mirrorgooglecontainers/kube-scheduler            "/usr/local/bin/kube…"   4 hours ago         Up 4 hours                              k8s_kube-scheduler_kube-scheduler-192.168.101.241_kube-system_6257a0e31e71410babc94f4a2507482b_4
2f64cec52a90        mirrorgooglecontainers/kube-controller-manager   "/usr/local/bin/kube…"   4 hours ago         Up 4 hours                              k8s_kube-controller-manager_kube-controller-manager-192.168.101.241_kube-system_e86cc1850ab01d3d1111549de6ad2658_4
047714a77123        mirrorgooglecontainers/kube-apiserver            "/usr/local/bin/kube…"   4 hours ago         Up 4 hours                              k8s_apiserver-container_kube-apiserver-192.168.101.241_kube-system_cc6d87d35ec06282108492119a40a44e_4
19f472ad4c4e        mirrorgooglecontainers/pause-amd64:3.0           "/pause"                 4 hours ago         Up 4 hours                              k8s_POD_etcd-server-192.168.101.241_default_2633ab100382172b5691a1f06cd48c24_4
06b0cf02d15a        mirrorgooglecontainers/pause-amd64:3.0           "/pause"                 4 hours ago         Up 4 hours                              k8s_POD_kube-apiserver-192.168.101.241_kube-system_cc6d87d35ec06282108492119a40a44e_4
ac7a869be34b        mirrorgooglecontainers/pause-amd64:3.0           "/pause"                 4 hours ago         Up 4 hours                              k8s_POD_kube-scheduler-192.168.101.241_kube-system_6257a0e31e71410babc94f4a2507482b_4
767682562fc4        mirrorgooglecontainers/pause-amd64:3.0           "/pause"                 4 hours ago         Up 4 hours                              k8s_POD_kube-controller-manager-192.168.101.241_kube-system_e86cc1850ab01d3d1111549de6ad2658_4
894532d8dc17        mirrorgooglecontainers/hyperkube:v1.9.9          "/hyperkube kubelet …"   2 days ago          Up 4 hours                              kubelet

docker logs kubelet

E0620 06:52:31.290332    2894 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "192.168.101.241" not found
I0620 06:52:36.475720    2894 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
I0620 06:52:36.533373    2894 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.101.241
I0620 06:52:36.533436    2894 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.101.241
I0620 06:52:36.533451    2894 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.101.241
I0620 06:52:36.533483    2894 kubelet_node_status.go:82] Attempting to register node 192.168.101.241
E0620 06:52:41.290606    2894 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "192.168.101.241" not found
E0620 06:52:45.551364    2894 kubelet_node_status.go:106] Unable to register node "192.168.101.241" with API server: Unable to refresh the Webhook configuration: the server was unable to return a response in the time allotted, but may still be processing the request (get mutatingwebhookconfigurations.admissionregistration.k8s.io)
I0620 06:52:46.533153    2894 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
I0620 06:52:46.589720    2894 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.101.241
I0620 06:52:46.589766    2894 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.101.241
I0620 06:52:46.589783    2894 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.101.241
E0620 06:52:47.575051    2894 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"192.168.101.241.161a211de50109bb", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"192.168.101.241", UID:"192.168.101.241", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeHasSufficientDisk", Message:"Node 192.168.101.241 status is now: NodeHasSufficientDisk", Source:v1.EventSource{Component:"kubelet", Host:"192.168.101.241"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfb37ca19e524dbb, ext:5998866895, loc:(*time.Location)(0xab3aa60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfb38a68622bb171, ext:14113063445387, loc:(*time.Location)(0xab3aa60)}}, Count:2157, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
E0620 06:52:51.290866    2894 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "192.168.101.241" not found
I0620 06:52:52.551622    2894 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
I0620 06:52:52.592016    2894 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.101.241
I0620 06:52:52.592077    2894 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.101.241
I0620 06:52:52.592101    2894 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.101.241
I0620 06:52:52.592137    2894 kubelet_node_status.go:82] Attempting to register node 192.168.101.241
E0620 06:53:01.291156    2894 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "192.168.101.241" not found
E0620 06:53:01.609558    2894 kubelet_node_status.go:106] Unable to register node "192.168.101.241" with API server: Unable to refresh the Webhook configuration: the server was unable to return a response in the time allotted, but may still be processing the request (get mutatingwebhookconfigurations.admissionregistration.k8s.io)
I0620 06:53:08.609824    2894 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
I0620 06:53:08.665122    2894 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.101.241
I0620 06:53:08.665197    2894 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.101.241
I0620 06:53:08.665219    2894 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.101.241
I0620 06:53:08.665246    2894 kubelet_node_status.go:82] Attempting to register node 192.168.101.241
E0620 06:53:11.291412    2894 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "192.168.101.241" not found
E0620 06:53:17.682489    2894 kubelet_node_status.go:106] Unable to register node "192.168.101.241" with API server: Unable to refresh the Webhook configuration: the server was unable to return a response in the time allotted, but may still be processing the request (get mutatingwebhookconfigurations.admissionregistration.k8s.io)
E0620 06:53:21.291740    2894 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "192.168.101.241" not found
I0620 06:53:24.682678    2894 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
I0620 06:53:24.736275    2894 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.101.241
I0620 06:53:24.736316    2894 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.101.241
I0620 06:53:24.736332    2894 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.101.241
I0620 06:53:24.736362    2894 kubelet_node_status.go:82] Attempting to register node 192.168.101.241
E0620 06:53:28.578268    2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
E0620 06:53:28.579241    2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:482: Failed to list *v1.Node: the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
E0620 06:53:28.580317    2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:473: Failed to list *v1.Service: the server was unable to return a response in the time allotted, but may still be processing the request (get services)

victorming commented 4 years ago

@ydye, is it possible that the etcd container is not running, and that this is what caused the error?
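A quick way to check this on the master node, along the lines of the docker commands already used above (the grep pattern and placeholder ID are illustrative):

# list all containers, including exited ones, and look for the etcd server container
sudo docker ps -a | grep etcd
# then read its logs using the container name or ID from the previous command
sudo docker logs <etcd-container-id>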

victorming commented 4 years ago

I checked the logs of the etcd container and got the following:

2020-06-20 07:43:36.830391 I | etcdmain: etcd Version: 3.2.17
2020-06-20 07:43:36.830465 I | etcdmain: Git SHA: 28c47bb2f
2020-06-20 07:43:36.830488 I | etcdmain: Go Version: go1.8.7
2020-06-20 07:43:36.830495 I | etcdmain: Go OS/Arch: linux/amd64
2020-06-20 07:43:36.830502 I | etcdmain: setting maximum number of CPUs to 12, total number of available CPUs is 12
2020-06-20 07:43:36.830562 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-06-20 07:43:36.830637 I | embed: listening for peers on http://0.0.0.0:2380
2020-06-20 07:43:36.830683 I | embed: listening for client requests on 0.0.0.0:4001
2020-06-20 07:43:36.833184 I | etcdserver: recovered store from snapshot at index 9100091
2020-06-20 07:43:36.837011 I | mvcc: restore compact to 8869249
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Bucket).pageNode(0xc420053f40, 0x613123d463333321, 0x7f78d93d7000, 0x0)

/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20

victorming commented 4 years ago

I found the root cause: the etcd container was failing. I removed the directory '/var/etcd/data' and ran k8s-bootup again, and it worked. I think the k8s-clean script may have forgotten to clean up the etcd data.
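For anyone hitting the same panic, a minimal sketch of the recovery described above; the data path matches the etcd-data-path in the kubernetes-configuration.yaml posted earlier, and wiping it discards all existing cluster state:

# on the master node: remove the corrupted etcd data directory
sudo rm -rf /var/etcd/data
# from the dev-box: bring the Kubernetes cluster up again
python paictl.py cluster k8s-bootup -p ./pai-config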