victorming closed this issue 4 years ago
The log from dev-box:

2020-06-17 08:57:36,102 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Execute the script to install kubectl on your host!
kubectl has been installed. Skip this precess
2020-06-17 08:57:36,117 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Successfully install kubectl on the dev-box.
2020-06-17 08:57:36,118 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Generate the configuation file of kubectl.
2020-06-17 08:57:36,118 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Cluster configuration is detected.
2020-06-17 08:57:36,118 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Generate the KUBECONIFG based on the cluster configuration.
2020-06-17 08:57:36,120 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Successfully configure kubeconfig in the dev-box.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 08:58:36,486 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 08:58:36,487 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 08:59:41,560 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 08:59:41,561 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:00:46,637 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:00:46,638 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:01:51,707 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:01:51,708 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:02:56,779 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:02:56,780 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:04:01,886 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:04:01,887 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:05:06,958 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:05:06,960 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:06:12,045 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:06:12,046 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:07:17,110 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:07:17,111 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:08:22,202 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:08:22,203 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2020-06-17 09:09:27,611 [WARNING] - deployment.k8sPaiLibrary.maintainlib.common : There will be a delay after installing, please wait.
2020-06-17 09:09:27,613 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Wait 5s, and retry it later.
1: Paste your kubernetes-configuration.yaml and services-configuration.yaml
2: Paste the output of these commands on your master node: sudo docker ps -a and sudo docker logs kubelet
3: Please paste with markdown. Raw text is hard to debug.
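For reference, a minimal way to capture the requested output on the master node might look like this (the file names are just placeholders):

```bash
# Run on the master node; redirect output to files that can be pasted into the issue
sudo docker ps -a > docker-ps.txt
sudo docker logs kubelet > kubelet.log 2>&1
```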
kubernetes:
  # Find the nameserver in /etc/resolv.conf
  cluster-dns: 127.0.1.1
  # To support k8s ha, you should set an lb address here.
  # If you deploy k8s with a single master node, please set the master IP address here
  load-balance-ip: 192.168.101.241
  # Specify an IP range not in the same network segment as the host machine.
  service-cluster-ip-range: 10.254.0.0/16
  # According to the etcd version, you should fill in a corresponding backend name.
  # If you are not familiar with etcd, please don't change it.
  storage-backend: etcd3
  # The docker registry used in the k8s deployment. If you can access gcr, we suggest using gcr.
  #docker-registry: gcr.io/google_containers
  docker-registry: mirrorgooglecontainers
  # http://gcr.io/google_containers/hyperkube. Or the tag in your registry.
  hyperkube-version: v1.9.9
  # http://gcr.io/google_containers/etcd. Or the tag in your registry.
  # If you are not familiar with etcd, please don't change it.
  etcd-version: 3.2.17
  # http://gcr.io/google_containers/kube-apiserver. Or the tag in your registry.
  apiserver-version: v1.9.9
  # http://gcr.io/google_containers/kube-scheduler. Or the tag in your registry.
  kube-scheduler-version: v1.9.9
  # http://gcr.io/google_containers/kube-controller-manager
  kube-controller-manager-version: v1.9.9
  # http://gcr.io/google_containers/kubernetes-dashboard-amd64
  dashboard-version: v1.8.3
  # The path to store etcd data.
  etcd-data-path: "/var/etcd"
  # # Enable QoS feature for k8s or not. Default value is "true"
  # qos-switch: "true"
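As a side note on the cluster-dns value above, a quick way to check the host's resolver is shown below. On Ubuntu 16.04, 127.0.1.1 is typically the local dnsmasq/NetworkManager stub, so whether it is actually reachable from inside containers is worth double-checking:

```bash
# Show the nameserver(s) the host is using, per the comment in the config
grep nameserver /etc/resolv.conf
```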
cluster:
  #common:
  #  cluster-id: pai-example
  #
  #  # HDFS, zookeeper data path on your cluster machine.
  #  data-path: "/datastorage"
  # The docker registry to store docker images that contain system services like frameworklauncher, hadoop, etc.
  docker-registry:
    # The namespace in your registry. If the registry is docker.io, the namespace will be your user account.
    namespace: openpai
    # E.g., gcr.io.
    # If the registry is hub.docker, please fill this value with docker.io
    #domain: docker.io
    domain: docker.io
    # If the docker registry doesn't require authentication, please comment out username and password
    # username:
    # password:
    tag: v0.14.0
    # The name of the secret that will be created in your kubernetes cluster.
    # Must be lower case, e.g., regsecret.
    secret-name: zs123456
# Uncomment the following lines if you want to customize yarn
#hadoop-resource-manager:
#  # job log retain time
#  yarn_log_retain_seconds: 2592000
#  # port for yarn exporter
#  yarn_exporter_port: 9459
# Uncomment the following lines if you want to customize hdfs
#hadoop-data-node:
#  # storage path for hdfs; supports a comma-delimited list of directories, e.g. /path/to/folder1,/path/to/folder2 ...
#  # if left empty, will use cluster.common.data-path/hdfs/data
#  storage_path:
# Uncomment the following if you want to customize yarn-frameworklauncher
#yarn-frameworklauncher:
#  frameworklauncher-port: 9086
rest-server:
  # database admin username
  default-pai-admin-username: admin
  # database admin password
  default-pai-admin-password: adminok
# Uncomment the following section if you want to customize the port of web portal
# webportal:
#   server-port: 9286
# Uncomment the following if you want to customize grafana
# grafana:
#   port: 3000
# Uncomment the following if you want to customize drivers
drivers:
  set-nvidia-runtime: false
  # # You can set the drivers version here. If this value is missing, the default value will be 384.111
  # # Currently supported version list:
  # # 384.111
  # # 390.25
  # # 410.73
  # # 418.56
  version: "410.73"
  pre-installed-nvidia-path: /var/drivers/nvidia/410.73
# Uncomment the following if you want node-exporter to listen on a different port
# node-exporter:
#   port: 9100
# Uncomment the following if you want to customize job-exporter
# job-exporter:
#   port: 9102
#   logging-level: INFO
#   interface: eth0,eno2
# If you want to enable alert manager to send alert emails, uncomment the following lines and fill in
# the right values.
# alert-manager:
#   receiver: your_addr@example.com
#   smtp_url: smtp.office365.com:587
#   smtp_from: alert_sender@example.com
#   smtp_auth_username: alert_sender@example.com
#   smtp_auth_password: password_for_alert_sender
#   port: 9093 # this is optional; you should not write this if you do not want to change the port alert-manager is listening on
# Uncomment the following if you want to customize prometheus
# prometheus:
#   port: 9091
#   # How frequently to scrape targets
#   scrape_interval: 30
#
#   # if you want to use a key file to log in to nodes
#
# Uncomment the following section if you want to customize the port of pylon
# pylon:
#   port: 80
# Uncomment the following section if you want to customize the threshold of cleaner
# cleaner:
#   threshold: 94
#   interval: 60
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
277b5e19d670 mirrorgooglecontainers/kube-scheduler "/usr/local/bin/kube…" 4 hours ago Up 4 hours k8s_kube-scheduler_kube-scheduler-192.168.101.241_kube-system_6257a0e31e71410babc94f4a2507482b_4
2f64cec52a90 mirrorgooglecontainers/kube-controller-manager "/usr/local/bin/kube…" 4 hours ago Up 4 hours k8s_kube-controller-manager_kube-controller-manager-192.168.101.241_kube-system_e86cc1850ab01d3d1111549de6ad2658_4
047714a77123 mirrorgooglecontainers/kube-apiserver "/usr/local/bin/kube…" 4 hours ago Up 4 hours k8s_apiserver-container_kube-apiserver-192.168.101.241_kube-system_cc6d87d35ec06282108492119a40a44e_4
19f472ad4c4e mirrorgooglecontainers/pause-amd64:3.0 "/pause" 4 hours ago Up 4 hours k8s_POD_etcd-server-192.168.101.241_default_2633ab100382172b5691a1f06cd48c24_4
06b0cf02d15a mirrorgooglecontainers/pause-amd64:3.0 "/pause" 4 hours ago Up 4 hours k8s_POD_kube-apiserver-192.168.101.241_kube-system_cc6d87d35ec06282108492119a40a44e_4
ac7a869be34b mirrorgooglecontainers/pause-amd64:3.0 "/pause" 4 hours ago Up 4 hours k8s_POD_kube-scheduler-192.168.101.241_kube-system_6257a0e31e71410babc94f4a2507482b_4
767682562fc4 mirrorgooglecontainers/pause-amd64:3.0 "/pause" 4 hours ago Up 4 hours k8s_POD_kube-controller-manager-192.168.101.241_kube-system_e86cc1850ab01d3d1111549de6ad2658_4
894532d8dc17 mirrorgooglecontainers/hyperkube:v1.9.9 "/hyperkube kubelet …" 2 days ago Up 4 hours kubelet
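Note that in the listing above only the pause (POD) container for etcd-server appears, with no etcd application container next to it. A quick way to check whether etcd itself is running, or has exited, is something like the following; the kubelet log from the same node follows after it:

```bash
# List all etcd-related containers, including exited ones
sudo docker ps -a | grep etcd
# Then inspect the application container's logs, using whatever ID the listing shows:
# sudo docker logs <etcd-container-id>
```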
E0620 06:52:31.290332 2894 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "192.168.101.241" not found
I0620 06:52:36.475720 2894 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
I0620 06:52:36.533373 2894 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.101.241
I0620 06:52:36.533436 2894 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.101.241
I0620 06:52:36.533451 2894 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.101.241
I0620 06:52:36.533483 2894 kubelet_node_status.go:82] Attempting to register node 192.168.101.241
E0620 06:52:41.290606 2894 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "192.168.101.241" not found
E0620 06:52:45.551364 2894 kubelet_node_status.go:106] Unable to register node "192.168.101.241" with API server: Unable to refresh the Webhook configuration: the server was unable to return a response in the time allotted, but may still be processing the request (get mutatingwebhookconfigurations.admissionregistration.k8s.io)
I0620 06:52:46.533153 2894 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
I0620 06:52:46.589720 2894 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.101.241
I0620 06:52:46.589766 2894 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.101.241
I0620 06:52:46.589783 2894 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.101.241
E0620 06:52:47.575051 2894 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"192.168.101.241.161a211de50109bb", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"192.168.101.241", UID:"192.168.101.241", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeHasSufficientDisk", Message:"Node 192.168.101.241 status is now: NodeHasSufficientDisk", Source:v1.EventSource{Component:"kubelet", Host:"192.168.101.241"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfb37ca19e524dbb, ext:5998866895, loc:(*time.Location)(0xab3aa60)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfb38a68622bb171, ext:14113063445387, loc:(*time.Location)(0xab3aa60)}}, Count:2157, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within allowed duration' (will not retry!)
E0620 06:52:51.290866 2894 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "192.168.101.241" not found
I0620 06:52:52.551622 2894 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
I0620 06:52:52.592016 2894 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.101.241
I0620 06:52:52.592077 2894 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.101.241
I0620 06:52:52.592101 2894 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.101.241
I0620 06:52:52.592137 2894 kubelet_node_status.go:82] Attempting to register node 192.168.101.241
E0620 06:53:01.291156 2894 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "192.168.101.241" not found
E0620 06:53:01.609558 2894 kubelet_node_status.go:106] Unable to register node "192.168.101.241" with API server: Unable to refresh the Webhook configuration: the server was unable to return a response in the time allotted, but may still be processing the request (get mutatingwebhookconfigurations.admissionregistration.k8s.io)
I0620 06:53:08.609824 2894 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
I0620 06:53:08.665122 2894 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.101.241
I0620 06:53:08.665197 2894 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.101.241
I0620 06:53:08.665219 2894 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.101.241
I0620 06:53:08.665246 2894 kubelet_node_status.go:82] Attempting to register node 192.168.101.241
E0620 06:53:11.291412 2894 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "192.168.101.241" not found
E0620 06:53:17.682489 2894 kubelet_node_status.go:106] Unable to register node "192.168.101.241" with API server: Unable to refresh the Webhook configuration: the server was unable to return a response in the time allotted, but may still be processing the request (get mutatingwebhookconfigurations.admissionregistration.k8s.io)
E0620 06:53:21.291740 2894 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "192.168.101.241" not found
I0620 06:53:24.682678 2894 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
I0620 06:53:24.736275 2894 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.101.241
I0620 06:53:24.736316 2894 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.101.241
I0620 06:53:24.736332 2894 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.101.241
I0620 06:53:24.736362 2894 kubelet_node_status.go:82] Attempting to register node 192.168.101.241
E0620 06:53:28.578268 2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
E0620 06:53:28.579241 2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:482: Failed to list *v1.Node: the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
E0620 06:53:28.580317 2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:473: Failed to list *v1.Service: the server was unable to return a response in the time allotted, but may still be processing the request (get services)
@ydye, is it possible that the etcd container is not running, and that this caused the error?
I checked the logs of the etcd container and got the following:

2020-06-20 07:43:36.830391 I | etcdmain: etcd Version: 3.2.17
2020-06-20 07:43:36.830465 I | etcdmain: Git SHA: 28c47bb2f
2020-06-20 07:43:36.830488 I | etcdmain: Go Version: go1.8.7
2020-06-20 07:43:36.830495 I | etcdmain: Go OS/Arch: linux/amd64
2020-06-20 07:43:36.830502 I | etcdmain: setting maximum number of CPUs to 12, total number of available CPUs is 12
2020-06-20 07:43:36.830562 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-06-20 07:43:36.830637 I | embed: listening for peers on http://0.0.0.0:2380
2020-06-20 07:43:36.830683 I | embed: listening for client requests on 0.0.0.0:4001
2020-06-20 07:43:36.833184 I | etcdserver: recovered store from snapshot at index 9100091
2020-06-20 07:43:36.837011 I | mvcc: restore compact to 8869249
panic: runtime error: index out of range
goroutine 1 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Bucket).pageNode(0xc420053f40, 0x613123d463333321, 0x7f78d93d7000, 0x0)
/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20
I found the root cause: it was the etcd container failure. I removed the directory '/var/etcd/data' and ran k8s-bootup again, and now it's OK. I think you may have forgotten to clean up the etcd data in your k8s-clean scripts.
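For anyone hitting the same panic, the recovery described above is roughly the following sketch. The k8s-clean and k8s-bootup subcommands are the paictl commands already mentioned in this thread, and the data path comes from etcd-data-path in kubernetes-configuration.yaml, so adjust it if yours differs:

```bash
# On the dev-box: tear down the broken k8s deployment
python paictl.py cluster k8s-clean -p ./pai-config

# On the master node: remove the corrupted etcd data that k8s-clean left behind.
# WARNING: this wipes all cluster state stored in etcd.
sudo rm -rf /var/etcd/data

# On the dev-box: bring the cluster back up
python paictl.py cluster k8s-bootup -p ./pai-config
```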
Organization Name: EduCloud
Short summary about the issue/question: Kubernetes cluster bootup failed with the response "kubectl ready test failed. Exit paictl."
Brief what process you are following: This error occurred when I booted up the kubernetes cluster: python paictl.py cluster k8s-bootup -p ./pai-config
I checked pai/deployment/k8sPaiLibrary/maintainlib/kubectl_install.py and found that it executes the script 'kubectl_install.sh'. The curl -OL download of kubectl from googleapis failed due to a GFW issue, so I downloaded kubectl myself and put it on the dev-box. Thus the kubectl check responded 'kubectl exists and skipped'. However, it then failed when executing 'kubectl get nodes'.
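The manual workaround was roughly the following. The storage.googleapis.com URL is the standard kubectl release location (whether it matches kubectl_install.sh exactly is an assumption), v1.9.9 is chosen only to match the hyperkube version configured above, and the download has to happen somewhere that can actually reach googleapis:

```bash
# Download kubectl on a machine that can reach storage.googleapis.com, then copy it to the dev-box
curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.9.9/bin/linux/amd64/kubectl
chmod +x kubectl
sudo mv kubectl /usr/local/bin/kubectl

# On the dev-box, verify it can reach the API server
kubectl get nodes
```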
OpenPAI Environment: Ubuntu 16.04 (uname -a)
Anything else we need to know: