nwcdheap / kops-cn

Fast kops deployment of Kubernetes clusters in the AWS China Ningxia and Beijing regions
Apache License 2.0

apiserver status becomes Exited around 30 seconds after starting up #77

Closed · TAM-Alex closed this 5 years ago

TAM-Alex commented 5 years ago

Please note: kops-cn is an open-source project that helps users deploy kops more easily in the AWS China Beijing and Ningxia regions. kops-cn makes no intrusive changes to the upstream kops source code and stays aligned with upstream kops releases, so most functional problems encountered with kops-cn also exist in the upstream kops project. Before filing an issue, please make sure to search the upstream kops project for the same problem. Issues inherent to kops itself cannot be resolved here; if it is a kops issue, please file it against the upstream kops project instead.

If you are certain this issue is specific to kops-cn and unrelated to upstream kops, please fill in the information below to help us locate the problem and improve the project, and provide screenshots where possible.

1. What kops version are you running? (The command kops version will display this information.)
   1.12.1
2. What Kubernetes version are you running? (kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.)
   v1.12.7
3. What AWS region are you using (Beijing or Ningxia)?
   Ningxia
4. What commands did you run? What is the simplest way to reproduce this issue?
   Followed the instructions exactly.
5. What happened after the commands executed?
   After step 6, the apiserver container on the master node changes to Exited status about 30 seconds after starting. The ELB then shows the master nodes as OutOfService.
6. What did you expect to happen?
   The apiserver container should start up successfully.
7. Please provide the content of your Makefile and how you ran the make command. (You may want to remove your cluster name and other sensitive information.) Like the following:

# customize the values below
TARGET_REGION ?= cn-northwest-1
AWS_PROFILE ?= default
KOPS_STATE_STORE ?= s3://alexkops
VPCID ?= vpc-01f4b2b1728bb4b82
#VPCID ?= vpc-0654ec2e460225d14
MASTER_COUNT ?= 3
MASTER_SIZE ?= m4.large
NODE_SIZE ?= c5.large
NODE_COUNT ?= 2
SSH_PUBLIC_KEY ?= /usr/alex/NingXia.pub
KUBERNETES_VERSION ?= v1.12.7
KOPS_VERSION ?= 1.12.1

8. Anything else do we need to know?

docker ps

CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
2f73be463162  34a1d991f46e  "/bin/sh -c 'mkfifo …"  About a minute ago  Exited (255) 43 seconds ago  k8s_kube-apiserver_kube-apiserver-ip-10-0-54-118.cn-northwest-1.compute.internal_kube-system_0aa53626ca09062412f5f6a8c4c41c06_7
8a0c7a2640e8  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/kopeio-etcd-manager  "/bin/sh -c 'mkfifo …"  13 minutes ago  Up 13 minutes  k8s_etcd-manager_etcd-manager-main-ip-10-0-54-118.cn-northwest-1.compute.internal_kube-system_f8f7dd3aaeffafaeebe693e1029de01e_0
f840fdb341d5  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/kopeio-etcd-manager  "/bin/sh -c 'mkfifo …"  13 minutes ago  Up 13 minutes  k8s_etcd-manager_etcd-manager-events-ip-10-0-54-118.cn-northwest-1.compute.internal_kube-system_69c79b4dca319f9a6fdff399f8caeba9_0
cef3611aa893  97863dc56aaf  "/bin/sh -c 'mkfifo …"  14 minutes ago  Up 14 minutes  k8s_kube-proxy_kube-proxy-ip-10-0-54-118.cn-northwest-1.compute.internal_kube-system_82dcd821c9fb02e30dc2519da1b893cc_0
9bfefeade4c8  dae337b0c935  "/bin/sh -c 'mkfifo …"  14 minutes ago  Up 14 minutes  k8s_kube-controller-manager_kube-controller-manager-ip-10-0-54-118.cn-northwest-1.compute.internal_kube-system_27d9f630ca16278b2a05e360b4281e1d_0
35d5c56eaa3b  63458f2e089c  "/bin/sh -c 'mkfifo …"  14 minutes ago  Up 14 minutes  k8s_kube-scheduler_kube-scheduler-ip-10-0-54-118.cn-northwest-1.compute.internal_kube-system_b56efd0787354f4657bf20ac0ad54dc0_0
88c346d54668  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/pause-amd64:3.0  "/pause"  14 minutes ago  Up 14 minutes  k8s_POD_kube-apiserver-ip-10-0-54-118.cn-northwest-1.compute.internal_kube-system_0aa53626ca09062412f5f6a8c4c41c06_0
b5efc7bfda91  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/pause-amd64:3.0  "/pause"  14 minutes ago  Up 14 minutes  k8s_POD_etcd-manager-main-ip-10-0-54-118.cn-northwest-1.compute.internal_kube-system_f8f7dd3aaeffafaeebe693e1029de01e_0
4826bff24341  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/pause-amd64:3.0  "/pause"  14 minutes ago  Up 14 minutes  k8s_POD_etcd-manager-events-ip-10-0-54-118.cn-northwest-1.compute.internal_kube-system_69c79b4dca319f9a6fdff399f8caeba9_0
e472cc8533c9  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/pause-amd64:3.0  "/pause"  14 minutes ago  Up 14 minutes  k8s_POD_kube-proxy-ip-10-0-54-118.cn-northwest-1.compute.internal_kube-system_82dcd821c9fb02e30dc2519da1b893cc_0
59d28965a91c  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/pause-amd64:3.0  "/pause"  14 minutes ago  Up 14 minutes  k8s_POD_kube-controller-manager-ip-10-0-54-118.cn-northwest-1.compute.internal_kube-system_27d9f630ca16278b2a05e360b4281e1d_0
7a98194860b9  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/pause-amd64:3.0  "/pause"  14 minutes ago  Up 14 minutes  k8s_POD_kube-scheduler-ip-10-0-54-118.cn-northwest-1.compute.internal_kube-system_b56efd0787354f4657bf20ac0ad54dc0_0
c89ec89c56df  protokube:1.12.1  "/usr/bin/protokube …"  14 minutes ago  Up 14 minutes  modest_meninsky

Checked the docker logs and found the following error:

F0528 03:41:42.636294 1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry [https://127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt true true 1000 0xc420157560 5m0s 1m0s}), err (context deadline exceeded)

Port 4001 is open:

core@ip-10-0-54-118 ~ $ sudo netstat -anlp | grep 4001
tcp    0    0  10.0.54.118:51592  10.0.93.102:4001   FIN_WAIT2    -
tcp    0    0  10.0.54.118:46836  10.0.93.102:4001   FIN_WAIT2    -
tcp    0    0  10.0.54.118:42492  10.0.54.118:4001   FIN_WAIT2    -
tcp    0    0  10.0.54.118:47336  10.0.126.88:4001   FIN_WAIT2    -
tcp    0    0  10.0.54.118:32998  10.0.54.118:4001   FIN_WAIT2    -
tcp    0    0  10.0.54.118:56318  10.0.93.102:4001   FIN_WAIT2    -
tcp    0    0  10.0.54.118:52066  10.0.126.88:4001   ESTABLISHED  2582/etcd-manager
tcp    0    0  10.0.54.118:37748  10.0.54.118:4001   FIN_WAIT2    -
tcp    0    0  10.0.54.118:42590  10.0.126.88:4001   FIN_WAIT2    -
tcp6   83   0  :::4001            :::*               LISTEN       2714/etcd
tcp6   251  0  10.0.54.118:4001   10.0.54.118:33770  CLOSE_WAIT   -
tcp6   156  0  127.0.0.1:4001     127.0.0.1:44408    CLOSE_WAIT   -
tcp6   251  0  10.0.54.118:4001   10.0.54.118:39088  CLOSE_WAIT   -
tcp6   251  0  10.0.54.118:4001   10.0.54.118:37748  CLOSE_WAIT   -
tcp6   251  0  10.0.54.118:4001   10.0.54.118:43136  CLOSE_WAIT   -
tcp6   251  0  10.0.54.118:4001   10.0.54.118:50436  CLOSE_WAIT   -
tcp6   251  0  10.0.54.118:4001   10.0.54.118:34258  CLOSE_WAIT   -
tcp6   251  0  10.0.54.118:4001   10.0.54.118:58278  CLOSE_WAIT   -
tcp6   156  0  127.0.0.1:4001     127.0.0.1:60398    CLOSE_WAIT   -
tcp6   251  0  10.0.54.118:4001   10.0.54.118:48168  CLOSE_WAIT   -
tcp6   251  0  10.0.54.118:4001   10.0.54.118:58110  CLOSE_WAIT   -
tcp6   251  0  10.0.54.118:4001   10.0.54.118:58790  CLOSE_WAIT   -
tcp6   0    0  127.0.0.1:10252    127.0.0.1:40010    TIME_WAIT    -
tcp6   251  0  10.0.54.118:4001   10.0.54.118:47430  CLOSE_WAIT   -

I checked /var/log/etcd.log and found the following information:

2019-05-28 09:26:28.581488 I | etcdmain: rejected connection from "10.0.126.88:46250" (error "remote error: tls: bad certificate", ServerName "etcd-a.internal.cluster.zhy.k8s.local")

However, in another working cluster we see similar messages as well:

2019-05-28 09:57:03.705609 I | etcdmain: rejected connection from "10.0.113.26:47312" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-peers-ca-main\")", ServerName "etcd-a.internal.cluster.zhy.k8s.local")

Checked the CA and it seems to work:

ip-10-0-54-65 kube-apiserver # openssl verify -CAfile etcd-ca.crt etcd-client.crt
etcd-client.crt: OK
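The chain verifies locally, so one more end-to-end check is to probe etcd directly with the same client credentials. This is only a sketch (it assumes curl is present on the host and uses etcd's standard /health endpoint; the certificate paths and the https://127.0.0.1:4001 endpoint are the ones from the apiserver error above):

sudo curl --cacert /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt \
     --cert /etc/kubernetes/pki/kube-apiserver/etcd-client.crt \
     --key /etc/kubernetes/pki/kube-apiserver/etcd-client.key \
     https://127.0.0.1:4001/health

If this also hangs or fails the TLS handshake, the problem is between the apiserver's client certificate and what etcd trusts, which would match the rejected-connection lines in etcd.log.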

Additionally, the customer and I ran exactly the same test with two AWS accounts (both with Admin privileges). My cluster works, yet the customer's does not.

ectd_KubeAPIServer_log.zip

pahud commented 5 years ago

Did you create the cluster with make create-cluster?

After the final make update-cluster, what does make validate-cluster report?

Also, please provide the full content of your Makefile (if you modified it).

For example:

# customize the values below
TARGET_REGION ?= cn-northwest-1
AWS_PROFILE ?= default
KOPS_STATE_STORE ?= s3://pahud-kops-state-store-zhy
VPCID ?= vpc-bb3e99d2
#VPCID ?= vpc-0654ec2e460225d14
MASTER_COUNT ?= 3
MASTER_SIZE ?= m4.large
NODE_SIZE ?= c5.large
NODE_COUNT ?= 2
SSH_PUBLIC_KEY ?= ~/.ssh/id_rsa.pub
KUBERNETES_VERSION ?= v1.12.7
KOPS_VERSION ?= 1.12.1

# do not modify following values
AWS_DEFAULT_REGION ?= $(TARGET_REGION)
AWS_REGION ?= $(AWS_DEFAULT_REGION)
ifeq ($(TARGET_REGION) ,cn-north-1)
    CLUSTER_NAME ?= cluster.bjs.k8s.local
    AMI ?= ami-09b54790f727ac576
    ZONES ?= cn-north-1a,cn-north-1b
endif

ifeq ($(TARGET_REGION) ,cn-northwest-1)
    CLUSTER_NAME ?= cluster.zhy.k8s.local
    AMI ?= ami-0cb93c9d844de0c18
    ZONES ?= cn-northwest-1a,cn-northwest-1b,cn-northwest-1c
endif

ifdef CUSTOM_CLUSTER_NAME
    CLUSTER_NAME = $(CUSTOM_CLUSTER_NAME)
endif

KUBERNETES_VERSION_URI ?= "https://s3.cn-north-1.amazonaws.com.cn/kubernetes-release/release/$(KUBERNETES_VERSION)"

.PHONY: create-cluster
create-cluster:
    @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
    AWS_PROFILE=$(AWS_PROFILE) \
    AWS_REGION=$(AWS_REGION) \
    AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
    kops create cluster \
     --cloud=aws \
     --name=$(CLUSTER_NAME) \
     --image=$(AMI) \
     --zones=$(ZONES) \
     --master-count=$(MASTER_COUNT) \
     --master-size=$(MASTER_SIZE) \
     --node-count=$(NODE_COUNT) \
     --node-size=$(NODE_SIZE)  \
     --vpc=$(VPCID) \
     --kubernetes-version=$(KUBERNETES_VERSION_URI) \
     --networking=amazon-vpc-routed-eni \
     --ssh-public-key=$(SSH_PUBLIC_KEY)

 #--subnets=subnet-0694ca9e79cc3cfb6,subnet-03a0e3db1d77db089,subnet-050da82a687ff4968 \

.PHONY: edit-ig-nodes
edit-ig-nodes:
    @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
    AWS_PROFILE=$(AWS_PROFILE) \
    AWS_REGION=$(AWS_REGION) \
    AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
    kops edit ig --name=$(CLUSTER_NAME) nodes

.PHONY: edit-cluster
edit-cluster:
    @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
    AWS_PROFILE=$(AWS_PROFILE) \
    AWS_REGION=$(AWS_REGION) \
    AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
    kops edit cluster $(CLUSTER_NAME)

.PHONY: update-cluster
update-cluster:
    @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
    AWS_PROFILE=$(AWS_PROFILE) \
    AWS_REGION=$(AWS_REGION) \
    AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
    kops update cluster $(CLUSTER_NAME) --yes

.PHONY: validate-cluster
validate-cluster:
    @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
    AWS_PROFILE=$(AWS_PROFILE) \
    AWS_REGION=$(AWS_REGION) \
    AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
    kops validate cluster

.PHONY: delete-cluster
delete-cluster:
    @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
    AWS_PROFILE=$(AWS_PROFILE) \
    AWS_REGION=$(AWS_REGION) \
    AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
    kops delete cluster --name $(CLUSTER_NAME) --yes

.PHONY: rolling-update-cluster
rolling-update-cluster:
    @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
    AWS_PROFILE=$(AWS_PROFILE) \
        AWS_REGION=$(AWS_REGION) \
        AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
        kops rolling-update cluster --name $(CLUSTER_NAME) --yes --cloudonly

.PHONY: get-cluster
get-cluster:
    @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
        AWS_PROFILE=$(AWS_PROFILE) \
        AWS_REGION=$(AWS_REGION) \
        AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
        kops get cluster --name $(CLUSTER_NAME)

TAM-Alex commented 5 years ago

Yes, the cluster was created with make create-cluster.

The full Makefile is as follows:

[root@ip-192-168-0-47 kops-cn-master]# cat Makefile

# customize the values below
TARGET_REGION ?= cn-northwest-1
AWS_PROFILE ?= default
KOPS_STATE_STORE ?= s3://frsh-kops-test
VPCID ?= vpc-0b0bd0c1a16d5310c
#VPCID ?= vpc-0654ec2e460225d14
MASTER_COUNT ?= 3
MASTER_SIZE ?= m4.large
NODE_SIZE ?= c5.large
NODE_COUNT ?= 2
SSH_PUBLIC_KEY ?= ~/.ssh/id_rsa.pub
KUBERNETES_VERSION ?= v1.12.7
KOPS_VERSION ?= 1.12.1

# do not modify following values
AWS_DEFAULT_REGION ?= $(TARGET_REGION)
AWS_REGION ?= $(AWS_DEFAULT_REGION)
ifeq ($(TARGET_REGION) ,cn-north-1)
        CLUSTER_NAME ?= cluster.bjs.k8s.local
        AMI ?= ami-09b54790f727ac576
        ZONES ?= cn-north-1a,cn-north-1b
endif

ifeq ($(TARGET_REGION) ,cn-northwest-1)
        CLUSTER_NAME ?= cluster.zhy.k8s.local
        AMI ?= ami-0cb93c9d844de0c18
        ZONES ?= cn-northwest-1a,cn-northwest-1b,cn-northwest-1c
endif

ifdef CUSTOM_CLUSTER_NAME
        CLUSTER_NAME = $(CUSTOM_CLUSTER_NAME)
endif

KUBERNETES_VERSION_URI ?= "https://s3.cn-north-1.amazonaws.com.cn/kubernetes-release/release/$(KUBERNETES_VERSION)"

.PHONY: create-cluster
create-cluster:
        @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
        AWS_PROFILE=$(AWS_PROFILE) \
        AWS_REGION=$(AWS_REGION) \
        AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
        kops create cluster \
     --cloud=aws \
     --name=$(CLUSTER_NAME) \
     --image=$(AMI) \
     --zones=$(ZONES) \
     --master-count=$(MASTER_COUNT) \
     --master-size=$(MASTER_SIZE) \
     --node-count=$(NODE_COUNT) \
     --node-size=$(NODE_SIZE)  \
     --vpc=$(VPCID) \
     --kubernetes-version=$(KUBERNETES_VERSION_URI) \
     --networking=amazon-vpc-routed-eni \
     --ssh-public-key=$(SSH_PUBLIC_KEY)

 #--subnets=subnet-0694ca9e79cc3cfb6,subnet-03a0e3db1d77db089,subnet-050da82a687ff4968 \

.PHONY: edit-ig-nodes
edit-ig-nodes:
        @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
        AWS_PROFILE=$(AWS_PROFILE) \
        AWS_REGION=$(AWS_REGION) \
        AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
        kops edit ig --name=$(CLUSTER_NAME) nodes

.PHONY: edit-cluster
edit-cluster:
        @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
        AWS_PROFILE=$(AWS_PROFILE) \
        AWS_REGION=$(AWS_REGION) \
        AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
        kops edit cluster $(CLUSTER_NAME)

.PHONY: update-cluster
update-cluster:
        @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
        AWS_PROFILE=$(AWS_PROFILE) \
        AWS_REGION=$(AWS_REGION) \
        AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
        kops update cluster $(CLUSTER_NAME) --yes

.PHONY: validate-cluster
validate-cluster:
        @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
        AWS_PROFILE=$(AWS_PROFILE) \
        AWS_REGION=$(AWS_REGION) \
        AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
        kops validate cluster

.PHONY: delete-cluster
delete-cluster:
        @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
        AWS_PROFILE=$(AWS_PROFILE) \
        AWS_REGION=$(AWS_REGION) \
        AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
        kops delete cluster --name $(CLUSTER_NAME) --yes

.PHONY: rolling-update-cluster
rolling-update-cluster:
        @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
        AWS_PROFILE=$(AWS_PROFILE) \
        AWS_REGION=$(AWS_REGION) \
        AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
        kops rolling-update cluster --name $(CLUSTER_NAME) --yes --cloudonly

.PHONY: get-cluster
get-cluster:
        @KOPS_STATE_STORE=$(KOPS_STATE_STORE) \
        AWS_PROFILE=$(AWS_PROFILE) \
        AWS_REGION=$(AWS_REGION) \
        AWS_DEFAULT_REGION=$(AWS_DEFAULT_REGION) \
        kops get cluster --name $(CLUSTER_NAME)

[root@ip-192-168-0-47 kops-cn-master]#

The output of make validate-cluster is as follows:

[root@ip-192-168-0-47 kops-cn-master]# make validate-cluster
Using cluster from kubectl context: cluster.zhy.k8s.local

Validating cluster cluster.zhy.k8s.local

unexpected error during validation: error listing nodes: Get https://api-cluster-zhy-k8s-local-qpbf7n-1438908736.cn-northwest-1.elb.amazonaws.com.cn/api/v1/nodes: dial tcp 52.82.86.48:443: i/o timeout
make: *** [validate-cluster] Error 1
[root@ip-192-168-0-47 kops-cn-master]#
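Since the i/o timeout is against the ELB address, a direct probe of that endpoint helps separate an ELB/backend problem from a local network problem. A sketch (the URL is copied from the validate output above; when the backends are healthy, an unauthenticated request should come back quickly with a 401/403 rather than time out):

curl -vk --max-time 10 https://api-cluster-zhy-k8s-local-qpbf7n-1438908736.cn-northwest-1.elb.amazonaws.com.cn/api/v1/nodes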

Also, on one of the master nodes the apiserver container is not running:

core@ip-10-0-126-88 ~ $ docker ps
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAME
49b60ee6b3f6  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/kopeio-etcd-manager  "/bin/sh -c 'mkfifo …"  21 hours ago  Up 21 hours  k8s
469936ff13fe  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/kopeio-etcd-manager  "/bin/sh -c 'mkfifo …"  21 hours ago  Up 21 hours  k8s
7c8ea37cd32e  dae337b0c935  "/bin/sh -c 'mkfifo …"  21 hours ago  Up 21 hours  k8s
a828906ba1c6  63458f2e089c  "/bin/sh -c 'mkfifo …"  21 hours ago  Up 21 hours  k8s
1e7a9ce5215f  97863dc56aaf  "/bin/sh -c 'mkfifo …"  21 hours ago  Up 21 hours  k8s
d73d71543f4f  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/pause-amd64:3.0  "/pause"  21 hours ago  Up 21 hours  k8s
52fdb276f48e  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/pause-amd64:3.0  "/pause"  21 hours ago  Up 21 hours  k8s
bfd49aa2862f  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/pause-amd64:3.0  "/pause"  21 hours ago  Up 21 hours  k8s
ca5c98d766be  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/pause-amd64:3.0  "/pause"  21 hours ago  Up 21 hours  k8s
f46f77f6e334  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/pause-amd64:3.0  "/pause"  21 hours ago  Up 21 hours  k8s
4fc931c34bf7  937788672844.dkr.ecr.cn-north-1.amazonaws.com.cn/pause-amd64:3.0  "/pause"  21 hours ago  Up 21 hours  k8s_
8d2c65e618d2  protokube:1.12.1  "/usr/bin/protokube …"  21 hours ago  Up 21 hours  jovi
core@ip-10-0-126-88 ~ $
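docker ps only lists running containers, so a quick way to confirm whether the kube-apiserver container exists but keeps exiting is something like the following (a sketch; the name filter follows the k8s_kube-apiserver_... naming visible in the earlier docker ps output):

# show the apiserver container even if it has exited, with its current status
docker ps -a --filter name=k8s_kube-apiserver --format 'table {{.ID}}\t{{.Status}}\t{{.Names}}'
# and dump the tail of its log for the most recent exit reason
docker logs --tail 50 $(docker ps -aq --filter name=k8s_kube-apiserver | head -n 1)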

TAM-Alex commented 5 years ago

Status Update:

After repeated testing, we found that the problem can be reproduced simply by logging in over SSH right after a master node starts up following make update-cluster. Conversely, if we log in some time after the master node has started, the problem does not occur.

This problem is caused by a failed etcd election. It can be worked around by setting etcd to the legacy provider and running a rolling update:

https://github.com/kubernetes/kops/blob/master/docs/etcd/manager.md
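Concretely, with the Makefile targets above, the workaround looks roughly like this (a sketch; the provider: Legacy field comes from the linked etcd manager docs and should be checked against your kops version):

# In the editor opened by the target below, set "provider: Legacy" on both
# etcdClusters entries (main and events), as described in the linked docs.
make edit-cluster
# Push the change and roll the masters so they come up in legacy etcd mode.
make update-cluster
make rolling-update-cluster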

The open question is why an SSH login while the master node is starting up causes the etcd election to fail.

pahud commented 5 years ago

After make update-cluster, you usually need to wait 5-10 minutes before running make validate-cluster.
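If you just want to poll until the cluster is ready, a simple retry loop around the existing target is enough (a sketch):

until make validate-cluster; do echo "not ready yet, retrying in 30s..."; sleep 30; done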

You can check whether all three API servers behind the ELB are InService; only run make validate-cluster once all of them are InService.
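You can also check this from the CLI instead of the console. A sketch (the load balancer name here is inferred from the api-cluster-zhy-k8s-local-qpbf7n-... DNS name in the validate output above, so substitute your own):

aws elb describe-instance-health \
    --region cn-northwest-1 \
    --load-balancer-name api-cluster-zhy-k8s-local-qpbf7n \
    --query 'InstanceStates[].[InstanceId,State]' \
    --output table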

I tested in the Ningxia region with no problems at all, and there was no need to SSH into the masters.

(screenshots attached)

pahud commented 5 years ago

If your instances never reach InService,

you can SSH into a master and run

$ sudo journalctl -f

to check for any abnormal output.

Under normal circumstances you should see messages like this:

(screenshot attached)
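If the full journal is too noisy, narrowing it to the kubelet unit (which is what restarts the apiserver static pod on a kops master) is a reasonable first filter, for example:

sudo journalctl -u kubelet -f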