kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

GCP Update controlPlane Version Fails #182

Closed ichekrygin closed 6 years ago

ichekrygin commented 6 years ago

Overview

Updating the controlPlane value fails and crashes clusterapi.

Repro Steps

  1. Create a new Kubernetes cluster on GCE via gcp-deployer create -c cluster.yaml -m machines.yaml -s machine_setup_configs.yaml, following CONTRIBUTING.md
    I0516 15:10:33.474771   10301 deploy_helper.go:58] Starting cluster dependency creation cluster-api-
    ...
    I0516 15:14:49.467724   10301 deploy.go:85] The [cluster-api-test] cluster has been created successfully!
    I0516 15:14:49.467740   10301 deploy.go:86] You can now `kubectl get nodes`
  2. Verify cluster nodes
    kubectl get nodes
    NAME                          STATUS    ROLES     AGE       VERSION
    gce-master-cluster-api-test   Ready     master    5m        v1.9.4
    gce-node-vhjpt                Ready     <none>    1m        v1.9.4
  3. Verify clusterapi machines
    kubectl get machines
    NAME                          AGE
    gce-master-cluster-api-test   4m
    gce-node-vhjpt                4m
  4. Edit the gce-master-cluster-api-test machine definition via kubectl edit machines/gce-master-cluster-api-test --validate=false and update controlPlane from 1.9.4 to 1.9.7
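
Step 4 could also be done non-interactively, which makes the repro easier to script. The sketch below only builds the `kubectl patch` command line and does not run it. The field path `spec.versions.controlPlane` is an assumption based on the field name used in this issue, not something confirmed in the thread, and both helper names are hypothetical.

```python
import json
import shlex


def control_plane_patch(version: str) -> str:
    """Build a JSON merge patch that sets the control plane version.

    Assumption: the Machine object stores the value under
    spec.versions.controlPlane. Adjust the path if the schema differs.
    """
    return json.dumps({"spec": {"versions": {"controlPlane": version}}})


def kubectl_patch_argv(machine: str, version: str) -> list[str]:
    """Return the argv for a kubectl merge patch of one Machine object."""
    return [
        "kubectl", "patch", "machines", machine,
        "--type", "merge",
        "-p", control_plane_patch(version),
    ]


if __name__ == "__main__":
    # Print a shell-quoted command that can be reviewed before running it.
    print(shlex.join(kubectl_patch_argv("gce-master-cluster-api-test", "1.9.7")))
```

Printing the command instead of executing it keeps the sketch free of side effects. Whether a merge patch triggers the same failure as the interactive edit has not been verified here.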

Expected results

Kubernetes cluster control plane is upgraded from 1.9.4 to 1.9.7

Actual Result

The Kubernetes cluster control plane is not upgraded. The clusterapi API and controller(s) appear to be in a crashed state.

journalctl output on the master node immediately after the update at 22:19

...
May 16 22:17:11 gce-master-cluster-api-test ntpd[1492]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
May 16 22:19:49 gce-master-cluster-api-test sshd[7876]: Connection closed by 104.197.87.198 port 49546 [preauth]
May 16 22:19:49 gce-master-cluster-api-test sshd[7880]: Connection closed by 104.197.87.198 port 49548 [preauth]
May 16 22:19:50 gce-master-cluster-api-test sshd[7883]: Connection closed by 104.197.87.198 port 49550 [preauth]
May 16 22:19:50 gce-master-cluster-api-test sshd[7886]: Connection closed by 104.197.87.198 port 49552 [preauth]
May 16 22:19:50 gce-master-cluster-api-test sshguard[1304]: Blocking 104.197.87.198:4 for >630secs: 40 danger in 4 attacks over 0 seconds (all: 40d in 1 abuses over 0s).
May 16 22:20:03 gce-master-cluster-api-test kubelet[5263]: E0516 22:20:03.450557    5263 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "gce-master-cluster-api-test": Get https://104.197.87.198:443/api/v1/nodes/gce-master-cluster-api-test?resourceVersion=0: net/http: request canceled (Client
May 16 22:20:13 gce-master-cluster-api-test kubelet[5263]: E0516 22:20:13.450898    5263 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "gce-master-cluster-api-test": Get https://104.197.87.198:443/api/v1/nodes/gce-master-cluster-api-test: net/http: request canceled (Client.Timeout exceeded 
May 16 22:20:23 gce-master-cluster-api-test kubelet[5263]: E0516 22:20:23.451240    5263 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "gce-master-cluster-api-test": Get https://104.197.87.198:443/api/v1/nodes/gce-master-cluster-api-test: net/http: request canceled (Client.Timeout exceeded 
May 16 22:20:33 gce-master-cluster-api-test kubelet[5263]: E0516 22:20:33.452118    5263 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "gce-master-cluster-api-test": Get https://104.197.87.198:443/api/v1/nodes/gce-master-cluster-api-test: net/http: request canceled (Client.Timeout exceeded 
May 16 22:20:43 gce-master-cluster-api-test kubelet[5263]: E0516 22:20:43.452544    5263 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "gce-master-cluster-api-test": Get https://104.197.87.198:443/api/v1/nodes/gce-master-cluster-api-test: net/http: request canceled (Client.Timeout exceeded 
May 16 22:20:43 gce-master-cluster-api-test kubelet[5263]: E0516 22:20:43.452579    5263 kubelet_node_status.go:375] Unable to update node status: update node status exceeds retry count
May 16 22:21:03 gce-master-cluster-api-test kubelet[5263]: E0516 22:21:03.453322    5263 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "gce-master-cluster-api-test": Get https://104.197.87.198:443/api/v1/nodes/gce-master-cluster-api-test?resourceVersion=0: net/http: request canceled (Client
May 16 22:21:13 gce-master-cluster-api-test kubelet[5263]: E0516 22:21:13.453775    5263 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "gce-master-cluster-api-test": Get https://104.197.87.198:443/api/v1/nodes/gce-master-cluster-api-test: net/http: request canceled (Client.Timeout exceeded 
May 16 22:21:23 gce-master-cluster-api-test kubelet[5263]: E0516 22:21:23.454176    5263 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "gce-master-cluster-api-test": Get https://104.197.87.198:443/api/v1/nodes/gce-master-cluster-api-test: net/http: request canceled (Client.Timeout exceeded 
May 16 22:21:33 gce-master-cluster-api-test kubelet[5263]: E0516 22:21:33.454585    5263 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "gce-master-cluster-api-test": Get https://104.197.87.198:443/api/v1/nodes/gce-master-cluster-api-test: net/http: request canceled (Client.Timeout exceeded 
May 16 22:21:43 gce-master-cluster-api-test kubelet[5263]: E0516 22:21:43.454888    5263 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "gce-master-cluster-api-test": Get https://104.197.87.198:443/api/v1/nodes/gce-master-cluster-api-test: net/http: request canceled (Client.Timeout exceeded 
May 16 22:21:43 gce-master-cluster-api-test kubelet[5263]: E0516 22:21:43.455267    5263 kubelet_node_status.go:375] Unable to update node status: update node status exceeds retry count
May 16 22:21:57 gce-master-cluster-api-test dockerd[4800]: time="2018-05-16T22:21:57.544560440Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/f857ebd8b49e4eae6f9031eede8f16d1a76373abb54a8ab2b6cd9dd65f851be6: exec: \"auplink\": executable file not found in $PATH"
May 16 22:21:57 gce-master-cluster-api-test dockerd[4800]: time="2018-05-16T22:21:57.573475724Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/fd3f0cf8c2e7b346922da811a61d6775aee1b930b974cc966b8940cebca3d6df-init: exec: \"auplink\": executable file not found in $PATH"
May 16 22:21:57 gce-master-cluster-api-test dockerd[4800]: time="2018-05-16T22:21:57.603998318Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/fd3f0cf8c2e7b346922da811a61d6775aee1b930b974cc966b8940cebca3d6df: exec: \"auplink\": executable file not found in $PATH"
May 16 22:21:57 gce-master-cluster-api-test dockerd[4800]: time="2018-05-16T22:21:57.732588705Z" level=warning msg="Unknown healthcheck type 'NONE' (expected 'CMD') in container 344c122adda517044878cff95b559a2720be773f366b5f3365e6897fa94f2d30"
...
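
One possible reading of the journal, not confirmed anywhere in this thread, is that sshguard blocked 104.197.87.198, the same address the kubelet uses to reach the API server, shortly before the node status updates started timing out. The sketch below checks a log for that pattern: it counts kubelet request errors against an IP that sshguard has already blocked. The regular expressions and the function name are illustrative assumptions fitted to the lines above, and the sample holds three of those lines copied as they appear, truncation included.

```python
import re

# journalctl short format: "May 16 22:19:50 host proc[pid]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s"
    r"(?P<proc>[\w.-]+)\[\d+\]:\s(?P<msg>.*)$"
)
BLOCK_RE = re.compile(r"Blocking (?P<ip>\d+\.\d+\.\d+\.\d+)")
GET_RE = re.compile(r"Get https://(?P<ip>\d+\.\d+\.\d+\.\d+):\d+/")

SAMPLE = [
    "May 16 22:19:50 gce-master-cluster-api-test sshguard[1304]: "
    "Blocking 104.197.87.198:4 for >630secs: 40 danger in 4 attacks "
    "over 0 seconds (all: 40d in 1 abuses over 0s).",
    "May 16 22:20:03 gce-master-cluster-api-test kubelet[5263]: "
    "E0516 22:20:03.450557    5263 kubelet_node_status.go:383] "
    "Error updating node status, will retry: error getting node "
    '"gce-master-cluster-api-test": Get https://104.197.87.198:443'
    "/api/v1/nodes/gce-master-cluster-api-test?resourceVersion=0: "
    "net/http: request canceled (Client",
    "May 16 22:20:43 gce-master-cluster-api-test kubelet[5263]: "
    "E0516 22:20:43.452579    5263 kubelet_node_status.go:375] "
    "Unable to update node status: update node status exceeds retry count",
]


def blocked_api_errors(lines):
    """Count kubelet request errors per IP, logged after sshguard blocked it."""
    blocked = set()
    counts = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip "..." markers and lines in other formats
        proc, msg = m["proc"], m["msg"]
        if proc == "sshguard":
            hit = BLOCK_RE.search(msg)
            if hit:
                blocked.add(hit["ip"])
        elif proc == "kubelet":
            hit = GET_RE.search(msg)
            if hit and hit["ip"] in blocked:
                counts[hit["ip"]] = counts.get(hit["ip"], 0) + 1
    return counts


if __name__ == "__main__":
    print(blocked_api_errors(SAMPLE))
```

A non-empty result only shows that the two events coincide in the log; it does not establish that the block caused the timeouts.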

rsdcastro commented 6 years ago

cc @k4leung4 @maisem

Can you comment on this issue?

k4leung4 commented 6 years ago

I was able to reproduce the issue and the error that ssh is reporting is "Host key verification failed."

I need to dig deeper to figure out what changed to cause this error.

k4leung4 commented 6 years ago

/reopen

This is still waiting on PR #247

k8s-ci-robot commented 6 years ago

@k4leung4: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/182#issuecomment-393347175):

>/reopen
>
>This is still waiting on PR #247

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k4leung4 commented 6 years ago

/assign @k4leung4

k4leung4 commented 6 years ago

/reopen

k8s-ci-robot commented 6 years ago

@k4leung4: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/182#issuecomment-393347266):

>/reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

ichekrygin commented 6 years ago

@k4leung4 FWIW: I authored it, and I cannot reopen it either ¯_(ツ)_/¯