kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Deleted kops cluster accidentally but we still have the S3 state store as the cluster deletion did not happen completely. #9600

Closed bravitejareddy closed 3 years ago

bravitejareddy commented 4 years ago

1. What kops version are you running? The command kops version will display this information.

Version 1.16.0 (git-4b0e62b82)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.8", GitCommit:"ec6eb119b81be488b030e849b9e64fda4caaf33c", GitTreeState:"clean", BuildDate:"2020-03-12T21:00:06Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.8", GitCommit:"ec6eb119b81be488b030e849b9e64fda4caaf33c", GitTreeState:"clean", BuildDate:"2020-03-12T20:52:22Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops delete cluster --state s3:// --yes

5. What happened after the commands executed?

volume:vol-08ced5119948fd17c ok
iam-role:nodes.cluster.example.com error deleting resources, will retry: error deleting IAM role "nodes.cluster.example.com": DeleteConflict: Cannot delete entity, must detach all policies first. status code: 409, request id: 0d7d6dc6-fe8f-4782-a49b-52aff26f9f93
volume:vol-02317f1bc7ba19adb ok
volume:vol-02d442633a0e6135f ok
volume:vol-02906b143bdfed2e4 ok
volume:vol-02706428ebe977abe ok
volume:vol-0365083e730ed9f54 ok
volume:vol-0d8d4e6b9b9c4944f ok
volume:vol-03a88d573ed44ace2 ok
volume:vol-0345f09f1e277afbc ok
volume:vol-0002df3288b176be3 ok
volume:vol-01c31ea254a8c635b ok
volume:vol-0c6b909c7673e41d8 ok
volume:vol-0329e77824dc40901 ok
volume:vol-0c7c0b9c8ecf8f00b ok
volume:vol-09e3280744cfa36e8 ok
volume:vol-06643dc6d0c8a14dc ok
volume:vol-0c9f89990419c7d81 ok
volume:vol-078f2516103d6bfd4 ok
volume:vol-0f9dcbee58c7b49f7 ok
volume:vol-0cb26c7546a7842ab ok
volume:vol-04c910499d7d33251 ok
volume:vol-05b17bdf3ac0656fd ok
volume:vol-0a2ce7c3826950397 ok
volume:vol-04c2c310cd9c43a12 ok
volume:vol-0ec63aa8b3473588d ok
volume:vol-046d7d7e47b4286a3 ok
volume:vol-0f42210140d163bef ok
security-group:sg-0ce50f2ed6db9db7b still has dependencies, will retry
security-group:sg-0b508c1833d2438f2 still has dependencies, will retry
volume:vol-01a3c2ec3a7b7f936 ok
security-group:sg-0b94d4cc7821a2c61 still has dependencies, will retry
security-group:sg-07f19e3cd4ac2dceb ok
volume:vol-03831eb640caaec70 ok
security-group:sg-087d6a22ee9967946 still has dependencies, will retry
security-group:sg-038bdc76183b74f7f still has dependencies, will retry
security-group:sg-03bef5f414245571c still has dependencies, will retry
security-group:sg-0415df60a5a3f66b1 still has dependencies, will retry
security-group:sg-0c92fdbb42d44bb29 still has dependencies, will retry
security-group:sg-0da9c7a709f209239 ok
volume:vol-0b077a2814fa41f87 ok
security-group:sg-0082b25217cab5a22 ok
Not all resources deleted; waiting before reattempting deletion
  security-group:sg-0415df60a5a3f66b1
  security-group:sg-0b508c1833d2438f2
  security-group:sg-03bef5f414245571c
  security-group:sg-038bdc76183b74f7f
  iam-role:nodes.cluster.example.com
  security-group:sg-0ce50f2ed6db9db7b
  route-table:rtb-a1df45d9
  security-group:sg-087d6a22ee9967946
  security-group:sg-0c92fdbb42d44bb29
  security-group:sg-0b94d4cc7821a2c61

6. What did you expect to happen?

The expectation was that the cluster would be deleted completely. However, this command was run accidentally, the deletion did not finish, and the S3 state files are still intact. Since we still have the state store for the cluster in the S3 bucket, we thought of restoring the cluster from it.

We ran the commands below, but none of them worked as expected:

kops update cluster --state s3:// --yes

kops rolling-update cluster --state s3:// --yes
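
For reference, a minimal sketch of the sequence we are attempting in order to rebuild the cluster from the intact state store (the bucket name is a placeholder, and the rolling update is only needed where instances already exist):

```
export KOPS_STATE_STORE=s3://<state-store-bucket>

kops get clusters                                      # confirm the stored cluster spec is still readable
kops update cluster cluster.example.com --yes          # recreate missing cloud resources from the stored spec
kops rolling-update cluster cluster.example.com --yes  # replace any instances built from stale config
kops validate cluster                                  # check that masters and nodes come up healthy
```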

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2019-08-22T11:08:54Z"
  generation: 20
  name: cluster.example.com
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    Component: node
    Environment: Dev
  cloudProvider: aws
  configBase: s3://kops-prefix-example-com-state-store/cluster.example.com
  dnsZone: Z1L1KLPJWQ18XZ
  etcdClusters:


apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-08-22T11:17:55Z"
  generation: 6
  labels:
    kops.k8s.io/cluster: cluster.example.com
  name: master-us-west-2a
spec:
  image: ami-0dff5bcxxxxxxxxxx
  machineType: t2.2xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2a
  role: Master
  subnets:


apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-08-22T11:08:06Z"
  generation: 5
  labels:
    kops.k8s.io/cluster: cluster.example.com
  name: nodes
spec:
  image: ami-0dff5bcxxxxxxxxxx
  machineType: t2.2xlarge
  maxSize: 8
  minSize: 8
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:


apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-08-22T11:34:39Z"
  generation: 3
  labels:
    kops.k8s.io/cluster: cluster.example.com
  name: logging
spec:
  image: ami-0dff5bcxxxxxxxxxx
  machineType: m5.2xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    app: logging
    kops.k8s.io/instancegroup: logging
  role: Node
  subnets:

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

ravi@:~/.kube$ kops validate cluster --state s3://k8s-state-store
Using cluster from kubectl context: cluster.example.com

Validating cluster cluster.example.com

unexpected error during validation: error listing nodes: Get https://api.cluster.example.com/api/v1/nodes: EOF


The API server docker container inside the master node crashes.

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ae338222c8e3 48db9392345b "/usr/local/bin/kube…" 4 minutes ago Exited (2) 4 minutes ago k8s_kube-apiserver_kube-apiserver-ip-172-31-105-229.us-west-2.compute.internal_kube-system_5e4f97006769a8b418c43b71fd2522ab_1743
3b56ad4901a2 b09b8cd2bebc "/bin/sh -c 'mkfifo …" 6 days ago Up 6 days k8s_etcd-manager_etcd-manager-main-ip-172-31-105-229.us-west-2.compute.internal_kube-system_1f273b069af38ea797d4a9d2b7b58044_1
36710f51905c 01aec835c89f "/usr/local/bin/kube…" 6 days ago Up 6 days k8s_kube-controller-manager_kube-controller-manager-ip-172-31-105-229.us-west-2.compute.internal_kube-system_a426653dfb4f3c931fb4a74f3078c9fc_1
90d17454d5a3 133a50b2b327 "/usr/local/bin/kube…" 6 days ago Up 6 days k8s_kube-scheduler_kube-scheduler-ip-172-31-105-229.us-west-2.compute.internal_kube-system_2096e4a5044b24b3acabe5b05cd067ae_1
9d0aae5ca3a0 3b8ffbdbcca3 "/usr/local/bin/kube…" 6 days ago Up 6 days k8s_kube-proxy_kube-proxy-ip-172-31-105-229.us-west-2.compute.internal_kube-system_dbb640f05fde18ed813490602151b1ed_1
8d0ad8d5497f protokube:1.16.1 "/usr/bin/protokube …" 6 days ago Up 6 days stoic_leavitt
c6c385d18f45 k8s.gcr.io/pause-amd64:3.0 "/pause" 6 days ago Up 6 days k8s_POD_etcd-manager-main-ip-172-31-105-229.us-west-2.compute.internal_kube-system_1f273b069af38ea797d4a9d2b7b58044_1
4468e0b7baee k8s.gcr.io/pause-amd64:3.0 "/pause" 6 days ago Up 6 days k8s_POD_kube-scheduler-ip-172-31-105-229.us-west-2.compute.internal_kube-system_2096e4a5044b24b3acabe5b05cd067ae_1
114be278b119 k8s.gcr.io/pause-amd64:3.0 "/pause" 6 days ago Up 6 days k8s_POD_kube-controller-manager-ip-172-31-105-229.us-west-2.compute.internal_kube-system_a426653dfb4f3c931fb4a74f3078c9fc_1
18dd9182db9f k8s.gcr.io/pause-amd64:3.0 "/pause" 6 days ago Up 6 days k8s_POD_kube-proxy-ip-172-31-105-229.us-west-2.compute.internal_kube-system_dbb640f05fde18ed813490602151b1ed_1
e39dc784aa57 b09b8cd2bebc "/bin/sh -c 'mkfifo …" 6 days ago Up 6 days k8s_etcd-manager_etcd-manager-events-ip-172-31-105-229.us-west-2.compute.internal_kube-system_24eab7d0bc2450b40334ed39939b3e23_1
e4df76ea3e7c k8s.gcr.io/pause-amd64:3.0 "/pause" 6 days ago Up 6 days k8s_POD_kube-apiserver-ip-172-31-105-229.us-west-2.compute.internal_kube-system_5e4f97006769a8b418c43b71fd2522ab_1
c4f373f8ae5d k8s.gcr.io/pause-amd64:3.0 "/pause" 6 days ago Up 6 days k8s_POD_etcd-manager-events-ip-172-31-105-229.us-west-2.compute.internal_kube-system_24eab7d0bc2450b40334ed39939b3e23_1
485eb1cf4918 kopeio/etcd-manager "/bin/sh -c 'mkfifo …" 6 days ago Exited (143) 6 days ago k8s_etcd-manager_etcd-manager-events-ip-172-31-105-229.us-west-2.compute.internal_kube-system_24eab7d0bc2450b40334ed39939b3e23_0
9ce3787295a5 kopeio/etcd-manager "/bin/sh -c 'mkfifo …" 6 days ago Exited (143) 6 days ago k8s_etcd-manager_etcd-manager-main-ip-172-31-105-229.us-west-2.compute.internal_kube-system_1f273b069af38ea797d4a9d2b7b58044_0
d6ff10da4ba4 k8s.gcr.io/kube-scheduler "/usr/local/bin/kube…" 6 days ago Exited (2) 6 days ago k8s_kube-scheduler_kube-scheduler-ip-172-31-105-229.us-west-2.compute.internal_kube-system_2096e4a5044b24b3acabe5b05cd067ae_0
a49c44779d64 k8s.gcr.io/kube-controller-manager "/usr/local/bin/kube…" 6 days ago Exited (2) 6 days ago k8s_kube-controller-manager_kube-controller-manager-ip-172-31-105-229.us-west-2.compute.internal_kube-system_a426653dfb4f3c931fb4a74f3078c9fc_0
2877ec5fc30c k8s.gcr.io/kube-proxy "/usr/local/bin/kube…" 6 days ago Exited (2) 6 days ago k8s_kube-proxy_kube-proxy-ip-172-31-105-229.us-west-2.compute.internal_kube-system_dbb640f05fde18ed813490602151b1ed_0
84421c2d26b1 k8s.gcr.io/pause-amd64:3.0 "/pause" 6 days ago Exited (0) 6 days ago k8s_POD_kube-controller-manager-ip-172-31-105-229.us-west-2.compute.internal_kube-system_a426653dfb4f3c931fb4a74f3078c9fc_0
4fc6d3b1e3c3 k8s.gcr.io/pause-amd64:3.0 "/pause" 6 days ago Exited (0) 6 days ago k8s_POD_etcd-manager-events-ip-172-31-105-229.us-west-2.compute.internal_kube-system_24eab7d0bc2450b40334ed39939b3e23_0
92f9ffac2908 k8s.gcr.io/pause-amd64:3.0 "/pause" 6 days ago Exited (0) 6 days ago k8s_POD_kube-proxy-ip-172-31-105-229.us-west-2.compute.internal_kube-system_dbb640f05fde18ed813490602151b1ed_0
19dcdf8bdbff k8s.gcr.io/pause-amd64:3.0 "/pause" 6 days ago Exited (0) 6 days ago k8s_POD_kube-scheduler-ip-172-31-105-229.us-west-2.compute.internal_kube-system_2096e4a5044b24b3acabe5b05cd067ae_0
a321fc658d78 k8s.gcr.io/pause-amd64:3.0 "/pause" 6 days ago Exited (0) 6 days ago k8s_POD_etcd-manager-main-ip-172-31-105-229.us-west-2.compute.internal_kube-system_1f273b069af38ea797d4a9d2b7b58044_0
c4715c0e8e07 protokube:1.16.1 "/usr/bin/protokube …" 6 days ago Exited (143) 6 days ago condescending_chaplygin

I0720 06:07:21.795793 1 server.go:666] Initializing cache sizes based on 0MB limit
I0720 06:07:21.796015 1 server.go:149] Version: v1.16.8
W0720 06:07:22.524493 1 admission.go:76] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I0720 06:07:22.524967 1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook,RuntimeClass.
I0720 06:07:22.524983 1 plugins.go:161] Loaded 7 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,RuntimeClass,ResourceQuota.
W0720 06:07:22.525780 1 admission.go:76] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I0720 06:07:22.526428 1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook,RuntimeClass.
I0720 06:07:22.526452 1 plugins.go:161] Loaded 7 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,RuntimeClass,ResourceQuota.
I0720 06:07:22.528895 1 client.go:357] parsed scheme: "endpoint"
I0720 06:07:22.528938 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:4001 0  <nil>}]
W0720 06:07:22.529247 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
I0720 06:07:23.524033 1 client.go:357] parsed scheme: "endpoint"
I0720 06:07:23.524075 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:4001 0  <nil>}]
W0720 06:07:23.524424 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0720 06:07:23.529621 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0720 06:07:24.524951 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0720 06:07:25.177622 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0720 06:07:26.120850 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0720 06:07:28.219847 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0720 06:07:28.970805 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0720 06:07:32.256519 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0720 06:07:32.295703 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0720 06:07:38.303697 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0720 06:07:38.489353 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
panic: context deadline exceeded

goroutine 1 [running]:
k8s.io/kubernetes/vendor/k8s.io/apiextensions-apiserver/pkg/registry/customresourcedefinition.NewREST(0xc0005269a0, 0x4e6da40, 0xc000458900, 0xc000458b28)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiextensions-apiserver/pkg/registry/customresourcedefinition/etcd.go:56 +0x3c1
k8s.io/kubernetes/vendor/k8s.io/apiextensions-apiserver/pkg/apiserver.completedConfig.New(0xc000b504e0, 0xc000242508, 0x4f28180, 0x73f92c8, 0x10, 0x0, 0x0)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiextensions-apiserver/pkg/apiserver/apiserver.go:147 +0x152b
k8s.io/kubernetes/cmd/kube-apiserver/app.createAPIExtensionsServer(0xc000242500, 0x4f28180, 0x73f92c8, 0x0, 0x4e6d6a0, 0xc00075cae0)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-apiserver/app/apiextensions.go:95 +0x59
k8s.io/kubernetes/cmd/kube-apiserver/app.CreateServerChain(0xc000111b80, 0xc0000f2a80, 0x43ee6c0, 0xc, 0xc000a37c48)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-apiserver/app/server.go:182 +0x2bc
k8s.io/kubernetes/cmd/kube-apiserver/app.Run(0xc000111b80, 0xc0000f2a80, 0x0, 0x0)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-apiserver/app/server.go:151 +0x101
k8s.io/kubernetes/cmd/kube-apiserver/app.NewAPIServerCommand.func1(0xc0007baa00, 0xc00051c000, 0x0, 0x24, 0x0, 0x0)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-apiserver/app/server.go:118 +0x104
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute(0xc0007baa00, 0xc0000d8010, 0x24, 0x27, 0xc0007baa00, 0xc0000d8010)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:826 +0x460
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc0007baa00, 0x162360e7d1c757ac, 0x73db3a0, 0xc000078750)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:914 +0x2fb
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute(...)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:864
main.main()
    _output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-apiserver/apiserver.go:43 +0xcd

9. Anything else do we need to know?

The ELB in front of the API server shows the health checks failing, and the EC2 instance is marked OutOfService.
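
For completeness, the instance health can be confirmed from the AWS CLI; the load balancer name below is a placeholder for the classic ELB that kops creates for the API:

```
aws elb describe-instance-health --load-balancer-name <api-elb-name>
```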

hakman commented 4 years ago

I think this was addressed via Slack.

bravitejareddy commented 4 years ago

Yes, @hakman. I am trying to reproduce this to identify what causes the situation.

  1. I created a test cluster using kops.
  2. Deployed a couple of apps (Istio).
  3. Ran the cluster deletion and interrupted it after some time, before it deleted the state store from the S3 bucket.
  4. After the master and worker nodes were deleted, I waited a couple of minutes, ran kops update cluster, and was able to bring the kops cluster back to normal.
  5. Deleted the cluster again, waited a couple of hours, and ran kops update cluster again. The master and nodes come up, but the API server Docker container keeps crashing, which is exactly what happened in my prod cluster.
  6. Any idea what could have gone wrong?

As per the kops etcd backup/restore doc, I am able to recover the cluster using the etcd-manager-ctl commands, but I would like to know why my API server is exiting.
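
For reference, the restore flow I used is roughly the one from that doc; a sketch assuming the default backup location under the state store (the bucket name and backup timestamp below are placeholders):

```
# list available backups for the "main" etcd cluster
etcd-manager-ctl --backup-store=s3://<state-store-bucket>/cluster.example.com/backups/etcd/main list-backups

# queue a restore command; etcd-manager picks it up on its next reconciliation loop
etcd-manager-ctl --backup-store=s3://<state-store-bucket>/cluster.example.com/backups/etcd/main restore-backup <backup-timestamp>

# repeat for the "events" etcd cluster, then restart the etcd-manager containers (or the masters)
```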

CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS                          PORTS               NAMES
58cdec9cefb0        48db9392345b                         "/usr/local/bin/kube…"   2 minutes ago       Exited (2) About a minute ago                       k8s_kube-apiserver_kube-apiserver-ip-172-20-40-154.us-east-2.compute.internal_kube-system_83d0b8923ed5310d94d032547c6eeb6e_195
24e21d5dff0f        k8s.gcr.io/kube-controller-manager   "/usr/local/bin/kube…"   17 hours ago        Up 17 hours                                         k8s_kube-controller-manager_kube-controller-manager-ip-172-20-40-154.us-east-2.compute.internal_kube-system_eaee0f6f1c5ee8b4bd94da9f58089984_0
0d4f6bd743f9        k8s.gcr.io/kube-proxy                "/usr/local/bin/kube…"   17 hours ago        Up 17 hours                                         k8s_kube-proxy_kube-proxy-ip-172-20-40-154.us-east-2.compute.internal_kube-system_7f95b8c170f60baf78f18d3badc82c91_0
cfd2739bd1d7        kopeio/etcd-manager                  "/bin/sh -c 'mkfifo …"   17 hours ago        Up 17 hours                                         k8s_etcd-manager_etcd-manager-main-ip-172-20-40-154.us-east-2.compute.internal_kube-system_f6bf9d005285b23146cbeb12fe6db5a3_0
1951e2b656c7        kopeio/etcd-manager                  "/bin/sh -c 'mkfifo …"   17 hours ago        Up 17 hours                                         k8s_etcd-manager_etcd-manager-events-ip-172-20-40-154.us-east-2.compute.internal_kube-system_813355344caf142b03ae8e9e77b3e672_0
6e27b2056595        k8s.gcr.io/kube-scheduler            "/usr/local/bin/kube…"   17 hours ago        Up 17 hours                                         k8s_kube-scheduler_kube-scheduler-ip-172-20-40-154.us-east-2.compute.internal_kube-system_2096e4a5044b24b3acabe5b05cd067ae_0
1ee7e2707bd6        k8s.gcr.io/pause-amd64:3.0           "/pause"                 17 hours ago        Up 17 hours                                         k8s_POD_kube-controller-manager-ip-172-20-40-154.us-east-2.compute.internal_kube-system_eaee0f6f1c5ee8b4bd94da9f58089984_0
f4f739496471        k8s.gcr.io/pause-amd64:3.0           "/pause"                 17 hours ago        Up 17 hours                                         k8s_POD_etcd-manager-events-ip-172-20-40-154.us-east-2.compute.internal_kube-system_813355344caf142b03ae8e9e77b3e672_0
305ed2a65069        k8s.gcr.io/pause-amd64:3.0           "/pause"                 17 hours ago        Up 17 hours                                         k8s_POD_kube-scheduler-ip-172-20-40-154.us-east-2.compute.internal_kube-system_2096e4a5044b24b3acabe5b05cd067ae_0
0162e99fb6c0        k8s.gcr.io/pause-amd64:3.0           "/pause"                 17 hours ago        Up 17 hours                                         k8s_POD_kube-proxy-ip-172-20-40-154.us-east-2.compute.internal_kube-system_7f95b8c170f60baf78f18d3badc82c91_0
584344c6a984        k8s.gcr.io/pause-amd64:3.0           "/pause"                 17 hours ago        Up 17 hours                                         k8s_POD_etcd-manager-main-ip-172-20-40-154.us-east-2.compute.internal_kube-system_f6bf9d005285b23146cbeb12fe6db5a3_0
4579e016fce6        k8s.gcr.io/pause-amd64:3.0           "/pause"                 17 hours ago        Up 17 hours                                         k8s_POD_kube-apiserver-ip-172-20-40-154.us-east-2.compute.internal_kube-system_83d0b8923ed5310d94d032547c6eeb6e_0

Docker logs for the API server (master node container):

Flag --basic-auth-file has been deprecated, Basic authentication mode is deprecated and will be removed in a future release. It is not recommended for production environments.
Flag --insecure-bind-address has been deprecated, This flag will be removed in a future version.
Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I0723 05:46:45.227033       1 flags.go:33] FLAG: --add-dir-header="false"
I0723 05:46:45.227177       1 flags.go:33] FLAG: --address="127.0.0.1"
I0723 05:46:45.227187       1 flags.go:33] FLAG: --admission-control="[]"
I0723 05:46:45.227197       1 flags.go:33] FLAG: --admission-control-config-file=""
I0723 05:46:45.227202       1 flags.go:33] FLAG: --advertise-address="<nil>"
I0723 05:46:45.227207       1 flags.go:33] FLAG: --allow-privileged="true"
I0723 05:46:45.227213       1 flags.go:33] FLAG: --alsologtostderr="true"
I0723 05:46:45.227218       1 flags.go:33] FLAG: --anonymous-auth="false"
I0723 05:46:45.227223       1 flags.go:33] FLAG: --api-audiences="[]"
I0723 05:46:45.227229       1 flags.go:33] FLAG: --apiserver-count="1"
I0723 05:46:45.227235       1 flags.go:33] FLAG: --audit-dynamic-configuration="false"
I0723 05:46:45.227240       1 flags.go:33] FLAG: --audit-log-batch-buffer-size="10000"
I0723 05:46:45.227245       1 flags.go:33] FLAG: --audit-log-batch-max-size="1"
I0723 05:46:45.227249       1 flags.go:33] FLAG: --audit-log-batch-max-wait="0s"
I0723 05:46:45.227255       1 flags.go:33] FLAG: --audit-log-batch-throttle-burst="0"
I0723 05:46:45.227259       1 flags.go:33] FLAG: --audit-log-batch-throttle-enable="false"
I0723 05:46:45.227264       1 flags.go:33] FLAG: --audit-log-batch-throttle-qps="0"
I0723 05:46:45.227271       1 flags.go:33] FLAG: --audit-log-format="json"
I0723 05:46:45.227276       1 flags.go:33] FLAG: --audit-log-maxage="0"
I0723 05:46:45.227280       1 flags.go:33] FLAG: --audit-log-maxbackup="0"
I0723 05:46:45.227285       1 flags.go:33] FLAG: --audit-log-maxsize="0"
I0723 05:46:45.227289       1 flags.go:33] FLAG: --audit-log-mode="blocking"
I0723 05:46:45.227294       1 flags.go:33] FLAG: --audit-log-path=""
I0723 05:46:45.227298       1 flags.go:33] FLAG: --audit-log-truncate-enabled="false"
I0723 05:46:45.227303       1 flags.go:33] FLAG: --audit-log-truncate-max-batch-size="10485760"
I0723 05:46:45.227310       1 flags.go:33] FLAG: --audit-log-truncate-max-event-size="102400"
I0723 05:46:45.227315       1 flags.go:33] FLAG: --audit-log-version="audit.k8s.io/v1"
I0723 05:46:45.227320       1 flags.go:33] FLAG: --audit-policy-file=""
I0723 05:46:45.227324       1 flags.go:33] FLAG: --audit-webhook-batch-buffer-size="10000"
I0723 05:46:45.227329       1 flags.go:33] FLAG: --audit-webhook-batch-initial-backoff="10s"
I0723 05:46:45.227334       1 flags.go:33] FLAG: --audit-webhook-batch-max-size="400"
I0723 05:46:45.227339       1 flags.go:33] FLAG: --audit-webhook-batch-max-wait="30s"
I0723 05:46:45.227343       1 flags.go:33] FLAG: --audit-webhook-batch-throttle-burst="15"
I0723 05:46:45.227348       1 flags.go:33] FLAG: --audit-webhook-batch-throttle-enable="true"
I0723 05:46:45.227352       1 flags.go:33] FLAG: --audit-webhook-batch-throttle-qps="10"
I0723 05:46:45.227358       1 flags.go:33] FLAG: --audit-webhook-config-file=""
I0723 05:46:45.227364       1 flags.go:33] FLAG: --audit-webhook-initial-backoff="10s"
I0723 05:46:45.227368       1 flags.go:33] FLAG: --audit-webhook-mode="batch"
I0723 05:46:45.227373       1 flags.go:33] FLAG: --audit-webhook-truncate-enabled="false"
I0723 05:46:45.227378       1 flags.go:33] FLAG: --audit-webhook-truncate-max-batch-size="10485760"
I0723 05:46:45.227382       1 flags.go:33] FLAG: --audit-webhook-truncate-max-event-size="102400"
I0723 05:46:45.227387       1 flags.go:33] FLAG: --audit-webhook-version="audit.k8s.io/v1"
I0723 05:46:45.227392       1 flags.go:33] FLAG: --authentication-token-webhook-cache-ttl="2m0s"
I0723 05:46:45.227397       1 flags.go:33] FLAG: --authentication-token-webhook-config-file=""
I0723 05:46:45.227402       1 flags.go:33] FLAG: --authorization-mode="[RBAC]"
I0723 05:46:45.227411       1 flags.go:33] FLAG: --authorization-policy-file=""
I0723 05:46:45.227415       1 flags.go:33] FLAG: --authorization-webhook-cache-authorized-ttl="5m0s"
I0723 05:46:45.227420       1 flags.go:33] FLAG: --authorization-webhook-cache-unauthorized-ttl="30s"
I0723 05:46:45.227425       1 flags.go:33] FLAG: --authorization-webhook-config-file=""
I0723 05:46:45.227429       1 flags.go:33] FLAG: --basic-auth-file="/srv/kubernetes/basic_auth.csv"
I0723 05:46:45.229308       1 flags.go:33] FLAG: --bind-address="0.0.0.0"
I0723 05:46:45.229321       1 flags.go:33] FLAG: --cert-dir="/var/run/kubernetes"
I0723 05:46:45.229328       1 flags.go:33] FLAG: --client-ca-file="/srv/kubernetes/ca.crt"
I0723 05:46:45.229333       1 flags.go:33] FLAG: --cloud-config=""
I0723 05:46:45.229338       1 flags.go:33] FLAG: --cloud-provider="aws"
I0723 05:46:45.229343       1 flags.go:33] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
I0723 05:46:45.229357       1 flags.go:33] FLAG: --contention-profiling="false"
I0723 05:46:45.229364       1 flags.go:33] FLAG: --cors-allowed-origins="[]"
I0723 05:46:45.229373       1 flags.go:33] FLAG: --default-not-ready-toleration-seconds="300"
I0723 05:46:45.229379       1 flags.go:33] FLAG: --default-unreachable-toleration-seconds="300"
I0723 05:46:45.229383       1 flags.go:33] FLAG: --default-watch-cache-size="100"
I0723 05:46:45.229389       1 flags.go:33] FLAG: --delete-collection-workers="1"
I0723 05:46:45.229393       1 flags.go:33] FLAG: --deserialization-cache-size="0"
I0723 05:46:45.229398       1 flags.go:33] FLAG: --disable-admission-plugins="[]"
I0723 05:46:45.229403       1 flags.go:33] FLAG: --egress-selector-config-file=""
I0723 05:46:45.229408       1 flags.go:33] FLAG: --enable-admission-plugins="[NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,NodeRestriction,ResourceQuota]"
I0723 05:46:45.229426       1 flags.go:33] FLAG: --enable-aggregator-routing="false"
I0723 05:46:45.229436       1 flags.go:33] FLAG: --enable-bootstrap-token-auth="false"
I0723 05:46:45.229441       1 flags.go:33] FLAG: --enable-garbage-collector="true"
I0723 05:46:45.229445       1 flags.go:33] FLAG: --enable-inflight-quota-handler="false"
I0723 05:46:45.229450       1 flags.go:33] FLAG: --enable-logs-handler="true"
I0723 05:46:45.229455       1 flags.go:33] FLAG: --enable-swagger-ui="false"
I0723 05:46:45.229460       1 flags.go:33] FLAG: --encryption-provider-config=""
I0723 05:46:45.229464       1 flags.go:33] FLAG: --endpoint-reconciler-type="lease"
I0723 05:46:45.229469       1 flags.go:33] FLAG: --etcd-cafile="/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt"
I0723 05:46:45.229475       1 flags.go:33] FLAG: --etcd-certfile="/etc/kubernetes/pki/kube-apiserver/etcd-client.crt"
I0723 05:46:45.229481       1 flags.go:33] FLAG: --etcd-compaction-interval="5m0s"
I0723 05:46:45.229487       1 flags.go:33] FLAG: --etcd-count-metric-poll-period="1m0s"
I0723 05:46:45.229491       1 flags.go:33] FLAG: --etcd-keyfile="/etc/kubernetes/pki/kube-apiserver/etcd-client.key"
I0723 05:46:45.229499       1 flags.go:33] FLAG: --etcd-prefix="/registry"
I0723 05:46:45.229504       1 flags.go:33] FLAG: --etcd-servers="[https://127.0.0.1:4001]"
I0723 05:46:45.229513       1 flags.go:33] FLAG: --etcd-servers-overrides="[/events#https://127.0.0.1:4002]"
I0723 05:46:45.229520       1 flags.go:33] FLAG: --event-ttl="1h0m0s"
I0723 05:46:45.229525       1 flags.go:33] FLAG: --experimental-encryption-provider-config=""
I0723 05:46:45.229530       1 flags.go:33] FLAG: --external-hostname=""
I0723 05:46:45.229534       1 flags.go:33] FLAG: --feature-gates=""
I0723 05:46:45.229552       1 flags.go:33] FLAG: --help="false"
I0723 05:46:45.229557       1 flags.go:33] FLAG: --http2-max-streams-per-connection="0"
I0723 05:46:45.229564       1 flags.go:33] FLAG: --insecure-bind-address="127.0.0.1"
I0723 05:46:45.229569       1 flags.go:33] FLAG: --insecure-port="8080"
I0723 05:46:45.229574       1 flags.go:33] FLAG: --kubelet-certificate-authority=""
I0723 05:46:45.229578       1 flags.go:33] FLAG: --kubelet-client-certificate="/srv/kubernetes/kubelet-api.pem"
I0723 05:46:45.229583       1 flags.go:33] FLAG: --kubelet-client-key="/srv/kubernetes/kubelet-api-key.pem"
I0723 05:46:45.229589       1 flags.go:33] FLAG: --kubelet-https="true"
I0723 05:46:45.229593       1 flags.go:33] FLAG: --kubelet-port="10250"
I0723 05:46:45.229600       1 flags.go:33] FLAG: --kubelet-preferred-address-types="[InternalIP,Hostname,ExternalIP]"
I0723 05:46:45.229606       1 flags.go:33] FLAG: --kubelet-read-only-port="10255"
I0723 05:46:45.229611       1 flags.go:33] FLAG: --kubelet-timeout="5s"
I0723 05:46:45.229616       1 flags.go:33] FLAG: --kubernetes-service-node-port="0"
I0723 05:46:45.229621       1 flags.go:33] FLAG: --livez-grace-period="0s"
I0723 05:46:45.229625       1 flags.go:33] FLAG: --log-backtrace-at=":0"
I0723 05:46:45.229633       1 flags.go:33] FLAG: --log-dir=""
I0723 05:46:45.229638       1 flags.go:33] FLAG: --log-file="/var/log/kube-apiserver.log"
I0723 05:46:45.229643       1 flags.go:33] FLAG: --log-file-max-size="1800"
I0723 05:46:45.229648       1 flags.go:33] FLAG: --log-flush-frequency="5s"
I0723 05:46:45.229653       1 flags.go:33] FLAG: --logtostderr="false"
I0723 05:46:45.229658       1 flags.go:33] FLAG: --master-service-namespace="default"
I0723 05:46:45.229662       1 flags.go:33] FLAG: --max-connection-bytes-per-sec="0"
I0723 05:46:45.229668       1 flags.go:33] FLAG: --max-mutating-requests-inflight="200"
I0723 05:46:45.229672       1 flags.go:33] FLAG: --max-requests-inflight="400"
I0723 05:46:45.229677       1 flags.go:33] FLAG: --min-request-timeout="1800"
I0723 05:46:45.229681       1 flags.go:33] FLAG: --oidc-ca-file=""
I0723 05:46:45.229686       1 flags.go:33] FLAG: --oidc-client-id=""
I0723 05:46:45.229691       1 flags.go:33] FLAG: --oidc-groups-claim=""
I0723 05:46:45.229695       1 flags.go:33] FLAG: --oidc-groups-prefix=""
I0723 05:46:45.229699       1 flags.go:33] FLAG: --oidc-issuer-url=""
I0723 05:46:45.229704       1 flags.go:33] FLAG: --oidc-required-claim=""
I0723 05:46:45.229711       1 flags.go:33] FLAG: --oidc-signing-algs="[RS256]"
I0723 05:46:45.229718       1 flags.go:33] FLAG: --oidc-username-claim="sub"
I0723 05:46:45.229723       1 flags.go:33] FLAG: --oidc-username-prefix=""
I0723 05:46:45.229727       1 flags.go:33] FLAG: --port="8080"
I0723 05:46:45.229732       1 flags.go:33] FLAG: --profiling="true"
I0723 05:46:45.229736       1 flags.go:33] FLAG: --proxy-client-cert-file="/srv/kubernetes/apiserver-aggregator.cert"
I0723 05:46:45.229742       1 flags.go:33] FLAG: --proxy-client-key-file="/srv/kubernetes/apiserver-aggregator.key"
I0723 05:46:45.229747       1 flags.go:33] FLAG: --request-timeout="1m0s"
I0723 05:46:45.229752       1 flags.go:33] FLAG: --requestheader-allowed-names="[aggregator]"
I0723 05:46:45.229763       1 flags.go:33] FLAG: --requestheader-client-ca-file="/srv/kubernetes/apiserver-aggregator-ca.cert"
I0723 05:46:45.229769       1 flags.go:33] FLAG: --requestheader-extra-headers-prefix="[X-Remote-Extra-]"
I0723 05:46:45.229780       1 flags.go:33] FLAG: --requestheader-group-headers="[X-Remote-Group]"
I0723 05:46:45.229786       1 flags.go:33] FLAG: --requestheader-username-headers="[X-Remote-User]"
I0723 05:46:45.229793       1 flags.go:33] FLAG: --runtime-config=""
I0723 05:46:45.229800       1 flags.go:33] FLAG: --secure-port="443"
I0723 05:46:45.229805       1 flags.go:33] FLAG: --service-account-api-audiences="[]"
I0723 05:46:45.229811       1 flags.go:33] FLAG: --service-account-issuer=""
I0723 05:46:45.229815       1 flags.go:33] FLAG: --service-account-key-file="[]"
I0723 05:46:45.229824       1 flags.go:33] FLAG: --service-account-lookup="true"
I0723 05:46:45.229828       1 flags.go:33] FLAG: --service-account-max-token-expiration="0s"
I0723 05:46:45.229833       1 flags.go:33] FLAG: --service-account-signing-key-file=""
I0723 05:46:45.229838       1 flags.go:33] FLAG: --service-cluster-ip-range="100.64.0.0/13"
I0723 05:46:45.229842       1 flags.go:33] FLAG: --service-node-port-range="30000-32767"
I0723 05:46:45.229849       1 flags.go:33] FLAG: --shutdown-delay-duration="0s"
I0723 05:46:45.229854       1 flags.go:33] FLAG: --skip-headers="false"
I0723 05:46:45.229859       1 flags.go:33] FLAG: --skip-log-headers="false"
I0723 05:46:45.229863       1 flags.go:33] FLAG: --ssh-keyfile=""
I0723 05:46:45.229868       1 flags.go:33] FLAG: --ssh-user=""
I0723 05:46:45.229872       1 flags.go:33] FLAG: --stderrthreshold="2"
I0723 05:46:45.229877       1 flags.go:33] FLAG: --storage-backend="etcd3"
I0723 05:46:45.229882       1 flags.go:33] FLAG: --storage-media-type="application/vnd.kubernetes.protobuf"
I0723 05:46:45.229887       1 flags.go:33] FLAG: --target-ram-mb="0"
I0723 05:46:45.229891       1 flags.go:33] FLAG: --tls-cert-file="/srv/kubernetes/server.cert"
I0723 05:46:45.229896       1 flags.go:33] FLAG: --tls-cipher-suites="[]"
I0723 05:46:45.229919       1 flags.go:33] FLAG: --tls-min-version=""
I0723 05:46:45.229925       1 flags.go:33] FLAG: --tls-private-key-file="/srv/kubernetes/server.key"
I0723 05:46:45.229930       1 flags.go:33] FLAG: --tls-sni-cert-key="[]"
I0723 05:46:45.229936       1 flags.go:33] FLAG: --token-auth-file="/srv/kubernetes/known_tokens.csv"
I0723 05:46:45.229941       1 flags.go:33] FLAG: --v="2"
I0723 05:46:45.229946       1 flags.go:33] FLAG: --version="false"
I0723 05:46:45.229956       1 flags.go:33] FLAG: --vmodule=""
I0723 05:46:45.229961       1 flags.go:33] FLAG: --watch-cache="true"
I0723 05:46:45.229966       1 flags.go:33] FLAG: --watch-cache-sizes="[]"
I0723 05:46:45.230324       1 server.go:623] external host was not specified, using 172.20.40.154
I0723 05:46:45.230486       1 server.go:666] Initializing cache sizes based on 0MB limit
I0723 05:46:45.230794       1 server.go:149] Version: v1.16.8
W0723 05:46:45.630460       1 admission.go:76] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I0723 05:46:45.630964       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook,RuntimeClass.
I0723 05:46:45.630981       1 plugins.go:161] Loaded 7 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,RuntimeClass,ResourceQuota.
W0723 05:46:45.631389       1 admission.go:76] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I0723 05:46:45.631668       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook,RuntimeClass.
I0723 05:46:45.631681       1 plugins.go:161] Loaded 7 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,RuntimeClass,ResourceQuota.
I0723 05:46:45.634036       1 client.go:357] parsed scheme: "endpoint"
I0723 05:46:45.634088       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:4001 0  <nil>}]
W0723 05:46:45.634800       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
I0723 05:46:46.629812       1 client.go:357] parsed scheme: "endpoint"
I0723 05:46:46.629849       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:4001 0  <nil>}]
W0723 05:46:46.630180       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0723 05:46:46.635152       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0723 05:46:47.630636       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0723 05:46:47.969816       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0723 05:46:48.954635       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0723 05:46:50.067633       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0723 05:46:52.000846       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0723 05:46:53.434147       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0723 05:46:56.479097       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0723 05:46:59.148647       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
W0723 05:47:03.442051       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:4001: connect: connection refused". Reconnecting...
panic: context deadline exceeded

goroutine 1 [running]:
k8s.io/kubernetes/vendor/k8s.io/apiextensions-apiserver/pkg/registry/customresourcedefinition.NewREST(0xc0003c58f0, 0x4e6da40, 0xc000561680, 0xc0005618a8)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiextensions-apiserver/pkg/registry/customresourcedefinition/etcd.go:56 +0x3c1
k8s.io/kubernetes/vendor/k8s.io/apiextensions-apiserver/pkg/apiserver.completedConfig.New(0xc00000d780, 0xc000861e48, 0x4f28180, 0x73f92c8, 0x10, 0x0, 0x0)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiextensions-apiserver/pkg/apiserver/apiserver.go:147 +0x152b
k8s.io/kubernetes/cmd/kube-apiserver/app.createAPIExtensionsServer(0xc000861e40, 0x4f28180, 0x73f92c8, 0x0, 0x4e6d6a0, 0xc0001a71d0)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-apiserver/app/apiextensions.go:95 +0x59
k8s.io/kubernetes/cmd/kube-apiserver/app.CreateServerChain(0xc0004d2000, 0xc000096360, 0x43ee6c0, 0xc, 0xc000795c48)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-apiserver/app/server.go:182 +0x2bc
k8s.io/kubernetes/cmd/kube-apiserver/app.Run(0xc0004d2000, 0xc000096360, 0x0, 0x0)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-apiserver/app/server.go:151 +0x101
k8s.io/kubernetes/cmd/kube-apiserver/app.NewAPIServerCommand.func1(0xc00011f900, 0xc000689680, 0x0, 0x24, 0x0, 0x0)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-apiserver/app/server.go:118 +0x104
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute(0xc00011f900, 0xc0000ba010, 0x24, 0x27, 0xc00011f900, 0xc0000ba010)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:826 +0x460
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00011f900, 0x16244b859c9c5d05, 0x73db3a0, 0xc00006a750)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:914 +0x2fb
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute(...)
    /workspace/anago-v1.16.8-beta.0.65+ef1ba35b1a4560/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:864
main.main()
    _output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-apiserver/apiserver.go:43 +0xcd
hakman commented 4 years ago

This log indicates that kube-apiserver cannot connect to the etcd-manager pods (https://127.0.0.1:4001). You may want to check their logs and also the protokube container logs.
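
For example, something like this on the master should surface those logs (a sketch assuming Docker as the runtime, as in the docker ps output above; container IDs are placeholders):

```
# find the etcd-manager and protokube containers on the master
docker ps -a | grep -E 'etcd-manager|protokube'

# tail their logs using the IDs from the previous command
docker logs --tail 200 <etcd-manager-main-container-id>
docker logs --tail 200 <protokube-container-id>

# kops masters typically also write the etcd logs to files under /var/log
ls -l /var/log/etcd*.log
```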

bravitejareddy commented 4 years ago

protokube logs:

`Jul 22 20:33:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: E0722 20:33:27.284090 9297 kubelet.go:2267] node "ip-172-20-40-154.us-east-2.compute.internal" not found Jul 22 20:33:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: E0722 20:33:27.384374 9297 kubelet.go:2267] node "ip-172-20-40-154.us-east-2.compute.internal" not found Jul 22 20:33:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: E0722 20:33:27.456419 9297 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list v1beta1.RuntimeClass: Get https://127.0.0.1/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused Jul 22 20:33:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: E0722 20:33:27.484675 9297 kubelet.go:2267] node "ip-172-20-40-154.us-east-2.compute.internal" not found Jul 22 20:33:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: E0722 20:33:27.584918 9297 kubelet.go:2267] node "ip-172-20-40-154.us-east-2.compute.internal" not found Jul 22 20:33:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: E0722 20:33:27.656410 9297 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list v1.Pod: Get https://127.0.0.1/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-20-40-154.us-east-2.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused Jul 22 20:33:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: E0722 20:33:27.685168 9297 kubelet.go:2267] node "ip-172-20-40-154.us-east-2.compute.internal" not found Jul 22 20:33:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: E0722 20:33:27.785464 9297 kubelet.go:2267] node "ip-172-20-40-154.us-east-2.compute.internal" not found Jul 22 20:33:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: E0722 20:33:27.856367 9297 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSIDriver: Get https://127.0.0.1/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused Jul 22 20:33:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: E0722 20:33:27.885750 9297 kubelet.go:2267] node "ip-172-20-40-154.us-east-2.compute.internal" not found I0722 20:33:27.924398 9264 kube_boot.go:197] kubelet systemd service already running I0722 20:33:27.924408 9264 labeler.go:35] Querying k8s for node "ip-172-20-40-154.us-east-2.compute.internal" W0722 20:33:27.924753 9264 kube_boot.go:155] error bootstrapping master node labels: error querying node "ip-172-20-40-154.us-east-2.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-172-20-40-154.us-east-2.compute.internal: dial tcp 127.0.0.1:443: connect: connection refused W0722 20:33:27.925130 9264 rbac.go:52] Error configuring RBAC: error creating service accounts: Post https://127.0.0.1/api/v1/namespaces/kube-system/serviceaccounts: dial tcp 127.0.0.1:443: connect: connection refused W0722 20:33:27.925140 9264 rbac.go:52] Error configuring RBAC: error creating cluster role bindings: unable to create RBAC clusterrolebinding: Post https://127.0.0.1/apis/rbac.authorization.k8s.io/v1beta1/clusterrolebindings: dial tcp 127.0.0.1:443: connect: connection refused W0722 20:33:27.925146 9264 kube_boot.go:165] error initializing rbac: error creating service accounts: Post https://127.0.0.1/api/v1/namespaces/kube-system/serviceaccounts: dial tcp 127.0.0.1:443: connect: connection refused I0722 20:33:27.925154 
9264 channels.go:31] checking channel: "s3://example.com/cluster.example.com/addons/bootstrap-channel.yaml" I0722 20:33:27.925185 9264 channels.go:45] Running command: channels apply channel s3://example.com/cluster.example.com/addons/bootstrap-channel.yaml --v=4 --yes I0722 20:33:27.949095 9264 channels.go:48] error running channels apply channel s3://example.com/cluster.example.com/addons/bootstrap-channel.yaml --v=4 --yes: I0722 20:33:27.949123 9264 channels.go:49] Error: error querying kubernetes version: Get https://127.0.0.1/version?timeout=32s: dial tcp 127.0.0.1:443: connect: connection refused Usage: channels apply channel [flags]

Flags: -f, --filename strings Apply from a local file -h, --help help for channel --yes Apply update

Global Flags: --alsologtostderr log to standard error as well as files --config string config file (default is $HOME/.channels.yaml) --log_backtrace_at traceLocation when logging hits line file:N, emit a stack trace (default :0) --log_dir string If non-empty, write log files in this directory --log_file string If non-empty, use this log file --log_file_max_size uint Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800) --logtostderr log to standard error instead of files (default true) --skip_headers If true, avoid header prefixes in the log messages --skip_log_headers If true, avoid headers when opening log files --stderrthreshold severity logs at or above this threshold go to stderr (default 2) -v, --v Level number for the log level verbosity (default 0) --vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging

error querying kubernetes version: Get https://127.0.0.1/version?timeout=32s: dial tcp 127.0.0.1:443: connect: connection refused I0722 20:33:27.949138 9264 channels.go:34] apply channel output was: Error: error querying kubernetes version: Get https://127.0.0.1/version?timeout=32s: dial tcp 127.0.0.1:443: connect: connection refused Usage: channels apply channel [flags]

Flags: -f, --filename strings Apply from a local file -h, --help help for channel --yes Apply update

Global Flags: --alsologtostderr log to standard error as well as files --config string config file (default is $HOME/.channels.yaml) --log_backtrace_at traceLocation when logging hits line file:N, emit a stack trace (default :0) --log_dir string If non-empty, write log files in this directory --log_file string If non-empty, use this log file --log_file_max_size uint Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800) --logtostderr log to standard error instead of files (default true) --skip_headers If true, avoid header prefixes in the log messages --skip_log_headers If true, avoid headers when opening log files --stderrthreshold severity logs at or above this threshold go to stderr (default 2) -v, --v Level number for the log level verbosity (default 0) --vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging

error querying kubernetes version: Get https://127.0.0.1/version?timeout=32s: dial tcp 127.0.0.1:443: connect: connection refused W0722 20:33:27.949147 9264 kube_boot.go:170] error applying channel "s3://example.com/cluster.example.com/addons/bootstrap-channel.yaml": error running channels: exit status 1 I0722 20:34:27.949372 9264 kube_boot.go:141] protokube management of etcd not enabled; won't scan for volumes I0722 20:34:27.949406 9264 kube_boot.go:182] ensuring that kubelet systemd service is running I0722 20:34:27.956494 9264 kube_boot.go:195] 'systemctl status kubelet' output: ● kubelet.service - Kubernetes Kubelet Server Loaded: loaded (/usr/lib/systemd/system/kubelet.service; static; vendor preset: disabled) Active: active (running) since Wed 2020-07-22 12:04:01 UTC; 8h ago Docs: https://github.com/kubernetes/kubernetes Main PID: 9297 (kubelet) Tasks: 17 Memory: 41.8M CGroup: /system.slice/kubelet.service └─9297 /usr/local/bin/kubelet --anonymous-auth=false --cgroup-root=/ --client-ca-file=/srv/kubernetes/ca.crt --cloud-provider=aws --cluster-dns=100.64.0.10 --cluster-domain=cluster.local --enable-debugging-handlers=true --eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5% --hostname-override=ip-172-20-40-154.us-east-2.compute.internal --kubeconfig=/var/lib/kubelet/kubeconfig --network-plugin=cni --non-masquerade-cidr=100.64.0.0/10 --pod-infra-container-image=k8s.gcr.io/pause-amd64:3.0 --pod-manifest-path=/etc/kubernetes/manifests --register-schedulable=true --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --v=2 --volume-plugin-dir=/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ --cni-bin-dir=/opt/cni/bin/ --cni-conf-dir=/etc/cni/net.d/

Jul 22 20:34:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: I0722 20:34:27.854558 9297 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach Jul 22 20:34:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: I0722 20:34:27.854594 9297 kubelet_node_status.go:334] Adding node label from cloud provider: beta.kubernetes.io/instance-type=t2.medium Jul 22 20:34:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: I0722 20:34:27.854604 9297 kubelet_node_status.go:345] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/zone=us-east-2a Jul 22 20:34:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: I0722 20:34:27.854611 9297 kubelet_node_status.go:349] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/region=us-east-2 Jul 22 20:34:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: I0722 20:34:27.856085 9297 kubelet_node_status.go:472] Recording NodeHasSufficientMemory event message for node ip-172-20-40-154.us-east-2.compute.internal Jul 22 20:34:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: I0722 20:34:27.856110 9297 kubelet_node_status.go:472] Recording NodeHasNoDiskPressure event message for node ip-172-20-40-154.us-east-2.compute.internal Jul 22 20:34:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: I0722 20:34:27.856125 9297 kubelet_node_status.go:472] Recording NodeHasSufficientPID event message for node ip-172-20-40-154.us-east-2.compute.internal Jul 22 20:34:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: I0722 20:34:27.856149 9297 kubelet_node_status.go:72] Attempting to register node ip-172-20-40-154.us-east-2.compute.internal Jul 22 20:34:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: E0722 20:34:27.856451 9297 kubelet_node_status.go:94] Unable to register node "ip-172-20-40-154.us-east-2.compute.internal" with API server: Post https://127.0.0.1/api/v1/nodes: dial tcp 127.0.0.1:443: connect: connection refused Jul 22 20:34:27 ip-172-20-40-154.us-east-2.compute.internal kubelet[9297]: E0722 20:34:27.925506 9297 kubelet.go:2267] node "ip-172-20-40-154.us-east-2.compute.internal" not found I0722 20:34:27.956523 9264 kube_boot.go:197] kubelet systemd service already running I0722 20:34:27.956533 9264 labeler.go:35] Querying k8s for node "ip-172-20-40-154.us-east-2.compute.internal" W0722 20:34:27.956878 9264 kube_boot.go:155] error bootstrapping master node labels: error querying node "ip-172-20-40-154.us-east-2.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-172-20-40-154.us-east-2.compute.internal: dial tcp 127.0.0.1:443: connect: connection refused W0722 20:34:27.957286 9264 rbac.go:52] Error configuring RBAC: error creating service accounts: Post https://127.0.0.1/api/v1/namespaces/kube-system/serviceaccounts: dial tcp 127.0.0.1:443: connect: connection refused W0722 20:34:27.957296 9264 rbac.go:52] Error configuring RBAC: error creating cluster role bindings: unable to create RBAC clusterrolebinding: Post https://127.0.0.1/apis/rbac.authorization.k8s.io/v1beta1/clusterrolebindings: dial tcp 127.0.0.1:443: connect: connection refused W0722 20:34:27.957302 9264 kube_boot.go:165] error initializing rbac: error creating service accounts: Post https://127.0.0.1/api/v1/namespaces/kube-system/serviceaccounts: dial tcp 127.0.0.1:443: connect: connection refused I0722 20:34:27.957308 9264 channels.go:31] checking channel: "s3://example.com/cluster.example.com/addons/bootstrap-channel.yaml" 
I0722 20:34:27.957341 9264 channels.go:45] Running command: channels apply channel s3://example.com/cluster.example.com/addons/bootstrap-channel.yaml --v=4 --yes I0722 20:34:27.973537 9264 channels.go:48] error running channels apply channel s3://example.com/cluster.example.com/addons/bootstrap-channel.yaml --v=4 --yes: I0722 20:34:27.973563 9264 channels.go:49] Error: error querying kubernetes version: Get https://127.0.0.1/version?timeout=32s: dial tcp 127.0.0.1:443: connect: connection refused

etcd-manager-main-logs:

I0723 06:53:01.263325 9962 controller.go:277] etcd cluster members: map[] I0723 06:53:01.263337 9962 controller.go:615] sending member map to all peers: I0723 06:53:01.263496 9962 commands.go:22] not refreshing commands - TTL not hit I0723 06:53:01.263509 9962 s3fs.go:220] Reading file "s3://example.com/cluster.example.com/backups/etcd/main/control/etcd-cluster-created" I0723 06:53:01.273514 9962 controller.go:369] spec member_count:1 etcd_version:"3.2.24" I0723 06:53:01.273610 9962 controller.go:380] got restore-backup command: timestamp:1595485522319633081 restore_backup:<cluster_spec:<member_count:1 etcd_version:"3.2.24" > backup:"2020-07-21T00:02:40Z-003982" > I0723 06:53:01.273642 9962 controller.go:615] sending member map to all peers: members:<name:"etcd-a" dns:"etcd-a.internal.cluster.example.com" addresses:"172.20.40.154" > I0723 06:53:01.273867 9962 etcdserver.go:226] updating hosts: map[172.20.40.154:[etcd-a.internal.cluster.example.com]] I0723 06:53:01.273885 9962 hosts.go:84] hosts update: primary=map[172.20.40.154:[etcd-a.internal.cluster.example.com]], fallbacks=map[], final=map[172.20.40.154:[etcd-a.internal.cluster.example.com]] I0723 06:53:01.280428 9962 newcluster.go:120] starting new etcd cluster with [etcdClusterPeerInfo{peer=peer{id:"etcd-a" endpoints:"172.20.40.154:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-a" peer_urls:"https://etcd-a.internal.cluster.example.com:2380" client_urls:"https://etcd-a.internal.cluster.example.com:4001" quarantined_client_urls:"https://etcd-a.internal.cluster.example.com:3994" > }] W0723 06:53:01.280680 9962 controller.go:149] unexpected error running etcd cluster reconciliation loop: error from JoinClusterRequest (prepare) from peer "peer{id:\"etcd-a\" endpoints:\"172.20.40.154:3996\" }": rpc error: code = Unknown desc = concurrent prepare in progress "V56UAvmGhGLas1BYVWrddQ" I0723 06:53:01.558221 9962 volumes.go:85] AWS API Request: ec2/DescribeVolumes I0723 06:53:01.650757 9962 hosts.go:84] hosts update: primary=map[172.20.40.154:[etcd-a.internal.cluster.example.com]], fallbacks=map[], final=map[172.20.40.154:[etcd-a.internal.cluster.example.com]] I0723 06:53:01.650844 9962 hosts.go:181] skipping update of unchanged /etc/hosts I0723 06:53:11.282101 9962 controller.go:173] starting controller iteration I0723 06:53:11.282136 9962 controller.go:269] I am leader with token "MLl82eWm_RiMuSn3vOiWCw" I0723 06:53:11.282472 9962 controller.go:276] etcd cluster state: etcdClusterState members: peers: etcdClusterPeerInfo{peer=peer{id:"etcd-a" endpoints:"172.20.40.154:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-a" peer_urls:"https://etcd-a.internal.cluster.example.com:2380" client_urls:"https://etcd-a.internal.cluster.example.com:4001" quarantined_client_urls:"https://etcd-a.internal.cluster.example.com:3994" > } I0723 06:53:11.282523 9962 controller.go:277] etcd cluster members: map[] I0723 06:53:11.282536 9962 controller.go:615] sending member map to all peers: I0723 06:53:11.282690 9962 commands.go:22] not refreshing commands - TTL not hit I0723 06:53:11.282706 9962 s3fs.go:220] Reading file "s3://example.com/cluster.example.com/backups/etcd/main/control/etcd-cluster-created" I0723 06:53:11.311730 9962 controller.go:369] spec member_count:1 etcd_version:"3.2.24" I0723 06:53:11.311802 9962 controller.go:380] got restore-backup command: timestamp:1595485522319633081 restore_backup:<cluster_spec:<member_count:1 etcd_version:"3.2.24" > backup:"2020-07-21T00:02:40Z-003982" > I0723 06:53:11.311834 9962 
controller.go:615] sending member map to all peers: members:<name:"etcd-a" dns:"etcd-a.internal.cluster.example.com" addresses:"172.20.40.154" > I0723 06:53:11.312094 9962 etcdserver.go:226] updating hosts: map[172.20.40.154:[etcd-a.internal.cluster.example.com]] I0723 06:53:11.312116 9962 hosts.go:84] hosts update: primary=map[172.20.40.154:[etcd-a.internal.cluster.example.com]], fallbacks=map[], final=map[172.20.40.154:[etcd-a.internal.cluster.example.com]] I0723 06:53:11.328722 9962 newcluster.go:120] starting new etcd cluster with [etcdClusterPeerInfo{peer=peer{id:"etcd-a" endpoints:"172.20.40.154:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-a" peer_urls:"https://etcd-a.internal.cluster.example.com:2380" client_urls:"https://etcd-a.internal.cluster.example.com:4001" quarantined_client_urls:"https://etcd-a.internal.cluster.example.com:3994" > }] I0723 06:53:11.328952 9962 etcdserver.go:261] preparation "V56UAvmGhGLas1BYVWrddQ" expired I0723 06:53:11.329056 9962 newcluster.go:137] JoinClusterResponse: W0723 06:53:11.329256 9962 controller.go:149] unexpected error running etcd cluster reconciliation loop: error from JoinClusterRequest from peer "peer{id:\"etcd-a\" endpoints:\"172.20.40.154:3996\" }": rpc error: code = Unknown desc = error writing state file "/rootfs/mnt/master-vol-02c3935307d9055c9/state": open /rootfs/mnt/master-vol-02c3935307d9055c9/state: read-only file system I0723 06:53:17.319676 9962 etcdserver.go:534] starting etcd with state new_cluster:true cluster:<cluster_token:"sFSXwHBwYBkD8ZBiNVahFA" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.cluster.example.com:2380" client_urls:"https://etcd-a.internal.cluster.example.com:4001" quarantined_client_urls:"https://etcd-a.internal.cluster.example.com:3994" tls_enabled:true > > etcd_version:"3.2.24" quarantined:true I0723 06:53:17.320072 9962 etcdserver.go:543] starting etcd with datadir /rootfs/mnt/master-vol-02c3935307d9055c9/data/sFSXwHBwYBkD8ZBiNVahFA W0723 06:53:17.320160 9962 etcdserver.go:96] error running etcd: error creating directories for certificate file "/rootfs/mnt/master-vol-02c3935307d9055c9/pki/sFSXwHBwYBkD8ZBiNVahFA/peers/ca.crt": mkdir /rootfs/mnt/master-vol-02c3935307d9055c9/pki/sFSXwHBwYBkD8ZBiNVahFA: read-only file system I0723 06:53:21.330625 9962 controller.go:173] starting controller iteration I0723 06:53:21.330659 9962 controller.go:269] I am leader with token "MLl82eWm_RiMuSn3vOiWCw" W0723 06:53:26.331409 9962 controller.go:675] unable to reach member etcdClusterPeerInfo{peer=peer{id:"etcd-a" endpoints:"172.20.40.154:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-a" peer_urls:"https://etcd-a.internal.cluster.example.com:2380" client_urls:"https://etcd-a.internal.cluster.example.com:4001" quarantined_client_urls:"https://etcd-a.internal.cluster.example.com:3994" > etcd_state:<new_cluster:true cluster:<cluster_token:"sFSXwHBwYBkD8ZBiNVahFA" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.cluster.example.com:2380" client_urls:"https://etcd-a.internal.cluster.example.com:4001" quarantined_client_urls:"https://etcd-a.internal.cluster.example.com:3994" tls_enabled:true > > etcd_version:"3.2.24" quarantined:true > }: error building etcd client for https://etcd-a.internal.cluster.example.com:3994: dial tcp 172.20.40.154:3994: connect: connection refused I0723 06:53:26.331483 9962 controller.go:276] etcd cluster state: etcdClusterState members: peers: etcdClusterPeerInfo{peer=peer{id:"etcd-a" endpoints:"172.20.40.154:3996" }, info=cluster_name:"etcd" 
node_configuration:<name:"etcd-a" peer_urls:"https://etcd-a.internal.cluster.example.com:2380" client_urls:"https://etcd-a.internal.cluster.example.com:4001" quarantined_client_urls:"https://etcd-a.internal.cluster.example.com:3994" > etcd_state:<new_cluster:true cluster:<cluster_token:"sFSXwHBwYBkD8ZBiNVahFA" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.cluster.example.com:2380" client_urls:"https://etcd-a.internal.cluster.example.com:4001" quarantined_client_urls:"https://etcd-a.internal.cluster.example.com:3994" tls_enabled:true > > etcd_version:"3.2.24" quarantined:true > } I0723 06:53:26.331519 9962 controller.go:277] etcd cluster members: map[] I0723 06:53:26.331533 9962 controller.go:615] sending member map to all peers: members:<name:"etcd-a" dns:"etcd-a.internal.cluster.example.com" addresses:"172.20.40.154" > I0723 06:53:26.331787 9962 etcdserver.go:226] updating hosts: map[172.20.40.154:[etcd-a.internal.cluster.example.com]] I0723 06:53:26.331803 9962 hosts.go:84] hosts update: primary=map[172.20.40.154:[etcd-a.internal.cluster.example.com]], fallbacks=map[], final=map[172.20.40.154:[etcd-a.internal.cluster.example.com]] I0723 06:53:26.340317 9962 commands.go:22] not refreshing commands - TTL not hit I0723 06:53:26.340339 9962 s3fs.go:220] Reading file "s3://example.com/cluster.example.com/backups/etcd/main/control/etcd-cluster-created" I0723 06:53:26.354217 9962 controller.go:369] spec member_count:1 etcd_version:"3.2.24" I0723 06:53:26.354284 9962 controller.go:380] got restore-backup command: timestamp:1595485522319633081 restore_backup:<cluster_spec:<member_count:1 etcd_version:"3.2.24" > backup:"2020-07-21T00:02:40Z-003982" > I0723 06:53:26.354533 9962 etcdserver.go:439] StopEtcd request: header:<leadership_token:"MLl82eWm_RiMuSn3vOiWCw" cluster_name:"etcd" > I0723 06:53:26.354558 9962 etcdserver.go:453] Stopping etcd for stop request: header:<leadership_token:"MLl82eWm_RiMuSn3vOiWCw" cluster_name:"etcd" > W0723 06:53:26.354670 9962 controller.go:149] unexpected error running etcd cluster reconciliation loop: error stopping etcd peer "etcd-a": rpc error: code = Unknown desc = error writing state file "/rootfs/mnt/master-vol-02c3935307d9055c9/state": open /rootfs/mnt/master-vol-02c3935307d9055c9/state: read-only file system I0723 06:53:36.356025 9962 controller.go:173] starting controller iteration I0723 06:53:36.356064 9962 controller.go:269] I am leader with token "MLl82eWm_RiMuSn3vOiWCw" I0723 06:53:36.356452 9962 controller.go:276] etcd cluster state: etcdClusterState members: peers: etcdClusterPeerInfo{peer=peer{id:"etcd-a" endpoints:"172.20.40.154:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-a" peer_urls:"https://etcd-a.internal.cluster.example.com:2380" client_urls:"https://etcd-a.internal.cluster.example.com:4001" quarantined_client_urls:"https://etcd-a.internal.cluster.example.com:3994" > } I0723 06:53:36.356501 9962 controller.go:277] etcd cluster members: map[] I0723 06:53:36.356514 9962 controller.go:615] sending member map to all peers: I0723 06:53:36.356670 9962 commands.go:22] not refreshing commands - TTL not hit I0723 06:53:36.356684 9962 s3fs.go:220] Reading file "s3://example.com/cluster.example.com/backups/etcd/main/control/etcd-cluster-created" I0723 06:53:36.365934 9962 controller.go:369] spec member_count:1 etcd_version:"3.2.24" I0723 06:53:36.365993 9962 controller.go:380] got restore-backup command: timestamp:1595485522319633081 restore_backup:<cluster_spec:<member_count:1 etcd_version:"3.2.24" > 
backup:"2020-07-21T00:02:40Z-003982" > I0723 06:53:36.366025 9962 controller.go:615] sending member map to all peers: members:<name:"etcd-a" dns:"etcd-a.internal.cluster.example.com" addresses:"172.20.40.154" > I0723 06:53:36.366228 9962 etcdserver.go:226] updating hosts: map[172.20.40.154:[etcd-a.internal.cluster.example.com]] I0723 06:53:36.366253 9962 hosts.go:84] hosts update: primary=map[172.20.40.154:[etcd-a.internal.cluster.example.com]], fallbacks=map[], final=map[172.20.40.154:[etcd-a.internal.cluster.example.com]] I0723 06:53:36.384848 9962 newcluster.go:120] starting new etcd cluster with [etcdClusterPeerInfo{peer=peer{id:"etcd-a" endpoints:"172.20.40.154:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-a" peer_urls:"https://etcd-a.internal.cluster.example.com:2380" client_urls:"https://etcd-a.internal.cluster.example.com:4001" quarantined_client_urls:"https://etcd-a.internal.cluster.example.com:3994" > }] W0723 06:53:36.385152 9962 controller.go:149] unexpected error running etcd cluster reconciliation loop: error from JoinClusterRequest (prepare) from peer "peer{id:\"etcd-a\" endpoints:\"172.20.40.154:3996\" }": rpc error: code = Unknown desc = concurrent prepare in progress "sFSXwHBwYBkD8ZBiNVahFA"

etcd-manager-events-logs: rlock.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" > } I0723 06:56:36.093332 9870 controller.go:277] etcd cluster members: map[] I0723 06:56:36.093344 9870 controller.go:615] sending member map to all peers: I0723 06:56:36.093502 9870 commands.go:22] not refreshing commands - TTL not hit I0723 06:56:36.093515 9870 s3fs.go:220] Reading file "s3://example.com/cluster.example.com/backups/etcd/events/control/etcd-cluster-created" I0723 06:56:36.102959 9870 controller.go:369] spec member_count:1 etcd_version:"3.2.24" I0723 06:56:36.103077 9870 controller.go:380] got restore-backup command: timestamp:1595485538507219591 restore_backup:<cluster_spec:<member_count:1 etcd_version:"3.2.24" > backup:"2020-07-21T00:01:14Z-003983" > I0723 06:56:36.103108 9870 controller.go:615] sending member map to all peers: members:<name:"etcd-events-a" dns:"etcd-events-a.internal.cluster.example.com" addresses:"172.20.40.154" > I0723 06:56:36.103291 9870 etcdserver.go:226] updating hosts: map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]] I0723 06:56:36.103314 9870 hosts.go:84] hosts update: primary=map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]], fallbacks=map[], final=map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]] I0723 06:56:36.109852 9870 newcluster.go:120] starting new etcd cluster with [etcdClusterPeerInfo{peer=peer{id:"etcd-events-a" endpoints:"172.20.40.154:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-a" peer_urls:"https://etcd-events-a.internal.cluster.example.com:2381" client_urls:"https://etcd-events-a.internal.cluster.example.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" > }] W0723 06:56:36.110177 9870 controller.go:149] unexpected error running etcd cluster reconciliation loop: error from JoinClusterRequest (prepare) from peer "peer{id:\"etcd-events-a\" endpoints:\"172.20.40.154:3997\" }": rpc error: code = Unknown desc = concurrent prepare in progress "nd7Pdla6pC0gfqYP5Pg4qg" I0723 06:56:46.111555 9870 controller.go:173] starting controller iteration I0723 06:56:46.111584 9870 controller.go:269] I am leader with token "ORnb803zx5pvBSk9exFmQw" I0723 06:56:46.111933 9870 controller.go:276] etcd cluster state: etcdClusterState members: peers: etcdClusterPeerInfo{peer=peer{id:"etcd-events-a" endpoints:"172.20.40.154:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-a" peer_urls:"https://etcd-events-a.internal.cluster.example.com:2381" client_urls:"https://etcd-events-a.internal.cluster.example.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" > } I0723 06:56:46.111990 9870 controller.go:277] etcd cluster members: map[] I0723 06:56:46.112002 9870 controller.go:615] sending member map to all peers: I0723 06:56:46.112167 9870 commands.go:22] not refreshing commands - TTL not hit I0723 06:56:46.112181 9870 s3fs.go:220] Reading file "s3://example.com/cluster.example.com/backups/etcd/events/control/etcd-cluster-created" I0723 06:56:46.145430 9870 controller.go:369] spec member_count:1 etcd_version:"3.2.24" I0723 06:56:46.145500 9870 controller.go:380] got restore-backup command: timestamp:1595485538507219591 restore_backup:<cluster_spec:<member_count:1 etcd_version:"3.2.24" > backup:"2020-07-21T00:01:14Z-003983" > I0723 06:56:46.145530 9870 controller.go:615] sending member map to all peers: members:<name:"etcd-events-a" 
dns:"etcd-events-a.internal.cluster.example.com" addresses:"172.20.40.154" > I0723 06:56:46.145721 9870 etcdserver.go:226] updating hosts: map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]] I0723 06:56:46.145747 9870 hosts.go:84] hosts update: primary=map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]], fallbacks=map[], final=map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]] I0723 06:56:46.165402 9870 newcluster.go:120] starting new etcd cluster with [etcdClusterPeerInfo{peer=peer{id:"etcd-events-a" endpoints:"172.20.40.154:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-a" peer_urls:"https://etcd-events-a.internal.cluster.example.com:2381" client_urls:"https://etcd-events-a.internal.cluster.example.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" > }] I0723 06:56:46.165592 9870 etcdserver.go:261] preparation "nd7Pdla6pC0gfqYP5Pg4qg" expired I0723 06:56:46.165692 9870 newcluster.go:137] JoinClusterResponse: W0723 06:56:46.165942 9870 controller.go:149] unexpected error running etcd cluster reconciliation loop: error from JoinClusterRequest from peer "peer{id:\"etcd-events-a\" endpoints:\"172.20.40.154:3997\" }": rpc error: code = Unknown desc = error writing state file "/rootfs/mnt/master-vol-031fdf75b28bdae6b/state": open /rootfs/mnt/master-vol-031fdf75b28bdae6b/state: read-only file system I0723 06:56:53.368585 9870 etcdserver.go:534] starting etcd with state new_cluster:true cluster:<cluster_token:"IMjKIesPIN0eQaFdP3SurA" nodes:<name:"etcd-events-a" peer_urls:"https://etcd-events-a.internal.cluster.example.com:2381" client_urls:"https://etcd-events-a.internal.cluster.example.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" tls_enabled:true > > etcd_version:"3.2.24" quarantined:true I0723 06:56:53.368649 9870 etcdserver.go:543] starting etcd with datadir /rootfs/mnt/master-vol-031fdf75b28bdae6b/data/IMjKIesPIN0eQaFdP3SurA W0723 06:56:53.368734 9870 etcdserver.go:96] error running etcd: error creating directories for certificate file "/rootfs/mnt/master-vol-031fdf75b28bdae6b/pki/IMjKIesPIN0eQaFdP3SurA/peers/ca.crt": mkdir /rootfs/mnt/master-vol-031fdf75b28bdae6b/pki/IMjKIesPIN0eQaFdP3SurA: read-only file system I0723 06:56:56.167323 9870 controller.go:173] starting controller iteration I0723 06:56:56.167362 9870 controller.go:269] I am leader with token "ORnb803zx5pvBSk9exFmQw" W0723 06:57:01.168047 9870 controller.go:675] unable to reach member etcdClusterPeerInfo{peer=peer{id:"etcd-events-a" endpoints:"172.20.40.154:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-a" peer_urls:"https://etcd-events-a.internal.cluster.example.com:2381" client_urls:"https://etcd-events-a.internal.cluster.example.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" > etcd_state:<new_cluster:true cluster:<cluster_token:"IMjKIesPIN0eQaFdP3SurA" nodes:<name:"etcd-events-a" peer_urls:"https://etcd-events-a.internal.cluster.example.com:2381" client_urls:"https://etcd-events-a.internal.cluster.example.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" tls_enabled:true > > etcd_version:"3.2.24" quarantined:true > }: error building etcd client for https://etcd-events-a.internal.cluster.example.com:3995: dial tcp 172.20.40.154:3995: connect: connection refused I0723 06:57:01.168124 9870 controller.go:276] etcd cluster state: etcdClusterState members: 
peers: etcdClusterPeerInfo{peer=peer{id:"etcd-events-a" endpoints:"172.20.40.154:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-a" peer_urls:"https://etcd-events-a.internal.cluster.example.com:2381" client_urls:"https://etcd-events-a.internal.cluster.example.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" > etcd_state:<new_cluster:true cluster:<cluster_token:"IMjKIesPIN0eQaFdP3SurA" nodes:<name:"etcd-events-a" peer_urls:"https://etcd-events-a.internal.cluster.example.com:2381" client_urls:"https://etcd-events-a.internal.cluster.example.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" tls_enabled:true > > etcd_version:"3.2.24" quarantined:true > } I0723 06:57:01.168187 9870 controller.go:277] etcd cluster members: map[] I0723 06:57:01.168202 9870 controller.go:615] sending member map to all peers: members:<name:"etcd-events-a" dns:"etcd-events-a.internal.cluster.example.com" addresses:"172.20.40.154" > I0723 06:57:01.168464 9870 etcdserver.go:226] updating hosts: map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]] I0723 06:57:01.168481 9870 hosts.go:84] hosts update: primary=map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]], fallbacks=map[], final=map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]] I0723 06:57:01.183113 9870 commands.go:22] not refreshing commands - TTL not hit I0723 06:57:01.183139 9870 s3fs.go:220] Reading file "s3://example.com/cluster.example.com/backups/etcd/events/control/etcd-cluster-created" I0723 06:57:01.195288 9870 controller.go:369] spec member_count:1 etcd_version:"3.2.24" I0723 06:57:01.195355 9870 controller.go:380] got restore-backup command: timestamp:1595485538507219591 restore_backup:<cluster_spec:<member_count:1 etcd_version:"3.2.24" > backup:"2020-07-21T00:01:14Z-003983" > I0723 06:57:01.195594 9870 etcdserver.go:439] StopEtcd request: header:<leadership_token:"ORnb803zx5pvBSk9exFmQw" cluster_name:"etcd-events" > I0723 06:57:01.195615 9870 etcdserver.go:453] Stopping etcd for stop request: header:<leadership_token:"ORnb803zx5pvBSk9exFmQw" cluster_name:"etcd-events" > W0723 06:57:01.195721 9870 controller.go:149] unexpected error running etcd cluster reconciliation loop: error stopping etcd peer "etcd-events-a": rpc error: code = Unknown desc = error writing state file "/rootfs/mnt/master-vol-031fdf75b28bdae6b/state": open /rootfs/mnt/master-vol-031fdf75b28bdae6b/state: read-only file system I0723 06:57:02.153496 9870 volumes.go:85] AWS API Request: ec2/DescribeVolumes I0723 06:57:02.203608 9870 hosts.go:84] hosts update: primary=map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]], fallbacks=map[], final=map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]] I0723 06:57:11.197123 9870 controller.go:173] starting controller iteration I0723 06:57:11.197152 9870 controller.go:269] I am leader with token "ORnb803zx5pvBSk9exFmQw" I0723 06:57:11.197490 9870 controller.go:276] etcd cluster state: etcdClusterState members: peers: etcdClusterPeerInfo{peer=peer{id:"etcd-events-a" endpoints:"172.20.40.154:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-a" peer_urls:"https://etcd-events-a.internal.cluster.example.com:2381" client_urls:"https://etcd-events-a.internal.cluster.example.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" > } I0723 06:57:11.197556 9870 controller.go:277] etcd cluster members: map[] I0723 
06:57:11.197569 9870 controller.go:615] sending member map to all peers: I0723 06:57:11.197781 9870 commands.go:22] not refreshing commands - TTL not hit I0723 06:57:11.197803 9870 s3fs.go:220] Reading file "s3://example.com/cluster.example.com/backups/etcd/events/control/etcd-cluster-created" I0723 06:57:11.210235 9870 controller.go:369] spec member_count:1 etcd_version:"3.2.24" I0723 06:57:11.210300 9870 controller.go:380] got restore-backup command: timestamp:1595485538507219591 restore_backup:<cluster_spec:<member_count:1 etcd_version:"3.2.24" > backup:"2020-07-21T00:01:14Z-003983" > I0723 06:57:11.210329 9870 controller.go:615] sending member map to all peers: members:<name:"etcd-events-a" dns:"etcd-events-a.internal.cluster.example.com" addresses:"172.20.40.154" > I0723 06:57:11.210692 9870 etcdserver.go:226] updating hosts: map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]] I0723 06:57:11.210711 9870 hosts.go:84] hosts update: primary=map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]], fallbacks=map[], final=map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]] I0723 06:57:11.210779 9870 hosts.go:181] skipping update of unchanged /etc/hosts I0723 06:57:11.210922 9870 newcluster.go:120] starting new etcd cluster with [etcdClusterPeerInfo{peer=peer{id:"etcd-events-a" endpoints:"172.20.40.154:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-a" peer_urls:"https://etcd-events-a.internal.cluster.example.com:2381" client_urls:"https://etcd-events-a.internal.cluster.example.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" > }] W0723 06:57:11.211161 9870 controller.go:149] unexpected error running etcd cluster reconciliation loop: error from JoinClusterRequest (prepare) from peer "peer{id:\"etcd-events-a\" endpoints:\"172.20.40.154:3997\" }": rpc error: code = Unknown desc = concurrent prepare in progress "IMjKIesPIN0eQaFdP3SurA" I0723 06:57:21.212538 9870 controller.go:173] starting controller iteration I0723 06:57:21.212576 9870 controller.go:269] I am leader with token "ORnb803zx5pvBSk9exFmQw" I0723 06:57:21.213004 9870 controller.go:276] etcd cluster state: etcdClusterState members: peers: etcdClusterPeerInfo{peer=peer{id:"etcd-events-a" endpoints:"172.20.40.154:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-a" peer_urls:"https://etcd-events-a.internal.cluster.example.com:2381" client_urls:"https://etcd-events-a.internal.cluster.example.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" > } I0723 06:57:21.213052 9870 controller.go:277] etcd cluster members: map[] I0723 06:57:21.213064 9870 controller.go:615] sending member map to all peers: I0723 06:57:21.213254 9870 commands.go:22] not refreshing commands - TTL not hit I0723 06:57:21.213267 9870 s3fs.go:220] Reading file "s3://example.com/cluster.example.com/backups/etcd/events/control/etcd-cluster-created" I0723 06:57:21.225003 9870 controller.go:369] spec member_count:1 etcd_version:"3.2.24" I0723 06:57:21.225070 9870 controller.go:380] got restore-backup command: timestamp:1595485538507219591 restore_backup:<cluster_spec:<member_count:1 etcd_version:"3.2.24" > backup:"2020-07-21T00:01:14Z-003983" > I0723 06:57:21.225102 9870 controller.go:615] sending member map to all peers: members:<name:"etcd-events-a" dns:"etcd-events-a.internal.cluster.example.com" addresses:"172.20.40.154" > I0723 06:57:21.225332 9870 etcdserver.go:226] updating hosts: 
map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]] I0723 06:57:21.225351 9870 hosts.go:84] hosts update: primary=map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]], fallbacks=map[], final=map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]] I0723 06:57:21.235906 9870 newcluster.go:120] starting new etcd cluster with [etcdClusterPeerInfo{peer=peer{id:"etcd-events-a" endpoints:"172.20.40.154:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-a" peer_urls:"https://etcd-events-a.internal.cluster.example.com:2381" client_urls:"https://etcd-events-a.internal.cluster.example.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" > }] W0723 06:57:21.236201 9870 controller.go:149] unexpected error running etcd cluster reconciliation loop: error from JoinClusterRequest (prepare) from peer "peer{id:\"etcd-events-a\" endpoints:\"172.20.40.154:3997\" }": rpc error: code = Unknown desc = concurrent prepare in progress "IMjKIesPIN0eQaFdP3SurA" I0723 06:57:31.237572 9870 controller.go:173] starting controller iteration I0723 06:57:31.237612 9870 controller.go:269] I am leader with token "ORnb803zx5pvBSk9exFmQw" I0723 06:57:31.237983 9870 controller.go:276] etcd cluster state: etcdClusterState members: peers: etcdClusterPeerInfo{peer=peer{id:"etcd-events-a" endpoints:"172.20.40.154:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-a" peer_urls:"https://etcd-events-a.internal.cluster.example.com:2381" client_urls:"https://etcd-events-a.internal.cluster.example.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" > } I0723 06:57:31.238064 9870 controller.go:277] etcd cluster members: map[] I0723 06:57:31.238076 9870 controller.go:615] sending member map to all peers: I0723 06:57:31.238225 9870 commands.go:22] not refreshing commands - TTL not hit I0723 06:57:31.238238 9870 s3fs.go:220] Reading file "s3://example.com/cluster.example.com/backups/etcd/events/control/etcd-cluster-created" I0723 06:57:31.247932 9870 controller.go:369] spec member_count:1 etcd_version:"3.2.24" I0723 06:57:31.247992 9870 controller.go:380] got restore-backup command: timestamp:1595485538507219591 restore_backup:<cluster_spec:<member_count:1 etcd_version:"3.2.24" > backup:"2020-07-21T00:01:14Z-003983" > I0723 06:57:31.248020 9870 controller.go:615] sending member map to all peers: members:<name:"etcd-events-a" dns:"etcd-events-a.internal.cluster.example.com" addresses:"172.20.40.154" > I0723 06:57:31.248372 9870 etcdserver.go:226] updating hosts: map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]] I0723 06:57:31.248396 9870 hosts.go:84] hosts update: primary=map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]], fallbacks=map[], final=map[172.20.40.154:[etcd-events-a.internal.cluster.example.com]] I0723 06:57:31.255880 9870 newcluster.go:120] starting new etcd cluster with [etcdClusterPeerInfo{peer=peer{id:"etcd-events-a" endpoints:"172.20.40.154:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-a" peer_urls:"https://etcd-events-a.internal.cluster.example.com:2381" client_urls:"https://etcd-events-a.internal.cluster.example.com:4002" quarantined_client_urls:"https://etcd-events-a.internal.cluster.example.com:3995" > }] W0723 06:57:31.256154 9870 controller.go:149] unexpected error running etcd cluster reconciliation loop: error from JoinClusterRequest (prepare) from peer "peer{id:\"etcd-events-a\" 
endpoints:\"172.20.40.154:3997\" }": rpc error: code = Unknown desc = concurrent prepare in progress "IMjKIesPIN0eQaFdP3SurA"

hakman commented 4 years ago

The suggestions above were in case you want to investigate further on your own and track down the issue. From my point of view, what you are attempting is more of a curiosity. If you do want to try something, remove the cluster completely, undelete the objects in the versioned S3 bucket, and try to recover from S3 as explained in that doc (a rough sequence is sketched below). Even then, a rolling update may be required.
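A rough sketch of that sequence, assuming a placeholder state-store bucket `example-state-store` and the cluster name `cluster.example.com` (both stand-ins, not the real names from this issue):

```
# 1. Remove whatever is left of the half-deleted cluster.
kops delete cluster --name cluster.example.com \
  --state s3://example-state-store --yes

# 2. Undelete the state-store objects in the versioned bucket
#    (see the aws s3api sketch further down), then verify they are back.
aws s3 ls s3://example-state-store/cluster.example.com/ --recursive

# 3. Re-create the cloud resources from the restored state store.
kops update cluster --name cluster.example.com \
  --state s3://example-state-store --yes

# 4. A rolling update may still be needed so instances pick up the new state.
kops rolling-update cluster --name cluster.example.com \
  --state s3://example-state-store --yes
```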

bravitejareddy commented 4 years ago

Sure @hakman, thank you for the quick response, I really appreciate it. I am planning several approaches so that, whatever the disaster (the cluster deleted, or the S3 state store deleted), I can restore the cluster back to normal along with all its deployments. In the case where the S3 state store is deleted, how would I recover the cluster together with all the apps that were deployed on it?

hakman commented 4 years ago
  1. It depends on whether your cluster has persistent storage, which may be deleted during cluster deletion and is (of course) unrecoverable in that case.
  2. A cluster that is half deleted is in an unknown state; there could be leftovers that no longer work correctly. This is why I suggest deleting it fully, restoring the S3 bucket, and trying again.
  3. The S3 bucket should have versioning enabled, so you can also undelete files (a sketch with the AWS CLI follows this list): https://docs.aws.amazon.com/AmazonS3/latest/user-guide/undelete-objects.html
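A minimal sketch of undeleting the state-store objects in a versioned bucket: removing the delete markers makes the previous object versions current again. Bucket name, prefix, and version ID below are placeholders.

```
# List the delete markers left by the accidental deletion.
aws s3api list-object-versions \
  --bucket example-state-store \
  --prefix cluster.example.com/ \
  --query 'DeleteMarkers[?IsLatest==`true`].[Key,VersionId]' \
  --output text

# Deleting a delete marker "undeletes" the object: the prior version becomes current.
aws s3api delete-object \
  --bucket example-state-store \
  --key cluster.example.com/config \
  --version-id "EXAMPLE-VERSION-ID"
```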
bravitejareddy commented 4 years ago

For option 2 (the unknown state): do you mean deleting the old cluster's entire S3 state store, since that cluster is in an unknown state? If we delete the state store from S3, can we not recover at all?

For the 3rd one, maybe I need to enable versioning on the kops state store S3 bucket, create a backup S3 bucket, and keep the two buckets in sync. In case of failure, I could then delete the old cluster and its kops state store and restore from the backup state store.

Any idea whether that would work? (A sketch of that setup follows.)
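A sketch of that setup, with hypothetical bucket names:

```
# Turn on versioning for the primary kops state store.
aws s3api put-bucket-versioning \
  --bucket example-state-store \
  --versioning-configuration Status=Enabled

# Keep a second bucket in sync as a backup copy of the state store.
aws s3 sync s3://example-state-store s3://example-state-store-backup

# If the primary store is ever lost, point kops at the backup copy.
export KOPS_STATE_STORE=s3://example-state-store-backup
kops get clusters
```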

bravitejareddy commented 4 years ago

@hakman I created a test kops cluster with versioning enabled on the S3 bucket and deployed a couple of apps. Before deleting the cluster, I created another bucket in the same region and ran an AWS s3 sync between the two buckets. Once the sync completed, I ran kops delete cluster, which also deleted the state store in the original S3 bucket.

To restore the cluster with all the apps, I ran kops update cluster against the synced bucket (the second, backup bucket). I was able to bring the cluster up, but my old deployments were gone.

To test this further, I created another kops cluster, deployed apps, synced the state store to another S3 bucket, and then deleted the cluster. This time, before running kops update, I downloaded etcd-manager-ctl and ran the list-backups command for both the main and events etcd clusters against the newly synced state store bucket; I could see the old backups and performed the restore. After the restore, I ran kops update with the new bucket name. The cluster came up again, but the deployments I had on the old (deleted) cluster were still gone...

Any idea what I am missing?

Note: versioning is enabled on both S3 buckets.
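For reference, the backup listing and restore steps described above would look roughly like this with etcd-manager-ctl. The bucket path is a placeholder and the flag spelling should be checked against the kops etcd backup/restore docs for the version of etcd-manager-ctl in use; the backup names are the ones visible in the logs above.

```
# List available backups for the "main" and "events" etcd clusters.
etcd-manager-ctl --backup-store=s3://example-state-store/cluster.example.com/backups/etcd/main list-backups
etcd-manager-ctl --backup-store=s3://example-state-store/cluster.example.com/backups/etcd/events list-backups

# Queue a restore of a specific backup; etcd-manager on the control-plane node
# picks the command up on a later reconciliation loop (the "got restore-backup
# command" entries in the logs above show this happening).
etcd-manager-ctl --backup-store=s3://example-state-store/cluster.example.com/backups/etcd/main restore-backup 2020-07-21T00:02:40Z-003982
etcd-manager-ctl --backup-store=s3://example-state-store/cluster.example.com/backups/etcd/events restore-backup 2020-07-21T00:01:14Z-003983
```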

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/kops/issues/9600#issuecomment-751228835):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.