kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

1.14.10 upgrade fails with etcd mismatch (kops 1.14.1) #9515

Closed mbolek closed 4 years ago

mbolek commented 4 years ago

1. What kops version are you running? The command kops version, will display this information. Version 1.14.1 (git-b7c25f9a9)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag. 1.14.10

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue? Upgrading k8s 1.13.12 to 1.14.10 with kops 1.13.2 -> 1.14.1 and terraform. Ran:

kops update cluster --out=. --target=terraform
terraform apply
kops rolling-update cluster --yes

5. What happened after the commands executed? cluster failed validation after master rollout

6. What did you expect to happen? cluster to validate

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops/v1alpha2
kind: Cluster               
metadata:                              
  creationTimestamp: 2019-05-28T06:53:41Z
  generation: 2        
  name: cluster.k8s.local 
spec:                                           
  api:                                   
    loadBalancer:                                          
      type: Public      
  authorization:       
    rbac: {}                
  channel: stable
  cloudLabels:                             
    env: name
    kops: "true"
  cloudProvider: aws
  configBase: s3://name.dev.kops/name.k8s.local
  etcdClusters:   
  - cpuRequest: 200m                                                                                                                                                                                                                                                                      
    etcdMembers:
    - instanceGroup: master-us-east-1d
      name: d
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-us-east-1d
      name: d
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    admissionControl:
    - PodSecurityPolicy
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.14.10
  masterInternalName: api.internal.name.k8s.local
  masterPublicName: api.name.k8s.local
  networkCIDR: 172.20.0.0/16
  networkID: vpc-asdasdasd
  networking:
    weave:
      mtu: 8912
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    id: subnet-7f282124
    name: us-east-1d
    type: Private
    zone: us-east-1d
  - cidr: 172.20.64.0/19
    id: subnet-4dda9172
    name: us-east-1e
    type: Private
    zone: us-east-1e
  - cidr: 172.20.96.0/19
    id: subnet-44ab464b
    name: us-east-1f
    type: Private
    zone: us-east-1f
  - cidr: 172.20.0.0/22
    id: subnet-d5141d1e
    name: utility-us-east-1d
    type: Utility
    zone: us-east-1d
  - cidr: 172.20.4.0/22
    id: subnet-87d89az8
    name: utility-us-east-1e
    type: Utility
    zone: us-east-1e
  - cidr: 172.20.8.0/22
    id: subnet-bca964c3
    name: utility-us-east-1f
    type: Utility
    zone: us-east-1f
  topology:
    dns:
      type: Public
    masters: private
    nodes: private
---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-05-28T06:53:42Z
  generation: 3
  labels:
    kops.k8s.io/cluster: name.k8s.local
  name: master-us-east-1d
spec:
  image: amazon.com/amzn2-ami-hvm-2.0.20200207.1-x86_64-gp2
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1d
  role: Master
  subnets:
  - us-east-1d

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-05-28T06:53:42Z
  generation: 6
  labels:
    kops.k8s.io/cluster: name.k8s.local
  name: nodes
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/name.k8s.local: ""
    k8s.io/cluster-autoscaler/enabled: ""
  image: amazon.com/amzn2-ami-hvm-2.0.20200207.1-x86_64-gp2
  machineType: c5.xlarge
  maxSize: 3
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - us-east-1d
  - us-east-1e
  - us-east-1f

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-05-29T06:23:22Z
  generation: 7
  labels:
    kops.k8s.io/cluster: name.k8s.local
  name: writer-nodes
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/name.k8s.local: ""
    k8s.io/cluster-autoscaler/enabled: ""
  image: amazon.com/amzn2-ami-hvm-2.0.20200207.1-x86_64-gp2
  machineType: c5.xlarge
  maxSize: 2
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: writer-nodes
  role: Node
  subnets:
  - us-east-1d
  - us-east-1e
  - us-east-1f

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know? apiserver doesn't start, it fails with:

W0707 07:46:05.856159       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
I0707 07:46:06.834736       1 client.go:352] parsed scheme: ""
I0707 07:46:06.834753       1 client.go:352] scheme "" not registered, fallback to default scheme
I0707 07:46:06.834792       1 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:4001 0  <nil>}]
I0707 07:46:06.834833       1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:4001 <nil>}]
W0707 07:46:06.838959       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0707 07:46:06.855854       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0707 07:46:07.840251       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0707 07:46:08.720583       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0707 07:46:09.736956       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0707 07:46:11.148159       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0707 07:46:11.853048       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0707 07:46:15.010850       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0707 07:46:15.670238       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0707 07:46:20.398223       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0707 07:46:22.402085       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0707 07:46:25.851237       1 asm_amd64.s:1337] Failed to dial 127.0.0.1:4001: context canceled; please retry.
F0707 07:46:25.851203       1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry {[https://127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true 0xc0007bd830 apiextensions.k8s.io/v1beta1 <nil> 5m0s 1m0s}), err (context deadline exceeded)

even though cert seems ok

        Validity
            Not Before: May 26 06:53:50 2019 GMT
            Not After : Jul  7 07:15:01 2021 GMT

and etcd boots with:

I0707 07:16:12.578029    4825 etcdserver.go:534] starting etcd with state cluster:<cluster_token:"Mbgaas2t6d-2KJFkq1RofQ" nodes:<name:"etcd-d" peer_urls:"https://etcd-d.internal.cluster.k8s.local:2380" client_urls:"https://etcd-d.internal.cluster.k8s.local:4001" quarantined_client_urls:"https://etcd-d.internal.cluster.k8s.local:3994" tls_enabled:true > > etcd_version:"3.2.24"

when I think it should be running 3.3.10 (it's 3.3.10 in the Launch Configuration)

I0707 07:16:45.341190    4825 controller.go:417] mismatched version for peer peer{id:"etcd-d" endpoints:"172.20.61.197:3996" }: want "3.3.10", have "3.2.24"

the etcd mounted volumes (main and events) have 3.2.24 in the state file so maybe that's the reason? I expected it to simply update in-place to 3.3.10

Should I have updated etcd manually to 3.3.10 and then start k8s upgrade?
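For anyone scanning etcd-manager logs for the same symptom, here is a minimal sketch that pulls the wanted and actual etcd versions out of a "mismatched version" line like the one quoted above. The regular expression and the `parse_mismatch` helper are illustrative assumptions based only on that single log line, not part of kops or etcd-manager.

```python
import re

# Assumption: the pattern is derived from the one log line quoted in this issue;
# other etcd-manager versions may format the message differently.
MISMATCH_RE = re.compile(
    r'mismatched version for peer .*: want "(?P<want>[^"]+)", have "(?P<have>[^"]+)"'
)


def parse_mismatch(line):
    """Return (want, have) if the line reports a version mismatch, else None."""
    match = MISMATCH_RE.search(line)
    if match is None:
        return None
    return match.group("want"), match.group("have")


LINE = (
    'I0707 07:16:45.341190    4825 controller.go:417] mismatched version for peer '
    'peer{id:"etcd-d" endpoints:"172.20.61.197:3996" }: want "3.3.10", have "3.2.24"'
)

print(parse_mismatch(LINE))
```

A mismatch on its own only says that the running etcd is not yet at the configured version; as the rest of this thread shows, it was not the root cause here.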

hakman commented 4 years ago

@mbolek could you try setting the version to 3.3.10 for both etcd clusters manually and see if that helps?

spec:
  etcdClusters:
  - cpuRequest: 200m
    ...
    version: 3.3.10
  - cpuRequest: 100m
    ...
    version: 3.3.10

mbolek commented 4 years ago

I think I've got it (and maybe it's related to the general etcd issue with certs?): etcd tries to run with a cert from the EBS volume, and that cert has expired :(

[root@ip-172-20-33-210 ~]# openssl x509 -in /mnt/master-vol-0affaaafe8ae78f4d/pki/MbgaZ62t6d-2KJFas1RofQ/peers/me.crt -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 6456588394770812402 (0x599a68efc3394df2)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=etcd-peers-ca-main
        Validity
            Not Before: May 26 06:53:50 2019 GMT
            Not After : May 27 07:44:51 2020 GMT
        Subject: CN=etcd-d

@hakman I think I had it set to 3.3.10 in the etcd settings. I can't really tell right now because I've tried to roll back, but it seems the issue was possibly the certs all along.
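The check above can be sketched in a few lines: parse the Validity block from `openssl x509 -text -noout` output and test it against the time of the apiserver failure. This is a rough sketch under stated assumptions, not a kops tool. The helper names are hypothetical, the year 2020 is inferred (klog timestamps such as `W0707 07:46:05` carry no year), and the log time is assumed to be UTC.

```python
import re
from datetime import datetime, timezone


def parse_validity(openssl_text):
    """Return (not_before, not_after) as UTC datetimes from `openssl x509 -text` output."""
    found = {}
    for label in ("Not Before", "Not After"):
        match = re.search(label + r"\s*:\s*(.+?)\s+GMT", openssl_text)
        if match is None:
            raise ValueError(f"{label!r} not found in certificate text")
        # openssl pads single-digit days ("Jul  7"), so collapse repeated spaces.
        stamp = " ".join(match.group(1).split())
        parsed = datetime.strptime(stamp, "%b %d %H:%M:%S %Y")
        found[label] = parsed.replace(tzinfo=timezone.utc)
    return found["Not Before"], found["Not After"]


def is_valid_at(openssl_text, when):
    """True if `when` falls inside the certificate's validity window."""
    not_before, not_after = parse_validity(openssl_text)
    return not_before <= when <= not_after


# Validity block of the etcd peer cert found on the EBS volume (from this thread).
PEER_CERT = """
        Validity
            Not Before: May 26 06:53:50 2019 GMT
            Not After : May 27 07:44:51 2020 GMT
"""

# Validity block of the cert that "seems ok" in the original report.
CLIENT_CERT = """
        Validity
            Not Before: May 26 06:53:50 2019 GMT
            Not After : Jul  7 07:15:01 2021 GMT
"""

# Assumption: year 2020 and UTC, inferred from the apiserver log line W0707 07:46:05.
FAILURE_TIME = datetime(2020, 7, 7, 7, 46, 5, tzinfo=timezone.utc)

print("peer cert valid at failure time:", is_valid_at(PEER_CERT, FAILURE_TIME))
print("client cert valid at failure time:", is_valid_at(CLIENT_CERT, FAILURE_TIME))
```

Under those assumptions the peer cert on the volume comes out expired while the other cert is still valid, which matches the x509 handshake errors in the apiserver log.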

hakman commented 4 years ago

Great. One more thing: you should probably use kops 1.17.1 to manage your cluster even if you run an older version of k8s. There have been many bug fixes since 1.14 and, if you get into trouble, you may have to switch to a newer version of kops anyway. (This is not to be confused with migrating directly to k8s 1.17.)

mbolek commented 4 years ago

Yup, it was the certs all along. I got sidetracked because I didn't know it would use the cert from the EBS volume :/ I've recreated the cert as per the current etcd advisory, and 1.13.12 stood up. I'm updating to 1.14.10 now and I expect it to work. As for 1.17, I planned to do so but understood there were some major changes in 1.14 -> 1.15 for kops, so I wanted to build it up gradually. Will move ASAP. Thanks for a super quick reply @hakman :+1: