kubernetes/kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Failed Kops upgrade 1.18.3 => 1.19.2 - Issuing new certificate: "master" #11370

Closed: jlsan92 closed this issue 3 years ago

jlsan92 commented 3 years ago

After running kops update cluster ..., I noticed Issuing new certificate: "master" in the output. A couple of seconds later, kubectl and even kops rolling-update cluster fail with

Unable to connect to the server: x509: certificate is valid for service-account, not api.my.cluster.net
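
(A quick way to confirm which certificate the API endpoint is actually serving; a diagnostic sketch, assuming openssl is available on the workstation, with the hostname taken from the error above.)

# Print the Subject and SAN entries of the certificate presented by the API load balancer
echo | openssl s_client -connect api.my.cluster.net:443 -servername api.my.cluster.net 2>/dev/null \
  | openssl x509 -noout -text | grep -E 'Subject:|DNS:'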

To be able to connect again, I need to run kops update cluster again, but with Kops 1.18.3 (the version I want to move from). Same behaviour: a couple of seconds later I can connect to my cluster's API again. It seems impossible to upgrade to Kops 1.19. I'm not sure whether kops rolling-update cluster --cloudonly would return a healthy cluster.

I'm not sure if I'm missing an important step, but the Required Actions section in the release notes doesn't mention anything regarding certificates.

1. What kops version are you running? The command kops version will display this information.

Version 1.19.2 (git-e288df46e173ba8ce44ac52502283206c0d211ee)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.18", GitCommit:"6f6ce59dc8fefde25a3ba0ef0047f4ec6662ef24", GitTreeState:"clean", BuildDate:"2021-04-15T03:23:41Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops update cluster --name $KOPS_CLUSTER_NAME --lifecycle-overrides IAMRole=ExistsAndWarnIfChanges,IAMRolePolicy=ExistsAndWarnIfChanges,IAMInstanceProfileRole=ExistsAndWarnIfChanges --out=config/terraform --target=terraform
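
(For context, and implied by --target=terraform rather than stated in the report: the generated configuration is applied separately with Terraform, typically along these lines, using the --out directory from the command above.)

cd config/terraform
terraform init
terraform plan    # review the changes, e.g. updated launch templates/configurations
terraform apply   # only this step actually modifies the AWS resources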

5. What happened after the commands executed?

Normal kops update cluster output except for a new line => 57165 vfs_castore.go:590] Issuing new certificate: "master"

6. What did you expect to happen?

Normal kops update cluster behaviour without rotating master's certs

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: my.cluster.net
spec:
  additionalPolicies: {}
  addons:
  - manifest: s3://my-kops-state-store/my.cluster.net/addons/flux.addons.weave.works/addon.yaml
  - manifest: s3://my-kops-state-store/my.cluster.net/addons/helm-operator.addons.weave.works/addon.yaml
  - manifest: s3://my-kops-state-store/my.cluster.net/addons/dns-controller-internal.addons.k8s.io/addon.yaml
  api:
    loadBalancer:
      crossZoneLoadBalancing: true
      idleTimeoutSeconds: 600
      type: Public
  authentication:
    aws: {}
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    Cluster: my.cluster.net
  cloudProvider: aws
  configBase: s3://my-kops-state-store/my.cluster.net
  dnsZone: XXXXXXXXX
  etcdClusters:
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-east-1a
      name: member-a
    - encryptedVolume: true
      instanceGroup: master-us-east-1b
      name: member-b
    - encryptedVolume: true
      instanceGroup: master-us-east-1c
      name: member-c
    name: main
    provider: Manager
    version: 3.3.10
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-east-1a
      name: member-a
    - encryptedVolume: true
      instanceGroup: master-us-east-1b
      name: member-b
    - encryptedVolume: true
      instanceGroup: master-us-east-1c
      name: member-c
    name: events
    provider: Manager
    version: 3.3.10
  iam:
    legacy: false
  kubeAPIServer:
    admissionControl:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - PersistentVolumeLabel
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - ResourceQuota
    - NodeRestriction
    - Priority
    - AlwaysPullImages
    auditLogMaxAge: 30
    auditLogMaxBackups: 10
    auditLogMaxSize: 100
    auditLogPath: /var/log/kube-apiserver-audit.log
    auditPolicyFile: /srv/kubernetes/audit.yaml
    authorizationMode: RBAC,Node
    disableBasicAuth: true
    enableProfiling: false
    tlsCipherSuites:
    - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
    - TLS_RSA_WITH_AES_128_CBC_SHA
    - TLS_RSA_WITH_AES_256_CBC_SHA
    - TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
  kubeControllerManager:
    enableProfiling: false
    horizontalPodAutoscalerUseRestClients: true
    tlsCipherSuites:
    - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
    - TLS_RSA_WITH_AES_128_CBC_SHA
    - TLS_RSA_WITH_AES_256_CBC_SHA
    - TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
  kubeDNS:
    provider: CoreDNS
  kubeProxy:
    metricsBindAddress: 0.0.0.0
  kubeScheduler:
    enableProfiling: false
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    kubeReserved:
      cpu: 360m
      memory: 893Mi
    protectKernelDefaults: false
    readOnlyPort: 0
    streamingConnectionIdleTimeout: 5m0s
    systemReserved:
      cpu: 100m
      memory: 100Mi
    tlsCipherSuites:
    - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
    - TLS_RSA_WITH_AES_128_CBC_SHA
    - TLS_RSA_WITH_AES_256_CBC_SHA
    - TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
  kubernetesApiAccess:
  - 222.222.222.222/32
  kubernetesVersion: 1.18.18
  masterPublicName: api.my.cluster.net
  networkCIDR: 172.20.0.0/16
  networking:
    cilium:
      IPTablesRulesNoinstall: false
      autoDirectNodeRoutes: false
      bpfCTGlobalAnyMax: 0
      bpfCTGlobalTCPMax: 0
      clusterName: ""
      cniBinPath: ""
      enableNodePort: false
      enablePrometheusMetrics: true
      enableRemoteNodeIdentity: false
      enableipv4: false
      enableipv6: false
      monitorAggregation: ""
      nodeInitBootstrapFile: ""
      preallocateBPFMaps: false
      reconfigureKubelet: false
      removeCbrBridge: false
      restartPods: false
      sidecarIstioProxyImage: ""
      toFqdnsEnablePoller: false
  nonMasqueradeCIDR: 100.64.0.0/10
  subnets:
  - cidr: 172.20.32.0/19
    name: private-us-east-1a
    type: Private
    zone: us-east-1a
  - cidr: 172.20.64.0/19
    name: private-us-east-1b
    type: Private
    zone: us-east-1b
  - cidr: 172.20.96.0/19
    name: private-us-east-1c
    type: Private
    zone: us-east-1c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my.cluster.net
  name: master-us-east-1a
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  iam:
    profile: arn:aws:iam::123456789:instance-profile/kops-master-role
  image: 075585003325/Flatcar-stable-2605.12.0-hvm
  machineType: m5.xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1a
  role: Master
  subnets:
  - private-us-east-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my.cluster.net
  name: master-us-east-1b
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  iam:
    profile: arn:aws:iam::123456789:instance-profile/kops-master-role
  image: 075585003325/Flatcar-stable-2605.12.0-hvm
  machineType: m5.xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1b
  role: Master
  subnets:
  - private-us-east-1b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my.cluster.net
  name: master-us-east-1c
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  iam:
    profile: arn:aws:iam::123456789:instance-profile/kops-master-role
  image: 075585003325/Flatcar-stable-2605.12.0-hvm
  machineType: m5.xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1c
  role: Master
  subnets:
  - private-us-east-1c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my.cluster.net
  name: nodes
spec:
  associatePublicIp: false
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "true"
    kubernetes.io/cluster/my.cluster.net: owned
  detailedInstanceMonitoring: false
  iam:
    profile: arn:aws:iam::123456789:instance-profile/kops-worker
  image: 075585003325/Flatcar-stable-2605.12.0-hvm
  machineType: m5.xlarge
  maxSize: 10
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - private-us-east-1a
  - private-us-east-1b
  - private-us-east-1c

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

johngmyers commented 3 years ago

It looks like you might have had a control plane node start up using a 1.18.3 launch configuration/template after "kops update cluster" had already been run to take it to 1.19.2. Next time you need to avoid that race condition and make sure new control plane nodes come up with the 1.19.2 launch configuration/template.

jlsan92 commented 3 years ago

Hey @johngmyers thx for the reply. I'm pretty sure no control plane nodes have started after running "update cluster" with 1.19.2.

It's like the current ones pick up the newly issued certificate from the S3 bucket holding the Kops state config.

I also see Cilium crashing badly across the worker nodes.

As I mentioned, there's not even a chance to run "rolling-update cluster" without hitting this issue.

johngmyers commented 3 years ago

I can't think of what would be updating the control plane nodes from S3.

In any case, this is fixed by fdc61b4bdbffe33cc28340fa553668b423d5e83e which is in 1.21. It might be worth a backport.
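
(One way to check which releases already contain that commit; a sketch assuming a local clone of kubernetes/kops:)

git clone https://github.com/kubernetes/kops.git
cd kops
# list every tag whose history includes the fix commit
git tag --contains fdc61b4bdbffe33cc28340fa553668b423d5e83e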

justinsb commented 3 years ago

So we have a fix, but I'm also trying to reproduce the problem so this won't happen again.

I'm trying to create a cluster with kops 1.18.3 and k8s 1.18.18, install the cluster, then move to kops 1.19.2, do a kops update, then kops update cluster --yes. Then I try reusing the same kubeconfig (still works), and then I try doing a rolling update with 1.19.2 (still works).

I'm sure the problem is real, but I'm not entirely sure why I can't reproduce it. Anyone have any ideas?
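
(For reference, the attempt described above corresponds roughly to the following; the cluster name, zone, and state store bucket are placeholders, not values from this thread:)

export KOPS_STATE_STORE=s3://example-kops-state-store
# with kops 1.18.3 on the PATH
kops create cluster --name repro.example.com --zones us-east-1a --kubernetes-version 1.18.18 --yes
# switch the binary to kops 1.19.2, then
kops update cluster --name repro.example.com           # preview
kops update cluster --name repro.example.com --yes     # apply
kubectl get nodes                                       # reuse the existing kubeconfig
kops rolling-update cluster --name repro.example.com --yes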

johngmyers commented 3 years ago

Somehow a 1.18 apiserver needs to pick up the reissued certificate.

I think you might need to kops update cluster --target terraform and maybe update a master before applying.

jlsan92 commented 3 years ago

Also, try having S3 as the Kops state store. Once the Issuing new certificate: "master" message appears, new certs appear at

s3://kops-state-store/my.cluster.net/pki/issued/master/XXXX.crt

and

s3://kops-state-store/my.cluster.net/pki/private/master/XXXX.key

And somehow they get picked up by the current master nodes (no rollout yet), and the problem starts.
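
(A simple way to watch for the reissued keypair landing in the state store; the paths are the placeholders quoted above:)

aws s3 ls s3://kops-state-store/my.cluster.net/pki/issued/master/
aws s3 ls s3://kops-state-store/my.cluster.net/pki/private/master/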

johngmyers commented 3 years ago

@jlsan92 which target were you using?

justinsb commented 3 years ago

@jlsan92 I'm wondering if your nodeup is looping and running constantly for you. If you're able to SSH to a control-plane (master) node, could you look at journalctl -fu kops-configuration and see if it is still running? It should run once at bootup, and then should be idle.

Another possibility is that you got very unlucky and a control-plane node happened to restart at "just the wrong" moment, but that seems very unlikely.

jlsan92 commented 3 years ago

> @jlsan92 which target were you using?

Terraform is the target, cc @johngmyers

> @jlsan92 I'm wondering if your nodeup is looping and running constantly for you. If you're able to SSH to a control-plane (master) node, could you look at journalctl -fu kops-configuration and see if it is still running? It should run once at bootup, and then should be idle.
>
> Another possibility is that you got very unlucky and a control-plane node happened to restart at "just the wrong" moment, but that seems very unlikely.

Thanks for the reply @justinsb, I just checked and I see this "looping":

May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.950431 1451862 service.go:108] querying state of service "protokube.service"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: W0510 14:48:38.951352 1451862 service.go:201] Unknown WantedBy="local-fs.target"; will treat as not enabled
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.951495 1451862 changes.go:81] Field changed "Enabled" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.951633 1451862 service.go:108] querying state of service "usr-lib64-modules.mount"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.953861 1451862 changes.go:81] Field changed "Running" actual="true" expected="false"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.954049 1451862 service.go:329] Restarting service "kubelet.service"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.973696 1451862 changes.go:81] Field changed "Running" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.973718 1451862 changes.go:81] Field changed "Enabled" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.973766 1451862 service.go:329] Restarting service "disable-automatic-updates.service"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: W0510 14:48:38.977475 1451862 service.go:201] Unknown WantedBy="basic.target"; will treat as not enabled
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.977509 1451862 changes.go:81] Field changed "Enabled" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.977559 1451862 service.go:227] extracted dependency from "ExecStart=/opt/kops/bin/iptables-setup": "/opt/kops/bin/iptables-setup"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.977583 1451862 service.go:108] querying state of service "kubernetes-iptables-setup.service"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.980100 1451862 changes.go:81] Field changed "Running" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.980122 1451862 changes.go:81] Field changed "Enabled" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.980167 1451862 service.go:329] Restarting service "sync-etcd-internal.service"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.983885 1451862 changes.go:81] Field changed "Running" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.983906 1451862 changes.go:81] Field changed "Enabled" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.983946 1451862 service.go:329] Restarting service "create-overlay-dirs.service"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: W0510 14:48:38.989028 1451862 service.go:311] service was running, but did not have ExecMainStartTimestamp: "usr-lib64-modules.mount"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.989080 1451862 service.go:340] Enabling service "usr-lib64-modules.mount"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.989499 1451862 service.go:321] will not restart service "protokube.service" - started after dependencies
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.989619 1451862 service.go:340] Enabling service "protokube.service"
May 10 14:48:39 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:39.000606 1451862 service.go:321] will not restart service "kubernetes-iptables-setup.service" - started after dependencies
May 10 14:48:39 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:39.000637 1451862 service.go:340] Enabling service "kubernetes-iptables-setup.service"
May 10 14:48:40 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:40.510340 1451862 service.go:340] Enabling service "create-overlay-dirs.service"
May 10 14:48:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:41.170610 1451862 service.go:340] Enabling service "disable-automatic-updates.service"
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: W0510 14:49:41.079639 1451862 executor.go:136] error running task "Service/sync-etcd-internal.service" (8m57s remaining to succeed): error doing systemd restart sync-etcd-internal.service: exit status 1
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: Output: Job for sync-etcd-internal.service failed because the control process exited with error code.
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: See "systemctl status sync-etcd-internal.service" and "journalctl -xe" for details.
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:49:41.079711 1451862 executor.go:111] Tasks: 96 done / 97 total; 1 can run
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:49:41.079739 1451862 executor.go:182] Executing task "Service/sync-etcd-internal.service": Service: sync-etcd-internal.service
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:49:41.079820 1451862 service.go:108] querying state of service "sync-etcd-internal.service"
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:49:41.089578 1451862 changes.go:81] Field changed "Running" actual="false" expected="true"
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:49:41.089597 1451862 changes.go:81] Field changed "Enabled" actual="false" expected="true"
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:49:41.089693 1451862 service.go:329] Restarting service "sync-etcd-internal.service"

The last line doesn't look good: sync-etcd-internal.service keeps restarting. Might that be it?

And, no, I made sure my masters didn't rotate before/during the kops update cluster ...
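
(For the failing unit in the log above, a couple of follow-up checks on the affected master; a sketch assuming SSH access to the node:)

# why is the unit failing, and what has it logged recently?
systemctl status sync-etcd-internal.service
journalctl -u sync-etcd-internal.service --since "1 hour ago" --no-pager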

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale