Closed — jlsan92 closed this issue 3 years ago.
It looks like you might have had a control plane node start up using a 1.18.3 launch configuration/template after you ran "kops update cluster" to take it to 1.19.2. Next time you'll need to avoid that race condition and make sure new control plane nodes come up with the 1.19.2 launch configuration/template.
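If it helps, you can check which launch configuration/template version the existing control-plane instances actually came up with. A rough sketch with the AWS CLI (the ASG name here is only a placeholder, not taken from this issue):

# Show which launch configuration/template each control-plane instance was started from
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names master-us-east-1a.masters.my.cluster.net \
  --query 'AutoScalingGroups[].Instances[].[InstanceId,LaunchConfigurationName,LaunchTemplate.Version]' \
  --output table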
Hey @johngmyers thx for the reply. I'm pretty sure no control plane nodes have started after running "update cluster" with 1.19.2.
It's as if the current ones pick up the newly issued certificate from the S3 bucket holding the kops state config.
I also see cilium crashing badly across the worker nodes.
As I mentioned, there isn't even a chance to run "rolling-update cluster" without hitting this issue.
I can't think of what would be updating the control plane nodes from S3.
In any case, this is fixed by fdc61b4bdbffe33cc28340fa553668b423d5e83e, which is in 1.21. It might be worth a backport.
So we have a fix, but I'm also trying to reproduce the problem so this won't happen again.
I'm trying to create a cluster with kops 1.18.3 and k8s 1.18.18, install the cluster, then move to kops 1.19.2, run a kops update, then kops update cluster --yes, then reuse the same kubeconfig (still works), then do a rolling update with 1.19.2 (still works).
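Roughly, the sequence I'm running, as a sketch (cluster name, zone, and flags here are illustrative, not the exact ones used):

# With the kops 1.18.3 binary
kops create cluster --name my.cluster.example.com --zones us-east-1a --kubernetes-version 1.18.18 --yes
kops validate cluster --name my.cluster.example.com

# Switch to the kops 1.19.2 binary
kops update cluster --name my.cluster.example.com
kops update cluster --name my.cluster.example.com --yes
kubectl get nodes    # the existing kubeconfig still works at this point
kops rolling-update cluster --name my.cluster.example.com --yes    # still works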
I'm sure the problem is real, but I'm not entirely sure why I can't reproduce it. Anyone have any ideas?
Somehow a 1.18 apiserver needs to pick up the reissued certificate.
I think you might need to run kops update cluster --target terraform, and maybe update a master before applying.
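For reference, the terraform-target flow I mean looks roughly like this (sketch only; the output directory is an assumption):

kops update cluster --name my.cluster.example.com --target terraform --out ./out/terraform
cd ./out/terraform
terraform plan
terraform apply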
Also, try having S3 as the kops state store. Once the Issuing new certificate: "master" message appears, new certs show up at
s3://kops-state-store/my.cluster.net/pki/issued/master/XXXX.crt
and
s3://kops-state-store/my.cluster.net/pki/private/master/XXXX.key
Somehow they get picked up by the current master nodes (no rollout yet) and the problem starts.
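A quick way to watch for the reissued certs is to list those prefixes before and after running kops update cluster (bucket and cluster names are the same placeholders as above):

aws s3 ls s3://kops-state-store/my.cluster.net/pki/issued/master/
aws s3 ls s3://kops-state-store/my.cluster.net/pki/private/master/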
@jlsan92 which target were you using?
@jlsan92 I'm wondering if your nodeup is looping and running constantly for you. If you're able to SSH to a control-plane (master) node, could you look at journalctl -fu kops-configuration and see if it is still running? It should run once at bootup, and then should be idle.
Another possibility is that you got very unlucky and a control-plane node happened to restart at "just the wrong" moment, but that seems very unlikely.
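If you want to check quickly whether nodeup is still active, something like this on a control-plane node should show it (assuming the unit is named kops-configuration, as above):

systemctl status kops-configuration.service
journalctl -u kops-configuration.service --no-pager | tail -n 50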
Terraform is the target cc @johngmyers
Thanks for the reply @justinsb. I just checked, and I see this looping:
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.950431 1451862 service.go:108] querying state of service "protokube.service"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: W0510 14:48:38.951352 1451862 service.go:201] Unknown WantedBy="local-fs.target"; will treat as not enabled
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.951495 1451862 changes.go:81] Field changed "Enabled" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.951633 1451862 service.go:108] querying state of service "usr-lib64-modules.mount"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.953861 1451862 changes.go:81] Field changed "Running" actual="true" expected="false"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.954049 1451862 service.go:329] Restarting service "kubelet.service"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.973696 1451862 changes.go:81] Field changed "Running" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.973718 1451862 changes.go:81] Field changed "Enabled" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.973766 1451862 service.go:329] Restarting service "disable-automatic-updates.service"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: W0510 14:48:38.977475 1451862 service.go:201] Unknown WantedBy="basic.target"; will treat as not enabled
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.977509 1451862 changes.go:81] Field changed "Enabled" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.977559 1451862 service.go:227] extracted dependency from "ExecStart=/opt/kops/bin/iptables-setup": "/opt/kops/bin/iptables-setup"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.977583 1451862 service.go:108] querying state of service "kubernetes-iptables-setup.service"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.980100 1451862 changes.go:81] Field changed "Running" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.980122 1451862 changes.go:81] Field changed "Enabled" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.980167 1451862 service.go:329] Restarting service "sync-etcd-internal.service"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.983885 1451862 changes.go:81] Field changed "Running" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.983906 1451862 changes.go:81] Field changed "Enabled" actual="false" expected="true"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.983946 1451862 service.go:329] Restarting service "create-overlay-dirs.service"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: W0510 14:48:38.989028 1451862 service.go:311] service was running, but did not have ExecMainStartTimestamp: "usr-lib64-modules.mount"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.989080 1451862 service.go:340] Enabling service "usr-lib64-modules.mount"
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.989499 1451862 service.go:321] will not restart service "protokube.service" - started after dependencies
May 10 14:48:38 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:38.989619 1451862 service.go:340] Enabling service "protokube.service"
May 10 14:48:39 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:39.000606 1451862 service.go:321] will not restart service "kubernetes-iptables-setup.service" - started after dependencies
May 10 14:48:39 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:39.000637 1451862 service.go:340] Enabling service "kubernetes-iptables-setup.service"
May 10 14:48:40 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:40.510340 1451862 service.go:340] Enabling service "create-overlay-dirs.service"
May 10 14:48:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:48:41.170610 1451862 service.go:340] Enabling service "disable-automatic-updates.service"
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: W0510 14:49:41.079639 1451862 executor.go:136] error running task "Service/sync-etcd-internal.service" (8m57s remaining to succeed): error doing systemd restart sync-etcd-internal.service: exit status 1
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: Output: Job for sync-etcd-internal.service failed because the control process exited with error code.
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: See "systemctl status sync-etcd-internal.service" and "journalctl -xe" for details.
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:49:41.079711 1451862 executor.go:111] Tasks: 96 done / 97 total; 1 can run
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:49:41.079739 1451862 executor.go:182] Executing task "Service/sync-etcd-internal.service": Service: sync-etcd-internal.service
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:49:41.079820 1451862 service.go:108] querying state of service "sync-etcd-internal.service"
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:49:41.089578 1451862 changes.go:81] Field changed "Running" actual="false" expected="true"
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:49:41.089597 1451862 changes.go:81] Field changed "Enabled" actual="false" expected="true"
May 10 14:49:41 ip-172-X-X-X.ec2.internal nodeup[1451862]: I0510 14:49:41.089693 1451862 service.go:329] Restarting service "sync-etcd-internal.service"
The last lines don't look good: sync-etcd-internal.service keeps restarting. Might that be it?
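To dig into why that unit keeps failing, its own status and logs would be the next thing to check. sync-etcd-internal.service doesn't look like a stock kops unit, so this is just a generic sketch:

systemctl status sync-etcd-internal.service
journalctl -u sync-etcd-internal.service --no-pager -n 100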
And, no, I made sure my masters didn't rotate before/during the kops update cluster ...
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
After running kops update cluster ... I notice Issuing new certificate: "master". Then, after a couple of seconds, kubectl and even kops rolling-update cluster fail with:
To be able to connect again, I need to run kops update cluster again, but with Kops 1.18.3 (the version I want to move from). Same behaviour: a couple of seconds after, I can connect back to my cluster's API. Seems impossible to upgrade to Kops 1.19. Not sure if kops rolling-update cluster --cloudonly would return a healthy cluster.
Not sure if I'm missing an important step, but the Required Actions don't mention anything regarding certificates.
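For clarity, a sketch of the only recovery that has worked so far, which is re-running the update with the old binary (the cluster name is a placeholder):

# Using the kops 1.18.3 binary again
kops update cluster --name my.cluster.net --yes
# A few seconds later the API is reachable again
kubectl get nodes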
1. What kops version are you running? The command kops version will display this information.
Version 1.19.2 (git-e288df46e173ba8ce44ac52502283206c0d211ee)
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
5. What happened after the commands executed?
Normal kops update cluster output, except for a new line => 57165 vfs_castore.go:590] Issuing new certificate: "master"
6. What did you expect to happen?
Normal kops update cluster behaviour without rotating the master's certs
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
8. Please run the commands with the most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know?