Closed: simnalamburt closed this issue 3 years ago
I can reproduce the problem using kops 1.7.5
If you run kops update cluster $NAME --target=terraform and then terraform apply it, it will actually generate a new certificate. Run kops export kubecfg $NAME after that and you end up with a working setup. I know it's not entirely straightforward.
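In other words, the full sequence is roughly (with $NAME set to your cluster name):
kops update cluster $NAME --target=terraform
terraform apply
kops export kubecfg $NAME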
I also had the same issue. I took @pastjean's advice and re-ran kops update cluster $NAME --target=terraform and then kops export kubecfg $NAME. While this updated my kube config with the proper DNS name of the API ELB, I still got an invalid cert error.
Never mind. I have to create the cluster with --target=terraform first. After running terraform apply and then updating, I get a new master cert. Previously I was creating the cluster, then updating with --target=terraform, then applying, then re-running the update, which didn't generate a new cert. So my bad on the order. Issue is resolved. Thanks.
Closing!
The bug is still valid for me, and @pastjean's solution is not working for me. I'm using an S3 remote state store; here are my versions:
$ uname -a
Darwin xxxxx 17.0.0 Darwin Kernel Version 17.0.0: Thu Aug 24 21:48:19 PDT 2017; root:xnu-4570.1.46~2/RELEASE_X86_64 x86_64
$ kops version
Version 1.7.1
$ terraform version
Terraform v0.10.8
$ aws --version
aws-cli/1.11.137 Python/2.7.10 Darwin/17.0.0 botocore/1.6.4
To reproduce, I follow the same steps @simnalamburt reported. I then run kops update cluster $NAME --target=terraform --out=. and terraform apply, but I still get an invalid certificate (it does not contain the alias of the AWS LB).
Checking the S3 store, in the folder <cluster-name>/pki/issued/master, I can see that a first certificate is created when the cluster is created with kops, and a second is added after the kops update request. The second certificate does include the LB DNS name, but it is not deployed to the master node(s).
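For anyone who wants to verify which names end up in those issued certificates, a quick check against the state store along these lines should work (bucket, cluster name, and the .crt file name are placeholders; adjust to your setup):
aws s3 ls s3://<state-store-bucket>/<cluster-name>/pki/issued/master/
aws s3 cp s3://<state-store-bucket>/<cluster-name>/pki/issued/master/<id>.crt .
openssl x509 -in <id>.crt -noout -text | grep -A1 'Subject Alternative Name'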
Here is the update command output:
kops update cluster $NAME --target=terraform --out=.
I1106 18:10:54.285184 8239 apply_cluster.go:420] Gossip DNS: skipping DNS validation
I1106 18:10:55.044907 8239 executor.go:91] Tasks: 0 done / 83 total; 38 can run
I1106 18:10:55.467860 8239 executor.go:91] Tasks: 38 done / 83 total; 15 can run
I1106 18:10:55.469345 8239 executor.go:91] Tasks: 53 done / 83 total; 22 can run
I1106 18:10:56.032321 8239 executor.go:91] Tasks: 75 done / 83 total; 5 can run
I1106 18:10:56.691785 8239 vfs_castore.go:422] Issuing new certificate: "master"
I1106 18:10:57.160535 8239 executor.go:91] Tasks: 80 done / 83 total; 3 can run
I1106 18:10:57.160867 8239 executor.go:91] Tasks: 83 done / 83 total; 0 can run
I1106 18:10:57.261829 8239 target.go:269] Terraform output is in .
I1106 18:10:57.529372 8239 update_cluster.go:247] Exporting kubecfg for cluster
Kops has set your kubectl context to ci5-test.k8s.local
Terraform output has been placed into .
Changes may require instances to restart: kops rolling-update cluster
As you can see, the log reports that the certificate is generated. I've tried doing a kops rolling-update cluster --cloudonly
as recommended, but the output is No rolling-update required.
@sybeck2k, we have also experienced this issue as of a few hours ago.
You will need to run kops rolling-update cluster --cloudonly --force --yes
to force an update. This can take a while depending on the size of the cluster, but we have found that manually setting --master-interval or --node-interval can prevent nodes from reaching a Ready state. I suggest just grabbing some ☕️ and letting the default interval do its thing.
It is still only a workaround at the moment, but we have found it to be repeatably successful.
This should be fixed in master, if someone wants to test master or wait for the 1.8 beta release.
@jlaswell thanks a lot! I can confirm your workaround works for kops 1.7.1. Could anyone point me to the details of what exactly is pulled from the state store, and when? In the docs I found this information:
Not sure about what is used when. I would bet that looking through some of the source code is best for that, but I do know that you can look in the S3 bucket used for the state store if you are using AWS. We've perused that a few times to get an understanding.
@chrislovecnm I can still reproduce this in 1.8.0-beta.1. Both steps are still required:
kops update cluster $NAME --target=terraform --out=.
kops rolling-update cluster --cloudonly --force --yes
@shashanktomar I would assume the workflow is:
What does rolling-update show?
It would be a bug if the update does not create the same hash in the Terraform code that we create in the direct target code path.
@chrislovecnm I can reproduce this in 1.8.0-beta.1 as well. As @shashanktomar reported, both steps are still required:
kops update cluster $NAME --state s3://bucket --target=terraform --out=.
kops rolling-update cluster --cloudonly --force --yes
Here is the rolling update output:
Using cluster from kubectl context: dev.xxx.k8s.local
NAME STATUS NEEDUPDATE READY MIN MAX
master-eu-west-2a Ready 0 1 1 1
nodes Ready 0 2 2 2
W1115 15:28:50.519884 16811 instancegroups.go:264] Not draining cluster nodes as 'cloudonly' flag is set.
I1115 15:28:50.519898 16811 instancegroups.go:352] Stopping instance "i-xxx", in AWS ASG "master-eu-west-2a.masters.dev.xxx.k8s.local".
W1115 15:33:50.723093 16811 instancegroups.go:293] Not validating cluster as cloudonly flag is set.
W1115 15:33:50.723189 16811 instancegroups.go:264] Not draining cluster nodes as 'cloudonly' flag is set.
I1115 15:33:50.723203 16811 instancegroups.go:352] Stopping instance "i-xxx", in AWS ASG "nodes.dev.xxx.k8s.local".
W1115 15:35:50.930041 16811 instancegroups.go:293] Not validating cluster as cloudonly flag is set.
W1115 15:35:50.930978 16811 instancegroups.go:264] Not draining cluster nodes as 'cloudonly' flag is set.
I1115 15:35:50.931003 16811 instancegroups.go:352] Stopping instance "i-xxx", in AWS ASG "nodes.dev.xxx.k8s.local".
W1115 15:37:51.117159 16811 instancegroups.go:293] Not validating cluster as cloudonly flag is set.
I1115 15:37:51.117407 16811 rollingupdate.go:174] Rolling update completed!
I reproduced this in 1.8.0
after kops create cluster ... --target=terraform
and terraform apply
I can confirm that running the following fixed it:
kops update cluster $NAME --target=terraform
kops rolling-update cluster $NAME --cloudonly --force --yes
More detail, please.
I am having the same problem here (see version info below). The workaround does indeed work, but it takes way too long to complete. It would be great if this could be resolved.
kops version
Version 1.8.0 (git-5099bc5)
As above, this is still broken in:
Version 1.8.1 (git-94ef202)
Generally, as I understand it, the workaround flow is:
kops create cluster $NAME --target=terraform --out=.
terraform apply
kops rolling-update cluster $NAME --cloudonly --force --yes
(around 20 minutes with 3 masters and 3 nodes). After that it should work, but I had to re-export the kops config with
kops export kubecfg $NAME
and now it works for both kops and kubectl.
Are there any ideas on how to resolve this? I was also wondering whether, in general, the gossip-based approach is inferior to the DNS approach.
The fix using rolling-update did not work for me.
Version 1.9.0 (git-cccd71e67)
@Mosho1 did you export the config?
Can you check whether the server entry in ~/.kube/config points to an external endpoint?
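For reference, a quick way to check which endpoint the current config points at:
grep 'server:' ~/.kube/config
# after a successful export on a gossip cluster this should show the ELB address, not api.<cluster>.k8s.local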
@mbolek yeah, it did, though I have already brought down that cluster and used kops
directly instead.
FYI: still broken in 1.9.0.
In 1.9.1 too. I am running a gossip-based cluster (.local) and was able to work around this issue by following the comments above.
# assume that you have already applied terraform once and the ELB for the kube API has been created on AWS
# make sure to export kubecfg before applying terraform so that the LC is configured with the exported cfg
kops export kubecfg --name $NAME
kops update cluster $NAME --target=terraform --out=.
terraform plan
terraform apply
kops rolling-update cluster $NAME --cloudonly --force --yes
If it keeps failing, you might add insecure-skip-tls-verify: true to the cluster entry in ~/.kube/config, but this is usually not recommended.
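If you do go that route, a minimal sketch for toggling it from the CLI instead of hand-editing the file (assuming the cluster entry in your kubeconfig is named after the cluster, as kops does by default):
kubectl config set-cluster $NAME --insecure-skip-tls-verify=true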
Who wants to do a rolling-update straight after provisioning a cluster? kops should provision the correct server entries in the kubectl config file in the first place. Given that kops creates a DNS entry just fine with a sensible name, e.g. api.cluster.mydomain.net (as an alias record to the ELB/ALB), why isn't kops export kubecfg using the alias record in the server field instead of the ELB? This alias record is already in the certificate, as the OP says, and if kops generates a kubectl config entry using server: https://[alias record], then it works just fine, and no rolling-updates or post-shenanigans are needed.
This should work out of the box.
#kops version
Version 1.9.2 (git-cb54c6a52)
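As a stopgap for a DNS-based cluster, one could point the existing kubeconfig entry at the alias record by hand; a minimal sketch using the hypothetical names from the comment above:
kubectl config set-cluster cluster.mydomain.net --server=https://api.cluster.mydomain.net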
Ok... so I thought I had something, but it seems the issue persists. You need to export the config to fix the API server endpoint, and you need to roll the master to fix the SSL cert.
Another workaround that does not require waiting for the master(s) to roll is to create the ELB first, then update the cluster, and then do the rest of the terraform apply. The steps are:
terraform apply -target aws_internet_gateway.CLUSTERNAME-k8s-local
terraform apply -target aws_elb.api-CLUSTERNAME-k8s-local
kops update cluster --out=. --target=terraform
terraform apply
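If you are unsure of the exact resource addresses to pass to -target, they can be read from the kubernetes.tf that kops generates; for example (names vary with the cluster name):
grep -E 'resource "aws_(internet_gateway|elb)"' kubernetes.tf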
@mbolek the issue indeed persists. kops version: Version 1.10.0
@drzero42 - Thanks for the tip! It works, but you forgot the -target prefix on the second apply step, i.e.:
Create ELB: terraform apply aws_elb.api-CLUSTERNAME-k8s-local
should be: Create ELB: terraform apply -target aws_elb.api-CLUSTERNAME-k8s-local
@CosmoPennypacker Absolutely right, good catch. I've updated my comment so people can more easily copy/paste from it ;)
For our creation script we have implemented this workaround; however, since we don't care about downtime, we do the following to speed up creation from ~20 minutes to ~5 minutes:
kops rolling-update cluster --cloudonly --force --master-interval=1s --node-interval=1s --yes
# Wait until all nodes come back online before marking complete
until kops validate cluster --name ${CLUSTER_NAME} > /dev/null 2>&1
do
  echo -e "\033[1;93mWaiting until cluster comes back online\033[0m"
  sleep 5
done
echo -e "\033[1;92mCluster Creation Complete!\033[0m"
Technically, if you template the steps that generate the API certificate, you could feed the ELB DNS name output from Terraform to the script before it initially generates the certificate and stores it in the state store.
@teagy-cr and do you know how to do that?
This is still a valid issue, using the workaround outlined above.
This issue still persists...
This issue still persists... kops version = 1.11.1
kops validate cluster
Using cluster from kubectl context: milkyway.k8s.local
Validating cluster milkyway.k8s.local
unexpected error during validation: error listing nodes: Get https://api.milkyway.k8s.local/api/v1/nodes: dial tcp: lookup api.milkyway.k8s.local on 192.168.88.1:53: no such host
The configuration generated by kops and terraform continues to treat the API endpoint as the .k8s.local DNS name rather than using the ELB.
@lieut-data, @gabrieljackson
# kops validate cluster dev2.k8s.local
Validating cluster dev2.k8s.local
unexpected error during validation: error listing nodes: Get https://api.dev2.k8s.local/api/v1/nodes: dial tcp: lookup api.dev2.k8s.local on 192.168.1.1:53: no such host
# kops version
Version 1.12.1 (git-e1c317f9c)
:( and then
kops update cluster dev2.k8s.local --target=terraform --out=.
I0528 09:50:11.688046 5454 apply_cluster.go:559] Gossip DNS: skipping DNS validation
I0528 09:50:13.463433 5454 executor.go:103] Tasks: 0 done / 95 total; 46 can run
I0528 09:50:14.477643 5454 executor.go:103] Tasks: 46 done / 95 total; 27 can run
I0528 09:50:15.245765 5454 executor.go:103] Tasks: 73 done / 95 total; 18 can run
I0528 09:50:16.011278 5454 executor.go:103] Tasks: 91 done / 95 total; 3 can run
I0528 09:50:17.764038 5454 vfs_castore.go:729] Issuing new certificate: "master"
I0528 09:50:19.326700 5454 executor.go:103] Tasks: 94 done / 95 total; 1 can run
I0528 09:50:19.327570 5454 executor.go:103] Tasks: 95 done / 95 total; 0 can run
I0528 09:50:19.348221 5454 target.go:312] Terraform output is in .
I0528 09:50:19.573892 5454 update_cluster.go:291] Exporting kubecfg for cluster
kops has set your kubectl context to dev2.k8s.local
Terraform output has been placed into .
Changes may require instances to restart: kops rolling-update cluster
So it still has to recreate the master cert.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
This issue still persists.
# kops version
Version 1.13.0 (git-be5fb9019)
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
Still seeing this issue. Can this be taken out of /rotten /stale status?
kops version Version 1.14.0
/remove-lifecycle rotten
Still seeing this issue.
Version 1.17.0-alpha.1 (git-501baf7e5) Terraform v0.12.16
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close
@fejta-bot: Closing this issue.
Still seeing this issue.
Kops - Version 1.18.2 Terraform - v0.13.4
Hi team,
I am still getting the certificate issue.
I am using kops version v1.19 and Terraform version v0.13.5.
Currently experiencing this
kops
-> Version 1.18.2 (git-84495481e4)
terraform
-> v0.13.5
Will try using terraform:0.14.0
Tried terraform:0.14.0. It still picks up the .k8s.local name instead of the correct ELB address.
/reopen
@kmichailg: You can't reopen an issue/PR unless you authored it or you are a collaborator.
If you create a cluster with both the terraform and gossip options enabled, all kubectl commands will fail.
How to reproduce the error
My environment
Setting up the cluster
Spoiler alert: creating the self-signed certificate before creating the actual Kubernetes cluster is the root cause of this issue. Please read on to see why.
Scenario 1. Looking up a non-existent domain
This is basically because of an erroneous ~/.kube/config file. If you run kops create cluster with both the terraform and gossip options enabled, you'll get a wrong ~/.kube/config file. Let's manually correct that file. Alternatively, you'll get a good config file if you explicitly export the configuration once again.
Then the non-existent domain will be replaced with the DNS name of the master nodes' ELB.
And you'll end up in scenario 2 when you retry.
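A minimal sketch of that re-export plus a sanity check, assuming the cluster name kops-temp.k8s.local used below:
kops export kubecfg kops-temp.k8s.local
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
# should now print the ELB address rather than https://api.kops-temp.k8s.local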
Scenario 2. Invalid certificate
This is simply because the DNS name of the ELB is not included in the certificate. This scenario occurs only when you create the cluster with the terraform option enabled. If you create the cluster with only the gossip option, not using the terraform target, the self-signed certificate properly contains the DNS name of the ELB.
(Sorry for the Korean, this is the list of DNS alternative names of the certificate.)
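To see this for yourself, the subject alternative names of the live API certificate can be inspected with openssl (replace <api-elb-dns-name> with the ELB address from your kubeconfig):
echo | openssl s_client -connect <api-elb-dns-name>:443 2>/dev/null | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'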
The only way to work around this problem is forcing "kops-temp.k8s.local" to point to the proper IP address by manually editing /etc/hosts (a rough sketch follows below), which is undesirable for many people. I'm not very familiar with kops internals, but I expect a sizable change is needed to properly fix this issue. Maybe using AWS Certificate Manager could be a solution. (#834) Any ideas?
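For completeness, a rough sketch of that /etc/hosts workaround, assuming the API hostname api.kops-temp.k8s.local from the scenario above (note that ELB IPs rotate, so this pin goes stale):
# resolve the ELB once and pin the gossip API name to that address
ELB_IP=$(dig +short <api-elb-dns-name> | head -n1)
echo "$ELB_IP api.kops-temp.k8s.local" | sudo tee -a /etc/hosts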