kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Cannot use terraform and gossip-based cluster at the same time #2990

Closed · simnalamburt closed this issue 3 years ago

simnalamburt commented 7 years ago

If you create a cluster with both the terraform and gossip options enabled, all kubectl commands will fail.


How to reproduce the error

My environment

$ uname -a
Darwin *****.local 16.6.0 Darwin Kernel Version 16.6.0: Fri Apr 14 16:21:16 PDT 2017; root:xnu-3789.60.24~6/RELEASE_X86_64 x86_64

$ kops version
Version 1.6.2

$ terraform version
Terraform v0.9.11

$ aws --version
aws-cli/1.11.117 Python/2.7.10 Darwin/16.6.0 botocore/1.5.80

Setting up the cluster

# Create RSA key
ssh-keygen -f shared_rsa -N ""

# Create S3 bucket
aws s3api create-bucket \
  --bucket=kops-temp \
  --region=ap-northeast-1 \
  --create-bucket-configuration LocationConstraint=ap-northeast-1

# Create terraform code; some resources,
# including *certificates*, will be stored in S3
kops create cluster \
  --name=kops-temp.k8s.local \
  --state=s3://kops-temp \
  --zones=ap-northeast-1a,ap-northeast-1c \
  --ssh-public-key=./shared_rsa.pub \
  --out=. \
  --target=terraform

# Create cluster
terraform init
terraform plan -out ./create-cluster.plan
terraform show ./create-cluster.plan | less -R # final review
terraform apply ./create-cluster.plan # fire

# Done

Spoiler Alert: Creating the self-signed certificate before creating the actual Kubernetes cluster is the root cause of this issue. Please continue to see why.

Scenario 1. Looking up non-existent domain

$ kubectl get nodes
Unable to connect to the server: dial tcp: lookup api.kops-temp.k8s.local on 8.8.8.8:53: no such host

This is basically because of an erroneous ~/.kube/config file. If you run kops create cluster with both the terraform and gossip options enabled, you'll get a wrong ~/.kube/config file.

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ABCABCABCABC...
    server: https://api.kops-temp.k8s.local
            # !!!! There's no such domain named "api.kops-temp.k8s.local"
  name: kops-temp.k8s.local
# ...

Let's manually correct that file. Alternatively, you'll get a good config file if you explicitly export the configuration once again.

kops export kubecfg kops-temp.k8s.local --state s3://kops-temp

Then the non-existent domain will be replaced with the DNS name of the masters' ELB.

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ABCABCABCABC...
    server: https://api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com
  name: kops-temp.k8s.local
# ...

And you'll end up in Scenario 2 when you retry.

Scenario 2. Invalid certificate

$ kubectl get nodes
Unable to connect to the server: x509: certificate is valid for api.internal.kops-temp.k8s.local, api.kops-temp.k8s.local, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, not api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com

This is simply because the DNS name of the ELB is not included in the certificate. This scenario occurs only when you create the cluster with the terraform target enabled. If you create the cluster with only the gossip option, without the terraform target, the self-signed certificate will properly contain the DNS name of the ELB.

(Screenshot: the certificate's list of DNS alternative names; the original image is in Korean.)
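For reference, one way to check which names the served certificate actually covers is to inspect its Subject Alternative Names directly (using the placeholder ELB hostname from this reproduction):

# print the SANs presented by the API ELB
echo | openssl s_client -connect api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com:443 2>/dev/null \
    | openssl x509 -noout -text \
    | grep -A1 'Subject Alternative Name'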

The only way to work around this problem is to force "api.kops-temp.k8s.local" to resolve to the proper IP address by manually editing /etc/hosts, which is undesirable for many people.

# Recover ~/.kube/config
perl -i -pe \
    's|api-kops-temp-k8s-local-nrvnqsr-666666\.ap-northeast-1\.elb\.amazonaws\.com|api.kops-temp.k8s.local|g' \
    ~/.kube/config

# Hack /etc/hosts
host api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com |
    perl -pe 's|^.* address (.*)$|\1\tapi.kops-temp.k8s.local|g' |
    sudo tee -a /etc/hosts

# This will succeed
kubectl get nodes


I'm not very familiar with kops internals, but I expect a huge change is needed to properly fix this issue. Maybe using AWS Certificate Manager could be a solution (#834). Any ideas?

gregd72002 commented 7 years ago

I can reproduce the problem using kops 1.7.5

pastjean commented 7 years ago

If you run kops update cluster $NAME --target=terraform after the terraform apply, it will actually generate a new certificate. Run kops export kubecfg $NAME after that and you get a working setup. Although, I know, it's not exactly straightforward.
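A minimal sketch of that sequence, reusing the cluster name and state bucket from the reproduction above:

# after the initial terraform apply (the ELB now exists)
kops update cluster kops-temp.k8s.local --state=s3://kops-temp --target=terraform --out=.

# re-export the kubeconfig so the server entry points at the ELB
kops export kubecfg kops-temp.k8s.local --state=s3://kops-temp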

thedonvaughn commented 7 years ago

I also had the same reported issue. I took @pastjean's advice and re-ran kops update cluster $NAME --target=terraform and then kops export kubecfg $NAME. While this updated my kube config with the proper DNS name of the API ELB, I still have an invalid cert error.

thedonvaughn commented 7 years ago

Never mind. I have to create the cluster with --target=terraform first. After running terraform apply and then updating, I get a new master cert. I had been creating the cluster, then updating with --target=terraform, then applying, then re-running the update, which didn't generate a new cert. So my bad on the order. Issue is resolved. Thanks.

chrislovecnm commented 7 years ago

Closing!

sybeck2k commented 7 years ago

The bug is still valid for me, and @pastjean's solution is not working for me. I'm using an S3 remote store; here are my versions:

$ uname -a
Darwin xxxxx 17.0.0 Darwin Kernel Version 17.0.0: Thu Aug 24 21:48:19 PDT 2017; root:xnu-4570.1.46~2/RELEASE_X86_64 x86_64
$ kops version
Version 1.7.1
$ terraform version
Terraform v0.10.8
$ aws --version
aws-cli/1.11.137 Python/2.7.10 Darwin/17.0.0 botocore/1.6.4

To reproduce, I do the same steps as @simnalamburt reported. I then run kops update cluster $NAME --target=terraform --out=. and terraform apply, but I still have an invalid certificate (it does not get the alias of the AWS LB).

Checking the S3 store, in the folder <cluster-name>/pki/issued/master, I can see that a first certificate is created when the cluster is created with kops, and a second is added after the kops update request. The second certificate does include the LB DNS name, but it is not deployed onto the master node(s).
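To see this, the issued certificates can be listed straight from the state store (bucket and cluster name are the placeholders from the reproduction above):

# both the original and the re-issued "master" certificates show up under this prefix
aws s3 ls s3://kops-temp/kops-temp.k8s.local/pki/issued/master/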

Here is the update command output:

kops update cluster $NAME --target=terraform --out=.
I1106 18:10:54.285184    8239 apply_cluster.go:420] Gossip DNS: skipping DNS validation
I1106 18:10:55.044907    8239 executor.go:91] Tasks: 0 done / 83 total; 38 can run
I1106 18:10:55.467860    8239 executor.go:91] Tasks: 38 done / 83 total; 15 can run
I1106 18:10:55.469345    8239 executor.go:91] Tasks: 53 done / 83 total; 22 can run
I1106 18:10:56.032321    8239 executor.go:91] Tasks: 75 done / 83 total; 5 can run
I1106 18:10:56.691785    8239 vfs_castore.go:422] Issuing new certificate: "master"
I1106 18:10:57.160535    8239 executor.go:91] Tasks: 80 done / 83 total; 3 can run
I1106 18:10:57.160867    8239 executor.go:91] Tasks: 83 done / 83 total; 0 can run
I1106 18:10:57.261829    8239 target.go:269] Terraform output is in .
I1106 18:10:57.529372    8239 update_cluster.go:247] Exporting kubecfg for cluster
Kops has set your kubectl context to ci5-test.k8s.local

Terraform output has been placed into .

Changes may require instances to restart: kops rolling-update cluster

As you can see, the log reports that the certificate is generated. I've tried doing a kops rolling-update cluster --cloudonly as recommended, but the output is No rolling-update required.

jlaswell commented 7 years ago

@sybeck2k, we have also experienced this issue as of a few hours ago.

You will need to run kops rolling-update cluster --cloudonly --force --yes to force an update. This can take a while depending on the size of the cluster, but we have found that trying to manually set the --master-interval or --node-interval can prevent nodes from reaching a Ready state. I suggest just grabbing some ☕️ and letting the default interval do its thing.

It is still a workaround for now, but we have found it to be reliably successful.
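Putting the pieces of this thread together, the full workaround looks roughly like this (with $NAME as in the comments above):

kops update cluster $NAME --target=terraform --out=.         # re-issues the master cert including the ELB name
terraform apply                                              # apply any resulting changes
kops rolling-update cluster $NAME --cloudonly --force --yes  # roll the instances so they pick up the new cert
kops export kubecfg $NAME                                    # make sure ~/.kube/config points at the ELB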

chrislovecnm commented 7 years ago

This should be fixed in master, if someone wants to test it; otherwise, wait for the 1.8 beta release.

sybeck2k commented 7 years ago

@jlaswell thanks a lot! I can confirm your workaround works for kops 1.7.1. Could anyone point me to the details of exactly what is pulled from the state store, and when? In the doc I found this information

jlaswell commented 7 years ago

Not sure about what is used when. I would bet that looking through some of the source code is best for that, but I do know that you can look in the S3 bucket used for the state store if you are using AWS. We've perused that a few times to get an understanding.

shashanktomar commented 6 years ago

@chrislovecnm I can still reproduce this in 1.8.0-beta.1. Both steps are still required:

chrislovecnm commented 6 years ago

@shashanktomar I would assume the workflow is

  1. kops update cluster --target=terraform
  2. terraform apply (not sure the syntax is correct)
  3. kops rolling-update cluster

What does rolling update show?

It would be a bug if the update does not create the same hash in the Terraform code that we generate in the direct target code path.

andresguisado commented 6 years ago

@chrislovecnm I can reproduce this in 1.8.0-beta.1 as well. As @shashanktomar said, both steps are still required:

Here is the rolling update output:

Using cluster from kubectl context: dev.xxx.k8s.local

NAME            STATUS  NEEDUPDATE  READY   MIN MAX
master-eu-west-2a   Ready   0       1   1   1
nodes           Ready   0       2   2   2
W1115 15:28:50.519884   16811 instancegroups.go:264] Not draining cluster nodes as 'cloudonly' flag is set.
I1115 15:28:50.519898   16811 instancegroups.go:352] Stopping instance "i-xxx", in AWS ASG "master-eu-west-2a.masters.dev.xxx.k8s.local".

W1115 15:33:50.723093   16811 instancegroups.go:293] Not validating cluster as cloudonly flag is set.
W1115 15:33:50.723189   16811 instancegroups.go:264] Not draining cluster nodes as 'cloudonly' flag is set.
I1115 15:33:50.723203   16811 instancegroups.go:352] Stopping instance "i-xxx", in AWS ASG "nodes.dev.xxx.k8s.local".
W1115 15:35:50.930041   16811 instancegroups.go:293] Not validating cluster as cloudonly flag is set.
W1115 15:35:50.930978   16811 instancegroups.go:264] Not draining cluster nodes as 'cloudonly' flag is set.
I1115 15:35:50.931003   16811 instancegroups.go:352] Stopping instance "i-xxx", in AWS ASG "nodes.dev.xxx.k8s.local".
W1115 15:37:51.117159   16811 instancegroups.go:293] Not validating cluster as cloudonly flag is set.
I1115 15:37:51.117407   16811 rollingupdate.go:174] Rolling update completed!

tspacek commented 6 years ago

I reproduced this in 1.8.0 after kops create cluster ... --target=terraform and terraform apply

I can confirm that running the following fixed it:

kops update cluster $NAME --target=terraform
kops rolling-update cluster $NAME --cloudonly --force --yes

chrislovecnm commented 6 years ago

More detail please

bashims commented 6 years ago

I am having the same problem here (see version info below). The workaround does indeed work, but it takes way too long to complete. It would be great if this could be resolved.

kops version

Version 1.8.0 (git-5099bc5)      

mbolek commented 6 years ago

As above, this is still broken in: Version 1.8.1 (git-94ef202)

Generally, as I understand it, the workaround flow is:

kops create cluster $NAME --target=terraform --out=.
terraform apply
kops rolling-update cluster $NAME --cloudonly --force --yes   # around 20 minutes with 3 masters and 3 nodes

and then it should work, but I also had to re-export the kops config with kops export kubecfg $NAME; now it works for both kops and kubectl. Are there any ideas on how to resolve this? I was also wondering whether the gossip-based approach is, in general, inferior to the DNS approach.

srolel commented 6 years ago

The fix using rolling-update did not work for me.

Version 1.9.0 (git-cccd71e67)

mbolek commented 6 years ago

@Mosho1 did you export the config? Can you check if the server in the ~/.kube/config points to an external endpoint?

srolel commented 6 years ago

@mbolek yeah, it did, though I have already brought down that cluster and used kops directly instead.

Hermain commented 6 years ago

Fyi: Still broken in 1.9.0

1ambda commented 6 years ago

The same in 1.9.1. I am running a gossip-based cluster (.local) and was able to work around this issue by following the comments above.

# assume that you have already applied terraform once and the ELB for the kube API exists on AWS

# make sure to export kubecfg before applying terraform, so that the launch configuration (LC) is built with the exported config
kops export kubecfg --name $NAME
kops update cluster $NAME --target=terraform --out=.
terraform plan
terraform apply 

kops rolling-update cluster $NAME --cloudonly --force --yes

If it keeps failing, you might add insecure-skip-tls-verify: true to the cluster entry in ~/.kube/config, but that is usually not recommended.
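For reference, that flag lives in the cluster entry of ~/.kube/config, roughly as below; kubectl refuses the combination of a configured CA and the insecure flag, so certificate-authority-data has to be removed (debugging only):

apiVersion: v1
clusters:
- cluster:
    # certificate-authority-data removed; kubectl won't accept it together with the insecure flag
    insecure-skip-tls-verify: true
    server: https://api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com
  name: kops-temp.k8s.local
# ...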

gtmtech commented 6 years ago

Who wants to do a rolling-update straight after provisioning a cluster? kops should provision the correct server entries in the kubectl config file in the first place. Given that kops creates a DNS entry just fine with a sensible name, e.g. api.cluster.mydomain.net (as an alias record to the ELB/ALB), why isn't kops export kubecfg using the alias record for the server instead of the ELB? This alias record is already in the certificate, as the OP says, and if kops generates a kubectl config entry using server: https://[alias record], then it works just fine and no rolling-updates or post-shenanigans are needed.

This should work out of the box
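For what it's worth, manually pointing an existing kubeconfig entry at such an alias record is a one-liner (cluster and domain names are illustrative):

kubectl config set-cluster cluster.mydomain.net --server=https://api.cluster.mydomain.net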

mbolek commented 6 years ago
# kops version
Version 1.9.2 (git-cb54c6a52)

OK... so I thought I had something, but it seems the issue persists. You need to export the config to fix the API server endpoint, and you need to roll the master to fix the SSL cert.
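That is, with this version the two manual steps being described are still:

kops export kubecfg $NAME                                    # fixes the API server endpoint in ~/.kube/config
kops rolling-update cluster $NAME --cloudonly --force --yes  # rolls the master so it serves the re-issued cert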

drzero42 commented 6 years ago

Another workaround that does not require waiting to roll the master(s) is to create the ELB first, then update the cluster, and then do the rest of the terraform apply.
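A rough sketch of that ordering, pieced together from the description and the follow-up comments below (the ELB resource address is the one quoted there; the Terraform code is assumed to come from kops create cluster ... --target=terraform):

# 1. create only the API ELB
terraform apply -target aws_elb.api-CLUSTERNAME-k8s-local
# 2. re-run kops update so the master cert is issued with the ELB DNS name in it
kops update cluster $NAME --target=terraform --out=.
# 3. apply the rest of the resources
terraform apply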

mshivanna commented 6 years ago

@mbolek the issue indeed persists. kops version: Version 1.10.0

korenDevops commented 6 years ago

@drzero42 - Thanks for the tip! It works, but you forgot to add -target on the 2nd apply step, i.e.:

Create ELB: terraform apply aws_elb.api-CLUSTERNAME-k8s-local

should be: Create ELB: terraform apply -target aws_elb.api-CLUSTERNAME-k8s-local

drzero42 commented 6 years ago

@CosmoPennypacker Absolutely right, good catch. I've updated my comment so people can more easily copy/paste from it ;)

elliottgorrell commented 5 years ago

For our creation script we have implemented this workaround. However, since we don't care about downtime, we do the following to speed up creation from ~20 minutes to ~5 minutes:

kops rolling-update cluster --cloudonly --force --master-interval=1s --node-interval=1s --yes

# Wait until all nodes come back online before marking complete
until kops validate cluster --name ${CLUSTER_NAME} > /dev/null
do
  echo "\033[1;93mWaiting until cluster comes back online\033[0m"
  sleep 5
done

echo "\033[1;92mCluster Creation Complete!\033[0m"

teagy commented 5 years ago

Technically, if you template the steps that generate the API certificate, you could feed the ELB DNS name output by Terraform to the script before the certificate is initially generated and stored in the state store.

tkatrichenko commented 5 years ago

@teagy-cr and do you know how to do that?

jfreymann commented 5 years ago

This is still a valid issue; I'm using the workaround outlined above.

MBalazs90 commented 5 years ago

This issue still persists...

bnopacheco commented 5 years ago

This issue still persists... Kops version = 1.11.1

kops validate cluster

Using cluster from kubectl context: milkyway.k8s.local

Validating cluster milkyway.k8s.local

unexpected error during validation: error listing nodes: Get https://api.milkyway.k8s.local/api/v1/nodes: dial tcp: lookup api.milkyway.k8s.local on 192.168.88.1:53: no such host

The configuration generated by kops and terraform continues to point the API endpoint at the .k8s.local DNS name rather than at the ELB.
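A quick way to check which endpoint the current kubectl context actually uses:

kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'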

mbolek commented 5 years ago

@lieut-data, @gabrieljackson

# kops validate cluster dev2.k8s.local
Validating cluster dev2.k8s.local

unexpected error during validation: error listing nodes: Get https://api.dev2.k8s.local/api/v1/nodes: dial tcp: lookup api.dev2.k8s.local on 192.168.1.1:53: no such host
# kops version
Version 1.12.1 (git-e1c317f9c)

:( and then

kops update cluster dev2.k8s.local --target=terraform --out=.
I0528 09:50:11.688046    5454 apply_cluster.go:559] Gossip DNS: skipping DNS validation
I0528 09:50:13.463433    5454 executor.go:103] Tasks: 0 done / 95 total; 46 can run
I0528 09:50:14.477643    5454 executor.go:103] Tasks: 46 done / 95 total; 27 can run
I0528 09:50:15.245765    5454 executor.go:103] Tasks: 73 done / 95 total; 18 can run
I0528 09:50:16.011278    5454 executor.go:103] Tasks: 91 done / 95 total; 3 can run
I0528 09:50:17.764038    5454 vfs_castore.go:729] Issuing new certificate: "master"
I0528 09:50:19.326700    5454 executor.go:103] Tasks: 94 done / 95 total; 1 can run
I0528 09:50:19.327570    5454 executor.go:103] Tasks: 95 done / 95 total; 0 can run
I0528 09:50:19.348221    5454 target.go:312] Terraform output is in .
I0528 09:50:19.573892    5454 update_cluster.go:291] Exporting kubecfg for cluster
kops has set your kubectl context to dev2.k8s.local

Terraform output has been placed into .

Changes may require instances to restart: kops rolling-update cluster

So it still has to recreate the master cert

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

lmserrano commented 5 years ago

This issue still persists.

# kops version
Version 1.13.0 (git-be5fb9019)

fejta-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

rplahn commented 5 years ago

Still seeing this issue. Can this be taken out of /rotten /stale status? kops version Version 1.14.0

Stelminator commented 5 years ago

> Still seeing this issue. Can this be taken out of /rotten /stale status? kops version Version 1.14.0

/remove-lifecycle rotten

linecolumn commented 4 years ago

Still seeing this issue.

Version 1.17.0-alpha.1 (git-501baf7e5)
Terraform v0.12.16

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 4 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/kops/issues/2990#issuecomment-619654290).
esimonov commented 4 years ago

Still seeing this issue.

Kops - Version 1.18.2
Terraform - v0.13.4

djha736 commented 3 years ago

Hi Team,

I am still getting the certificate issue.

I am using kops version v1.19 and Terraform version v0.13.5.

kmichailg commented 3 years ago

Currently experiencing this:

kops -> Version 1.18.2 (git-84495481e4)
terraform -> v0.13.5

Will try using terraform:0.14.0

kmichailg commented 3 years ago

Tried terraform:0.14.0

Still capturing the .k8s.local instead of the correct ELB address.

/reopen

k8s-ci-robot commented 3 years ago

@kmichailg: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes/kops/issues/2990#issuecomment-738682342).