kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Cannot use terraform and gossip-based cluster at the same time #2990

Closed · simnalamburt closed this issue 3 years ago

simnalamburt commented 7 years ago

If you create a cluster with both the terraform and gossip options enabled, all kubectl commands will fail.


How to reproduce the error

My environment

$ uname -a
Darwin *****.local 16.6.0 Darwin Kernel Version 16.6.0: Fri Apr 14 16:21:16 PDT 2017; root:xnu-3789.60.24~6/RELEASE_X86_64 x86_64

$ kops version
Version 1.6.2

$ terraform version
Terraform v0.9.11

$ aws --version
aws-cli/1.11.117 Python/2.7.10 Darwin/16.6.0 botocore/1.5.80

Setting up the cluster

# Create RSA key
ssh-keygen -f shared_rsa -N ""

# Create S3 bucket
aws s3api create-bucket \
  --bucket=kops-temp \
  --region=ap-northeast-1 \
  --create-bucket-configuration LocationConstraint=ap-northeast-1

# Generate Terraform code; some resources,
# including *certificates*, will be stored in S3
kops create cluster \
  --name=kops-temp.k8s.local \
  --state=s3://kops-temp \
  --zones=ap-northeast-1a,ap-northeast-1c \
  --ssh-public-key=./shared_rsa.pub \
  --out=. \
  --target=terraform

# Create cluster
terraform init
terraform plan -out ./create-cluster.plan
terraform show ./create-cluster.plan | less -R # final review
terraform apply ./create-cluster.plan # fire

# Done

Spoiler alert: creating the self-signed certificate before the actual Kubernetes cluster is created is the root cause of this issue. Read on to see why.

Scenario 1. Looking up non-existent domain

$ kubectl get nodes
Unable to connect to the server: dial tcp: lookup api.kops-temp.k8s.local on 8.8.8.8:53: no such host

This is basically because of an erroneous ~/.kube/config file. If you run kops create cluster with both the terraform and gossip options enabled, you'll get a wrong ~/.kube/config file.

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ABCABCABCABC...
    server: https://api.kops-temp.k8s.local
            # !!!! There's no such domain named "api.kops-temp.k8s.local"
  name: kops-temp.k8s.local
# ...

Let's manually correct that file. Alternatively, you'll get a proper config file if you explicitly export the configuration again.

kops export kubecfg kops-temp.k8s.local --state s3://kops-temp

Then the non-existent domain will be replaced with the DNS name of the masters' ELB.

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ABCABCABCABC...
    server: https://api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com
  name: kops-temp.k8s.local
# ...
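If you prefer to fix the file by hand instead of re-exporting, the same change can also be made with kubectl itself. A sketch, assuming the ELB DNS name shown above:

# Point the existing cluster entry at the masters' ELB
kubectl config set-cluster kops-temp.k8s.local \
    --server=https://api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com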

And you'll end up in Scenario 2 when you retry.

Scenario 2. Invalid certificate

$ kubectl get nodes
Unable to connect to the server: x509: certificate is valid for api.internal.kops-temp.k8s.local, api.kops-temp.k8s.local, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, not api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com

This is simply because the DNS name of the ELB is not included in the certificate. This scenario occurs only when you create the cluster with the terraform option enabled. If you create the cluster with only the gossip option, without the terraform target, the self-signed certificate properly contains the DNS name of the ELB.

(Screenshot, 2017-07-19: the certificate's list of DNS alternative names. Apologies for the Korean UI.)
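The same list can be checked from the command line by reading the Subject Alternative Names off the certificate the API server presents. A sketch with openssl, using the example ELB endpoint above:

# Dump the DNS alternative names the API server certificate was issued with
echo | openssl s_client -connect api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com:443 2>/dev/null |
    openssl x509 -noout -text |
    grep -A1 'Subject Alternative Name'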

The only way to work around this problem is to force "api.kops-temp.k8s.local" to point to the proper IP address by manually editing /etc/hosts, which is undesirable for many people.

# Recover ~/.kube/config
perl -i -pe \
    's|api-kops-temp-k8s-local-nrvnqsr-666666\.ap-northeast-1\.elb\.amazonaws\.com|api.kops-temp.k8s.local|g' \
    ~/.kube/config

# Hack /etc/hosts
host api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com |
    perl -pe 's|^.* address (.*)$|\1\tapi.kops-temp.k8s.local|g' |
    sudo tee -a /etc/hosts

# This will succeed
kubectl get nodes


I'm not very familiar with kops internals, but I expect a substantial change will be needed to fix this issue properly. Maybe using AWS Certificate Manager could be a solution (#834). Any ideas?

olemarkus commented 3 years ago

/reopen

k8s-ci-robot commented 3 years ago

@olemarkus: Reopened this issue.

In response to [this](https://github.com/kubernetes/kops/issues/2990#issuecomment-738687785):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/kops/issues/2990#issuecomment-753593518):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
kmichailg commented 3 years ago

Tested with terraform:0.14.3 and kops Version 1.18.2 (git-84495481e4)

The kubeconfig still captures the .k8s.local address instead of the correct ELB address. The workarounds don't seem to work.

Validation failed: unexpected error during validation: error listing nodes: an error on the server ("") has prevented the request from succeeding (get nodes)
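A sanity check that makes the mismatch visible is to compare the server entry in ~/.kube/config against the API ELB that actually exists in the account. A sketch, assuming the AWS CLI is configured for the cluster's region (the JMESPath filter on the api- prefix is just a convenience):

# List the DNS names of the API ELBs kops created (classic ELBs named api-<cluster>)
aws elb describe-load-balancers \
    --query 'LoadBalancerDescriptions[?starts_with(LoadBalancerName, `api-`)].[LoadBalancerName,DNSName]' \
    --output table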

Tried re-exporting the ELB endpoint:

kops export kubecfg --name ${CLUSTER_NAME} && \
kops update cluster ${CLUSTER_NAME} \
  --out=. \
  --target=terraform && \
terraform apply -auto-approve && \
kops rolling-update cluster ${CLUSTER_NAME} --cloudonly --force --yes

Doing this occasionally makes the master node appear stuck in the initializing status on AWS, but it eventually becomes okay.
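One way to tell when the control plane has actually come back after such a rolling update is to poll kops validate until it succeeds. A rough sketch, assuming CLUSTER_NAME and KOPS_STATE_STORE are already set and the kubeconfig has been re-exported as above so kubectl can reach the API (the 30s interval is arbitrary):

# Poll until the cluster validates (Ctrl-C to give up)
until kops validate cluster --name ${CLUSTER_NAME}; do
    sleep 30
done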

Also tried creating the gateway and ELB first, before running the full terraform apply, with the same result:

kops create ...
...

terraform apply -target=aws_internet_gateway.${CLUSTER_PREFIX}-k8s-local -auto-approve && \
terraform apply -target=aws_elb.api-${CLUSTER_PREFIX}-k8s-local -auto-approve

kops update cluster \
  --out=. \
  --target=terraform

terraform apply -auto-approve && \
kops rolling-update cluster --cloudonly --force --master-interval=1s --node-interval=1s --yes

I am using t3a.small for the nodes, t3a.medium for the master node.

kmichailg commented 3 years ago

Still experiencing this with gossip-based clusters. Abandoning infrastructure-as-code (via terraform) for now; I'll just deploy via kops only.

Hopefully you'll reopen this for tracking. Thank you!

alen-z commented 3 years ago

Issue still persists. Great feature, but not usable at the moment.

alen-z commented 3 years ago

/reopen

k8s-ci-robot commented 3 years ago

@alen-z: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes/kops/issues/2990#issuecomment-805064974):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
rifelpet commented 3 years ago

/remove-lifecycle rotten

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 3 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/kops/issues/2990#issuecomment-920372390):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.