kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Cluster not entirely spun up (API Record Not Created) #859

Closed petergarbers closed 7 years ago

petergarbers commented 7 years ago

I'm trying to set up a cluster in a new AWS account using kops, following this guide.

I have noticed that two records aren't being created: api.clustername.domain.com and api.internal.clustername.domain.com. I only have the DNS records for the etcd service.

As a result, I am unable to connect to my cluster using kubectl. From what I can tell the master and the other nodes are running; however, manually creating these domain records has been unfruitful, so I suspect there may be other issues.

yissachar commented 7 years ago

@justinsb @chrislovecnm I'm seeing a ton of reports with this issue in the past week on Slack. Anything changed recently that would cause this?

petergarbers commented 7 years ago

I feel like I should mention that mine resolved itself after ~20 minutes, but I'm leaving this open as other people are still seeing the issue.

shrabok-surge commented 7 years ago

I experienced this when I didn't have my NS records publishing the subdomain I was using. When the NS records were added the api records were created.

tomdavidson commented 7 years ago

@shrabok-surge can you be more specific? Maybe even use the example from http://kubernetes.io/docs/getting-started-guides/kops/

When you say subdomain, do you mean useast1 or dev in useast1.dev.example.com?

tomdavidson commented 7 years ago

Checking out the Terraform output, there are no Route 53 resources. How are the cluster subdomains supposed to be created?

juliendf commented 7 years ago

@petergarbers From a machine within the same VPC, are you able to resolve your domain: dig ns clustername.domain.com?

cyberroadie commented 7 years ago

@CliMz I have the same problem (no api domain names, and do have etcd names). I successfully resolved dig ns clustername.domain.com from a machine within the same VPC

cyberroadie commented 7 years ago

I waited over an hour but still no api domain :-/

I ssh'd into the admin node, and in /var/log/kube-apiserver.log I see: controller.go:88] Unable to perform initial IP allocation check: unable to refresh the service IP block: client: etcd cluster is unavailable or misconfigured

Is this relevant?

chrislovecnm commented 7 years ago

@cyberroadie please add your install command, kops version, and AWS region.

cyberroadie commented 7 years ago

Solved: found out what the problem was: misconfiguration of the DNS subdomain. Logging in to the master node and looking in the /var/log/etcd.log file, I could see the region.dev.xx.xx domain didn't get resolved. This prevented the etcd server from starting, and that in turn prevented the API server from starting because it couldn't connect to the etcd cluster.
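The diagnosis above generalizes: when the api records are missing, check /var/log/etcd.log and /var/log/kube-apiserver.log on the master. A sketch of the check - the ssh step is only indicated in comments (hostnames are placeholders), and locally it runs against a sample file containing the error line quoted earlier in this thread:

```shell
# On the master (reached e.g. via 'ssh admin@<master-ip>'), the logs to check are:
#   /var/log/etcd.log            - DNS/startup failures for the etcd members
#   /var/log/kube-apiserver.log  - API server failing to reach etcd

# Sample of the tell-tale line quoted earlier in this thread:
cat > kube-apiserver.log.sample <<'EOF'
controller.go:88] Unable to perform initial IP allocation check: unable to refresh the service IP block: client: etcd cluster is unavailable or misconfigured
EOF

# If this matches, the API server is blocked on etcd, which in turn is usually
# blocked on DNS resolution of the etcd record names:
grep -c 'etcd cluster is unavailable' kube-apiserver.log.sample
# prints: 1
```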

tomdavidson commented 7 years ago

@cyberroadie is this a bug, or is this a subdomain you configured for the cluster? For example, I have done exactly as described in http://kubernetes.io/docs/getting-started-guides/kops/ but have the same symptoms you had.

cyberroadie commented 7 years ago

@tomdavidson It's not a bug. I made a mistake setting up the subdomain in route53

tomdavidson commented 7 years ago

@cyberroadie We are not the only ones so maybe we are all making the same mistake. Is there an issue with the steps in http://kubernetes.io/docs/getting-started-guides/kops/ ?

cyberroadie commented 7 years ago

I can describe the steps I took: I was creating a test run for setting up a Kubernetes cluster. As described, if you're not in control of the main domain (e.g. testing.net), you can create a hosted zone for the subdomain (e.g. dev.testing.net). This will be the case in a future project. But for now, as a test, I added two hosted zones in Route 53 with a domain I control myself: one for testing.net and another for dev.testing.net. This didn't work: resolving it with dig ns dev.testing.net returned the DNS server of testing.net and couldn't find dev.testing.net. So for the test I dropped the dev.testing.net hosted zone and let everything be added to the testing.net hosted zone. I gave priority to testing the cluster first; I still have to figure out how to do the subdomain hosted zone. Re-reading the documentation now, I have to say I'm slightly puzzled about how to set up the NS records correctly in this scenario.

cyberroadie commented 7 years ago

PS: the etcd domain names were added to the dev.testing.net hosted zone

tomdavidson commented 7 years ago

Yes, I'm confused about the NS too. In my case I delegated a zone to Route 53 - c.b.a.edu. Then with kops create I used a name such as tom.c.b.a.edu. etcd records were created for tom.c.b.a.edu, but nothing else.

chrislovecnm commented 7 years ago

Can I get a status on where this issue is at? Not following the comments ;)

tomdavidson commented 7 years ago

@chrislovecnm I have not been able to confirm the NS is configured as needed by kops. I have done exactly as described in http://kubernetes.io/docs/getting-started-guides/kops/ but the directions are not clear.

This is potentially all user error / unclear docs, but until we can clarify the needed config we cannot verify it is solely user error.

chrislovecnm commented 7 years ago

Take a look at this http://blog.couchbase.com/2016/november/multimaster-kubernetes-cluster-amazon-kops

We have an issue to drop something like this into our docs

cyberroadie commented 7 years ago

I'm going to have a play around this weekend to see if I can create more clarity. For the project I'm currently working on, it would be ideal if every developer team had control over its own subdomain while the main domain is controlled separately (both in Route 53).

Test scenario:

So far we know:

Acceptance criteria

Outcome:

PS You can find me on slack: #kubernetes-users

tomdavidson commented 7 years ago

Deleted my Route 53 zones and created new ones. This time the api record was created. FYI, the default instance limit on a new AWS account kept my Auto Scaling group from populating - in case there is a common-problems section in the new docs.

chrislovecnm commented 7 years ago

@tomdavidson there is a troubleshooting guide, feel free to update

cyberroadie commented 7 years ago

Update: So I did my test scenario and here are the results: I will use example.com as an example domain:

1) Created two separate hosted zones in AWS Route53: One for example.com and one for dev.example.com

2) Setting up 'zone delegation' (this is the proper name for it): copy the 4 name servers from the NS record of dev.example.com and create a new NS record in example.com containing these subdomain name servers, giving dev.example.com as the name of the new NS record. After this is done, the parent domain (example.com) will delegate all requests for *.dev.example.com to the correct hosted zone in Route 53. Also, if you create a new domain (e.g. test.dev.example.com) via the AWS Route 53 command-line tool, it will be added to the subdomain's hosted zone.

3) After this, you can set up Kubernetes (with kops) and the new domain names (for etcd, etcd-events, etc.) will be added to the dev.example.com hosted zone.

The advantage of this is that you can hand over the control of a subdomain to another team without losing control over your parent domain.

Regarding the documentation, I think it would be good to add a section on 'delegating DNS requests to a subdomain', explaining that in order to do that you have to create a separate NS record in the parent domain's hosted zone containing the name servers of the subdomain's hosted zone.

One observation: with a setup like this, all new subdomains were added almost instantly; I never had to wait more than a minute to see them appear in the subdomain's hosted zone.
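The delegation step above can also be done with the AWS CLI. A sketch - the zone IDs and name servers are placeholders to be replaced with the values from your own hosted zones, and the `aws` invocations are shown in comments since they need credentials:

```shell
# Sketch of step 2 above with the AWS CLI. Zone IDs and name servers below are
# placeholders - substitute the values from your own dev.example.com hosted zone.

# 1. Look up the subdomain zone's own name servers (requires AWS credentials):
#    aws route53 get-hosted-zone --id Z2SUBDOMAINZONE \
#        --query 'DelegationSet.NameServers'

# 2. Build the change batch that creates the NS record in the PARENT zone:
cat > change-batch.json <<'EOF'
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "dev.example.com",
      "Type": "NS",
      "TTL": 300,
      "ResourceRecords": [
        {"Value": "ns-111.awsdns-11.com"},
        {"Value": "ns-222.awsdns-22.net"},
        {"Value": "ns-333.awsdns-33.org"},
        {"Value": "ns-444.awsdns-44.co.uk"}
      ]
    }
  }]
}
EOF

# 3. Apply it to the parent (example.com) zone:
#    aws route53 change-resource-record-sets --hosted-zone-id Z1PARENTZONE \
#        --change-batch file://change-batch.json
```

The key point is that the ResourceRecords values must be the subdomain zone's name servers from step 1, not the parent zone's.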

cyberroadie commented 7 years ago

This is a good article about DNS zone delegation

And here is an answer on how to do it in AWS (Amazon networking forum)

MichaelJCole commented 7 years ago

Set out to make a PR to fix this. 27 bot-mails, 20 mins of signups, email confirmations, linking accounts, and a contract that needed my address. I've declined to work that hard to give you free work as a PR.

You may find this interesting on configuring Route53 subdomains

So, I had this problem, and I can verify my root cause and the fix:

What happened: I followed this and everything worked up until:

$ kubectl get nodes
Unable to connect to the server: dial tcp: lookup api.cluster.stage.example.io on 127.0.1.1:53: no such host

Why? The api record wasn't being created, as described above.

Root cause: My DNS wasn't configured correctly. I had a parent domain example.io and a subdomain stage.example.io

Fix: Add an NS record to the parent domain for the subdomain, with the subdomain's NS servers, as described in the article above.

Thanks for the awesome tool :-)

ndtreviv commented 7 years ago

For people coming back to this issue: we did everything that @MichaelJCole did in advance of creating our clusters (i.e. created NS records with the sub-domain's NSs in the root domain's hosted zone), and it still took about 20 minutes for everything to come up.

It took a good while for the api* routes to be created, and even then took a while for the DNS records to propagate. kubectl get nodes was returning no such host all the time, then was successful 1 in every 4 times (note: there were 4 name servers), then worked more regularly, then eventually worked every time.

So, be aware:

  1. When you bring your cluster up, it takes a while for the api* A records to be created
  2. When the api* A records are created, it doesn't mean kubectl will work instantly
  3. When kubectl does start working, it may seemingly be intermittent whilst the DNS propagates
  4. Eventually, everything will be fine. Kick back, cool off, (open|pour) a (cold|hot) one
  5. This happens even if you're bringing up a second cluster on the same hosted zone (eg: hosted zone = k8s.mydomain.com; one cluster already exists at: useast1.user1.dev.k8s.mydomain.com; second cluster created at: useast.user2.dev.k8s.mydomain.com)
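Given that propagation window, a small wait loop beats re-running kubectl by hand. A sketch: the retry helper is a hypothetical convenience, and a stand-in command that succeeds on the third attempt takes the place of `kubectl get nodes` so the loop's behavior is visible without a cluster:

```shell
# Retry a command until it succeeds, sleeping between attempts.
retry_until_up() {
    cmd=$1; max=$2; delay=$3
    i=1
    while [ "$i" -le "$max" ]; do
        if $cmd; then
            echo "succeeded after $i attempt(s)"
            return 0
        fi
        echo "attempt $i failed; DNS may still be propagating - sleeping ${delay}s"
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Stand-in for 'kubectl get nodes': fails twice, then succeeds.
rm -f counter
flaky() {
    n=$(cat counter 2>/dev/null || echo 0)
    n=$((n + 1)); echo "$n" > counter
    [ "$n" -ge 3 ]
}

# In real use, something like: retry_until_up "kubectl get nodes" 40 30
retry_until_up flaky 10 0
```

With the stand-in, the loop reports two failed attempts and then "succeeded after 3 attempt(s)"; with the real command, intermittent successes during propagation (the 1-in-4 behavior above) simply count as a success.
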

deitch commented 7 years ago

Glad to see I am not the only one with this issue. It did take a while.

How does it create the route53 record? I ran kops --target terraform followed by terraform apply, so everything was created via terraform. Yet there is no route53 resource anywhere in kubernetes.tf or the data/ dir.

jaygorrell commented 7 years ago

Masters manage the DNS - TF wouldn't know how to manage IPs of instances that may get replaced.

deitch commented 7 years ago

@jaygorrell the k8s master creates the route53 entry for api.subdom.mydomain.com? I thought that was set up at creation time by Terraform? Is that why it sometimes takes a while, as opposed to being immediate (which would be the case if Terraform had done it)?

I begin to understand. :-)

2 questions:

  1. Are there k8s docs anywhere about creating the route53 entry?
  2. Since it knows how to integrate with route53, is there any reason it cannot be configured to create a CNAME entry for an ELB when it creates a Service with type=LoadBalancer?

jaygorrell commented 7 years ago

Terraform wouldn't know the IPs before they're assigned to those instances, and api is a round-robin (RR) list of each IP - not a CNAME to an ELB or anything.

  1. There's a little bit on that here, that may give you terminology to dig deeper if you'd like: https://github.com/kubernetes/kops/blob/master/docs/boot-sequence.md#api-server-bringup
  2. That's exactly what https://github.com/Vungle/kube-route53 does -- I'm not sure if anything could be added to kops directly to support that or not, though.

deitch commented 7 years ago

Terraform wouldn't know the IP before it's assigned to those instances

Got that. It was the api. address that I was confused about.

api is a RR list of each IP

each IP? Isn't it a single one? Oh, you mean multiple masters? OK, got that. Makes sense.

There's a little bit on that here Thanks, that does help.

That's exactly what https://github.com/Vungle/kube-route53 does Does it? I am looking specifically for the ELB when the service is type=LoadBalancer. It is that service that is most likely to be exposed to the outside world.

jaygorrell commented 7 years ago

Sorry, I failed at Google. Meant to link this one: https://github.com/wearemolecule/route53-kubernetes

deitch commented 7 years ago

Oooh, now that is interesting. Thanks @jaygorrell!

chrislovecnm commented 7 years ago

@jaygorrell that is also a lot of what dnscontroller does :) You already have that installed on a kops cluster.

jaygorrell commented 7 years ago

Ah yes - didn't realize there was a kops release a few days ago... been waiting on that one!

So this should work now, yes? https://github.com/kubernetes/kops/tree/master/dns-controller

deitch commented 7 years ago

Damn! I accidentally clicked close on this tab and it lost my comment. Don't know what GitHub does to prevent the browser from recognizing that there is entered text, but it is not wise!

OK, recreating:

As far as I can tell, it looks like:

Is that right? If so, I would love to try it.

chrislovecnm commented 7 years ago

@deitch looking to see if there is an issue open for better documentation.

deitch commented 7 years ago

Thanks @chrislovecnm

chrislovecnm commented 7 years ago

https://github.com/kubernetes/kops/issues/1230 <- lets talk there

pl1ght commented 7 years ago

Initially same issue here. What it boils down to is something I'm betting a LOT of us overlooked: when you create the hosted zone for your sub-domain, AWS gives it a DIFFERENT set of NS records from the ones your parent domain has. I initially copied and pasted the parent domain's NS records into the sub-domain's NS record in the parent hosted zone. I then deleted those and pasted in the different NS records from the sub-domain's own hosted zone instead. That fixed the missing records instantly. Just re-ran the update cluster --yes and voila!
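The mixup described above can be checked mechanically: the NS record published in the parent zone must contain exactly the sub-domain zone's name servers, not the parent's. A sketch with made-up name-server values - in practice the two lists would come from dig ns and/or aws route53 get-hosted-zone:

```shell
# NS set the parent zone publishes for the subdomain (placeholder values):
printf '%s\n' ns-111.awsdns-11.com ns-222.awsdns-22.net \
              ns-333.awsdns-33.org ns-444.awsdns-44.co.uk | sort > parent-delegation.txt

# NS set of the subdomain's own hosted zone (placeholder values):
printf '%s\n' ns-111.awsdns-11.com ns-222.awsdns-22.net \
              ns-333.awsdns-33.org ns-444.awsdns-44.co.uk | sort > child-zone.txt

# The delegation is correct only when the two sets match exactly:
if diff -q parent-delegation.txt child-zone.txt > /dev/null; then
    echo "delegation OK"
else
    echo "delegation BROKEN: parent zone publishes the wrong NS set"
fi
# prints: delegation OK
```

If you pasted the parent's own NS records into the delegation record (the mistake above), the two files differ and the check reports the delegation as broken.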

chrislovecnm commented 7 years ago

So can we close this?

petergarbers commented 7 years ago

Please do.

jitendrabhalothia commented 7 years ago

If the kops version is 1.5+, you need not define --dns-zone=experimental.com.

I also got the same error, but after removing --dns-zone=experimental.com it works; just define the cluster name, that's all.

For example: kops create cluster --name=kops-k8s-expt --state=s3://kops-k8s-experimental --zones=us-east-1a --node-count=2 --node-size=t2.micro --master-size=t2.micro

1ambda commented 6 years ago

--target=terraform also requires --dns-zone.

$ kops version
Version 1.7.0

$ terraform --version
Terraform v0.10.6

rares-urdea commented 5 years ago

Not sure if this is still relevant to anyone, but since I ran into the same issue twice while following the getting started guide, I'll leave it here.

The issue of being unable to resolve the Kubernetes cluster API URL popped up in my case with just a parent domain and what looked like a properly configured api record; no sub-domains. I'm validating the cluster from my local machine, and I am using a Public Hosted Zone.

After several failed attempts, I ran a quick test in Route53 for the api record (go to hosted zone > api record > test record set) using my public IP as the resolver IP address. Running kops validate cluster immediately after this returned a valid cluster response. I'll note that it may simply have been a coincidence and enough time had passed for the issue to just resolve itself as @petergarbers mentioned above, but if not, and others run into this, give it a shot.