poseidon / typhoon

Minimal and free Kubernetes distribution with Terraform
https://typhoon.psdn.io/
MIT License

Typhoon does not start bootkube #232

Closed adrian-securosis closed 6 years ago

adrian-securosis commented 6 years ago

Bug

module.aws-tempest.module.workers.aws_lb_listener.ingress-https: Creating...
  arn:                               "" => ""
  default_action.#:                  "" => "1"
  default_action.0.target_group_arn: "" => "arn:aws:elasticloadbalancing:us-west-2:121741712844:targetgroup/tempest-workers-https/f3d1961f202e7f56"
  default_action.0.type:             "" => "forward"
  load_balancer_arn:                 "" => "arn:aws:elasticloadbalancing:us-west-2:121741712844:loadbalancer/net/tempest-ingress/ec3b402b109abd48"
  port:                              "" => "443"
  protocol:                          "" => "TCP"
  ssl_policy:                        "" => ""
module.aws-tempest.module.workers.aws_lb_listener.ingress-https: Creation complete after 1s (ID: arn:aws:elasticloadbalancing:us-west-2:...ress/ec3b402b109abd48/ef245d91e5b4973e)
module.aws-tempest.module.workers.aws_lb_listener.ingress-http: Creation complete after 1s (ID: arn:aws:elasticloadbalancing:us-west-2:...ress/ec3b402b109abd48/6f0ae73b4c501b94)
module.aws-tempest.aws_route53_record.apiserver: Still creating... (20s elapsed)
module.aws-tempest.aws_route53_record.apiserver: Still creating... (30s elapsed)
module.aws-tempest.aws_route53_record.apiserver: Still creating... (40s elapsed)
module.aws-tempest.aws_route53_record.apiserver: Still creating... (50s elapsed)
module.aws-tempest.aws_route53_record.apiserver: Still creating... (1m0s elapsed)
module.aws-tempest.aws_route53_record.apiserver: Still creating... (1m10s elapsed)
module.aws-tempest.aws_route53_record.apiserver: Still creating... (1m20s elapsed)
module.aws-tempest.aws_route53_record.apiserver: Creation complete after 1m28s (ID: Z202GOHHUL8IHS_tempest.was.example.com._A)
module.aws-tempest.null_resource.bootkube-start: Creating...
module.aws-tempest.null_resource.bootkube-start: Provisioning with 'file'...
module.aws-tempest.null_resource.bootkube-start: Provisioning with 'remote-exec'...
module.aws-tempest.null_resource.bootkube-start (remote-exec): Connecting to remote host via SSH...
module.aws-tempest.null_resource.bootkube-start (remote-exec):   Host: 54.245.19.133
module.aws-tempest.null_resource.bootkube-start (remote-exec):   User: core
module.aws-tempest.null_resource.bootkube-start (remote-exec):   Password: false
module.aws-tempest.null_resource.bootkube-start (remote-exec):   Private key: false
module.aws-tempest.null_resource.bootkube-start (remote-exec):   SSH Agent: true
module.aws-tempest.null_resource.bootkube-start (remote-exec):   Checking Host Key: false
module.aws-tempest.null_resource.bootkube-start (remote-exec): Connected!
module.aws-tempest.null_resource.bootkube-start: Still creating... (10s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (20s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (30s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (40s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (50s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (1m0s elapsed)

...

module.aws-tempest.null_resource.bootkube-start: Still creating... (18m50s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (19m0s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (19m10s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (19m20s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (19m30s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (19m40s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (19m50s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (20m0s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (20m10s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (20m20s elapsed)
module.aws-tempest.null_resource.bootkube-start (remote-exec): Job for bootkube.service failed because the control process exited with error code.
module.aws-tempest.null_resource.bootkube-start (remote-exec): See "systemctl status bootkube.service" and "journalctl -xe" for details.

Error: Error applying plan:

1 error(s) occurred:

Terraform does not automatically rollback in the face of errors. Instead, your Terraform state file has been partially updated with any resources that successfully completed. Please address the error above and apply again to incrementally change your infrastructure.

Environment

Finally, journalctl -xe:

May 30 22:32:22 ip-10-0-0-250.us-west-2.compute.internal update_engine[771]: I0530 22:32:22.270292 771 libcurl_http_fetcher.cc:164] Setting up curl options for HTTPS
May 30 22:32:22 ip-10-0-0-250.us-west-2.compute.internal update_engine[771]: I0530 22:32:22.270382 771 libcurl_http_fetcher.cc:427] Setting up timeout source: 1 seconds.
May 30 22:32:22 ip-10-0-0-250.us-west-2.compute.internal locksmithd[809]: LastCheckedTime=1527716865 Progress=0 CurrentOperation="UPDATE_STATUS_CHECKING_FOR_UPDATE" NewVersion=0.0.0 NewSize=0
May 30 22:32:23 ip-10-0-0-250.us-west-2.compute.internal update_engine[771]: I0530 22:32:23.529968 771 libcurl_http_fetcher.cc:240] HTTP response code: 200
May 30 22:32:23 ip-10-0-0-250.us-west-2.compute.internal update_engine[771]: I0530 22:32:23.532069 771 libcurl_http_fetcher.cc:297] Transfer completed (200), 287 bytes downloaded
May 30 22:32:23 ip-10-0-0-250.us-west-2.compute.internal update_engine[771]: I0530 22:32:23.532233 771 omaha_request_action.cc:592] Omaha request response: <?xml version="1.0" encoding="UTF-8"?> (response body omitted)
May 30 22:32:23 ip-10-0-0-250.us-west-2.compute.internal update_engine[771]: I0530 22:32:23.532907 771 omaha_request_action.cc:386] No update.
May 30 22:32:23 ip-10-0-0-250.us-west-2.compute.internal update_engine[771]: I0530 22:32:23.533053 771 action_processor.cc:82] ActionProcessor::ActionComplete: finished OmahaRequestAction, starting OmahaResponseHandlerAction
May 30 22:32:23 ip-10-0-0-250.us-west-2.compute.internal update_engine[771]: I0530 22:32:23.533161 771 omaha_response_handler_action.cc:36] There are no updates. Aborting.
May 30 22:32:23 ip-10-0-0-250.us-west-2.compute.internal update_engine[771]: I0530 22:32:23.533277 771 action_processor.cc:68] ActionProcessor::ActionComplete: OmahaResponseHandlerAction action failed. Aborting processing.
May 30 22:32:23 ip-10-0-0-250.us-west-2.compute.internal update_engine[771]: I0530 22:32:23.533385 771 action_processor.cc:73] ActionProcessor::ActionComplete: finished last action of type OmahaResponseHandlerAction
May 30 22:32:23 ip-10-0-0-250.us-west-2.compute.internal update_engine[771]: I0530 22:32:23.533490 771 update_attempter.cc:290] Processing Done.
May 30 22:32:23 ip-10-0-0-250.us-west-2.compute.internal update_engine[771]: I0530 22:32:23.533671 771 update_attempter.cc:326] No update.
May 30 22:32:23 ip-10-0-0-250.us-west-2.compute.internal update_engine[771]: I0530 22:32:23.533793 771 update_check_scheduler.cc:74] Next update check in 44m37s
May 30 22:32:23 ip-10-0-0-250.us-west-2.compute.internal locksmithd[809]: LastCheckedTime=1527719543 Progress=0 CurrentOperation="UPDATE_STATUS_IDLE" NewVersion=0.0.0 NewSize=0
May 30 22:40:28 ip-10-0-0-250.us-west-2.compute.internal systemd-timesyncd[724]: Network configuration changed, trying to establish connection.
May 30 22:40:28 ip-10-0-0-250.us-west-2.compute.internal systemd-timesyncd[724]: Synchronized to time server 138.236.128.112:123 (0.coreos.pool.ntp.org).

Problem

I'm trying to run Typhoon in AWS, launching Terraform from a dedicated server in a public subnet. In the dozen attempts I've made, it appears bootkube never starts. The behavior looks very similar to the problem described in https://github.com/poseidon/typhoon/issues/167. A few notes on my setup:

  1. t2.micro for workers -- could this be a problem?
  2. My account predates default VPCs, so I must run "terraform import aws_default_vpc.default vpc-XXXXXXXX". I have tried importing the default VPC this way and have also tried specifying another VPC in my account; the change makes no difference.
  3. I've tried running the install from both an Amazon Linux instance and (what I believe to be) the same CoreOS AMI (ami-574f362f) that Typhoon uses for the controller and workers. I assume either should work.
  4. I've reviewed this post (https://github.com/poseidon/typhoon/issues/167) and I believe I've addressed the issues mentioned there.
  5. I have asset_dir set to the directory I am launching terraform from.

Any input would be greatly appreciated!! -Adrian

Here is some of the (previously) requested output:

Here is my providers.tf file -- I've only truncated the public key (output of cat providers.tf):

provider "aws" { version = "~> 1.13.0" alias = "default"

region = "us-west-2" shared_credentials_file = "/home/user/.aws/credentials" }

resource "aws_default_vpc" "default" { tags { Name = "Default VPC" } }

provider "local" { version = "~> 1.0" alias = "default" }

provider "null" { version = "~> 1.0" alias = "default" }

provider "template" { version = "~> 1.0" alias = "default" }

provider "tls" { version = "~> 1.0" alias = "default" }

module "aws-tempest" { source = "git::https://github.com/poseidon/typhoon//aws/container-linux/kubernetes?ref=v1.10.3"

providers = { aws = "aws.default" local = "local.default" null = "null.default" template = "template.default" tls = "tls.default" }

AWS

cluster_name = "tempest" dns_zone = "was.example.com" dns_zone_id = "Z202GOHHUL8IHS"

configuration

ssh_authorized_key = "ssh-rsa AAAAB3NzaC1yc2EAAAADA...core@ip-10-0-0-250.us-west-2.compute.internal" asset_dir = "/home/core/clusters"

optional

worker_count = 2 worker_type = "t2.micro" }
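For reference, the provisioning flow used with this file is the standard Terraform sequence run from the same directory (a sketch, assuming the Terraform 0.11 CLI used by the tutorial at the time; nothing here is Typhoon-specific beyond the module above):

terraform init    # fetch the module and providers referenced above
terraform plan    # review the resources that will be created
terraform apply   # provision the cluster; bootkube-start runs near the end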

Desired Behavior

Determine why bootkube fails to start.


adrian-securosis commented 6 years ago

One last point: I am following the steps described in this web page to perform the installation.

https://typhoon.psdn.io/cl/aws/#variables

Are there other prep steps / downloads / services not listed here?

dghubble commented 6 years ago

The default controller EC2 instance type is t2.small because it is the smallest/cheapest that functions.
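If instance sizing is the suspect, the relevant knobs are module variables. A sketch, assuming the v1.10.3 module exposes a controller_type variable alongside the worker_type already shown in the config above (the controller_type name is an assumption, not verified against the module source):

module "aws-tempest" {
  # ...
  controller_type = "t2.small"  # default; smallest type said to run the control plane
  worker_type     = "t2.small"  # t2.micro workers may be too constrained
}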

I'm not entirely sure what you're up to with your VPC and image comment. Each Typhoon cluster creates its own VPC; there is no use of the default VPC, and you don't get to pick your own VPC. You also just set os_image to something like coreos-stable; there's no picking AMIs by hand.

Can you clean up the report a bit? We probably don't need full system logs for this, just journalctl -u bootkube.service. Also, seeing no logs means you're probably on the wrong node; the one-time bootstrap is done on the 0th controller. That should get you started.
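A minimal sketch of collecting those logs, assuming the controller's public IP is taken from the EC2 console or Terraform output (the address below is a placeholder):

ssh core@<controller-0-public-ip>
systemctl status bootkube.service
journalctl -u bootkube.service --no-pager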

dghubble commented 6 years ago

Oh, a few corrections.

Basically, let's get the bootkube.service logs. Everything else is just guessing in the dark. Now that I see you're running terraform from an EC2 instance as well, I think you're trying to run journalctl on that same box (I missed this because most folks run terraform from their laptop). You want to ssh to the 0th controller in the cluster. I think that's why you see nothing.

adrian-securosis commented 6 years ago

Thank you. I'll up the size to small just to make sure that is not an issue.

I'll set up a new cluster today and get back to you with the requested logs.

Last thing: if I do NOT run terraform import aws_default_vpc.default vpc-XXXXXXX and keep the associated entry in providers.tf (resource "aws_default_vpc" "default" { tags { Name = "Default VPC" } }), the install fails with a 'failed to find default VPC' error. So maybe it's not Typhoon but something in Terraform? No idea. I'll keep setting it since it's no trouble, but I wanted to make sure it was not the cause of some indeterminate issue. -Adrian
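As a side note, since the Typhoon module provisions its own VPC, one option is to drop the aws_default_vpc resource (and the import) from providers.tf entirely; a sketch of the provider block without it:

provider "aws" {
  version = "~> 1.13.0"
  alias   = "default"

  region                  = "us-west-2"
  shared_credentials_file = "/home/user/.aws/credentials"
}

# no aws_default_vpc resource here, so no "terraform import aws_default_vpc.default ..." is needed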

adrian-securosis commented 6 years ago

New cluster created. Same error:

module.aws-tempest.null_resource.bootkube-start: Still creating... (20m20s elapsed)
module.aws-tempest.null_resource.bootkube-start (remote-exec): Job for bootkube.service failed because the control process exited with error code.
module.aws-tempest.null_resource.bootkube-start (remote-exec): See "systemctl status bootkube.service" and "journalctl -xe" for details.

Error: Error applying plan:

1 error(s) occurred:

Terraform does not automatically rollback in the face of errors. Instead, your Terraform state file has been partially updated with any resources that successfully completed. Please address the error above and apply again to incrementally change your infrastructure.

Log into the controller node:

Last login: Thu May 31 17:26:23 UTC 2018 from 34.220.43.233 on pts/0
Container Linux by CoreOS stable (1745.4.0)
Update Strategy: No Reboots
Failed Units: 1
  bootkube.service
core@ip-10-0-4-150 ~ $ systemctl status bootkube.service
● bootkube.service - Bootstrap a Kubernetes cluster
   Loaded: loaded (/etc/systemd/system/bootkube.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2018-05-31 17:46:43 UTC; 11min ago
  Process: 1434 ExecStart=/opt/bootkube/bootkube-start (code=exited, status=1/FAILURE)
 Main PID: 1434 (code=exited, status=1/FAILURE)

May 31 17:46:43 ip-10-0-4-150 bootkube-start[1434]: [ 1495.548116] bootkube[6]: W0531 17:46:43.222525 6 create.go:57] Unable to determine api-server readiness: API Server http status: 0
May 31 17:46:43 ip-10-0-4-150 bootkube-start[1434]: [ 1495.551940] bootkube[6]: W0531 17:46:43.227117 6 create.go:57] Unable to determine api-server readiness: API Server http status: 0
May 31 17:46:43 ip-10-0-4-150 bootkube-start[1434]: [ 1495.552822] bootkube[6]: E0531 17:46:43.228001 6 create.go:66] API Server is not ready: timed out waiting for the condition
May 31 17:46:43 ip-10-0-4-150 bootkube-start[1434]: [ 1495.553627] bootkube[6]: Error: API Server is not ready: timed out waiting for the condition
May 31 17:46:43 ip-10-0-4-150 bootkube-start[1434]: [ 1495.554457] bootkube[6]: Tearing down temporary bootstrap control plane...
May 31 17:46:43 ip-10-0-4-150 bootkube-start[1434]: [ 1495.556423] bootkube[6]: Error: API Server is not ready: timed out waiting for the condition
May 31 17:46:43 ip-10-0-4-150 bootkube-start[1434]: [ 1495.557318] bootkube[6]: Error: API Server is not ready: timed out waiting for the condition
May 31 17:46:43 ip-10-0-4-150 systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
May 31 17:46:43 ip-10-0-4-150 systemd[1]: bootkube.service: Failed with result 'exit-code'.
May 31 17:46:43 ip-10-0-4-150 systemd[1]: Failed to start Bootstrap a Kubernetes cluster.

And the journalctl -xe command results (and forgive me for the verbosity ... if there is a proper way to format these logs so they are not so ugly please let me know):

May 31 18:04:24 ip-10-0-4-150 etcd-wrapper[4781]: 2018-05-31 18:04:24.707213 W | pkg/netutil: failed resolving host tempest-etcd0.was.example.com:2380 (lookup tempest-etcd0.was.example.com on 10.0.0.2:53: no such host); retrying in 1s
May 31 18:04:24 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:24.752030 911 eviction_manager.go:247] eviction manager: failed to get get summary stats: failed to get node info: node "ip-10-0-4-150" not found
May 31 18:04:24 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:24.787061 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:452: Failed to list *v1.Service: Get https://tempest.was.example.com:443/api/v1/services?limit=500&resourceVersion=0: dial tcp: lookup tempest.was.
May 31 18:04:24 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:24.789366 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://tempest.was.example.com:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-0-4-150&limit=500&reso
May 31 18:04:24 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:24.789589 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:461: Failed to list *v1.Node: Get https://tempest.was.example.com:443/api/v1/nodes?fieldSelector=metadata.name%3Dip-10-0-4-150&limit=500&resourceVe
May 31 18:04:25 ip-10-0-4-150 kubelet-wrapper[911]: I0531 18:04:25.337269 911 kubelet_node_status.go:271] Setting node annotation to enable volume controller attach/detach
May 31 18:04:25 ip-10-0-4-150 kubelet-wrapper[911]: I0531 18:04:25.340613 911 kubelet_node_status.go:82] Attempting to register node ip-10-0-4-150
May 31 18:04:25 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:25.342543 911 kubelet_node_status.go:106] Unable to register node "ip-10-0-4-150" with API server: Post https://tempest.was.example.com:443/api/v1/nodes: dial tcp: lookup tempest.was.example.com on 10.0.0.2:53: no such h
May 31 18:04:25 ip-10-0-4-150 kubelet-wrapper[911]: W0531 18:04:25.550963 911 cni.go:171] Unable to update cni config: No networks found in /etc/kubernetes/cni/net.d
May 31 18:04:25 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:25.551344 911 kubelet.go:2130] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
May 31 18:04:25 ip-10-0-4-150 etcd-wrapper[4781]: 2018-05-31 18:04:25.709688 W | pkg/netutil: failed resolving host tempest-etcd0.was.example.com:2380 (lookup tempest-etcd0.was.example.com on 10.0.0.2:53: no such host); retrying in 1s
May 31 18:04:25 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:25.789886 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://tempest.was.example.com:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-0-4-150&limit=500&reso
May 31 18:04:25 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:25.790533 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:452: Failed to list *v1.Service: Get https://tempest.was.example.com:443/api/v1/services?limit=500&resourceVersion=0: dial tcp: lookup tempest.was.
May 31 18:04:25 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:25.792472 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:461: Failed to list *v1.Node: Get https://tempest.was.example.com:443/api/v1/nodes?fieldSelector=metadata.name%3Dip-10-0-4-150&limit=500&resourceVe
May 31 18:04:26 ip-10-0-4-150 etcd-wrapper[4781]: 2018-05-31 18:04:26.712118 W | pkg/netutil: failed resolving host tempest-etcd0.was.example.com:2380 (lookup tempest-etcd0.was.example.com on 10.0.0.2:53: no such host); retrying in 1s
May 31 18:04:26 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:26.792696 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:452: Failed to list *v1.Service: Get https://tempest.was.example.com:443/api/v1/services?limit=500&resourceVersion=0: dial tcp: lookup tempest.was.
May 31 18:04:26 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:26.793356 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://tempest.was.example.com:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-0-4-150&limit=500&reso
May 31 18:04:26 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:26.795441 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:461: Failed to list *v1.Node: Get https://tempest.was.example.com:443/api/v1/nodes?fieldSelector=metadata.name%3Dip-10-0-4-150&limit=500&resourceVe
May 31 18:04:27 ip-10-0-4-150 etcd-wrapper[4781]: 2018-05-31 18:04:27.714474 W | pkg/netutil: failed resolving host tempest-etcd0.was.example.com:2380 (lookup tempest-etcd0.was.example.com on 10.0.0.2:53: no such host); retrying in 1s
May 31 18:04:27 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:27.795455 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://tempest.was.example.com:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-0-4-150&limit=500&reso
May 31 18:04:27 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:27.796108 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:452: Failed to list *v1.Service: Get https://tempest.was.example.com:443/api/v1/services?limit=500&resourceVersion=0: dial tcp: lookup tempest.was.
May 31 18:04:27 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:27.798266 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:461: Failed to list *v1.Node: Get https://tempest.was.example.com:443/api/v1/nodes?fieldSelector=metadata.name%3Dip-10-0-4-150&limit=500&resourceVe
May 31 18:04:28 ip-10-0-4-150 etcd-wrapper[4781]: 2018-05-31 18:04:28.716621 W | pkg/netutil: failed resolving host tempest-etcd0.was.example.com:2380 (lookup tempest-etcd0.was.example.com on 10.0.0.2:53: no such host); retrying in 1s
May 31 18:04:28 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:28.797996 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:452: Failed to list *v1.Service: Get https://tempest.was.example.com:443/api/v1/services?limit=500&resourceVersion=0: dial tcp: lookup tempest.was.
May 31 18:04:28 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:28.798669 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://tempest.was.example.com:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-0-4-150&limit=500&reso
May 31 18:04:28 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:28.800398 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:461: Failed to list *v1.Node: Get https://tempest.was.example.com:443/api/v1/nodes?fieldSelector=metadata.name%3Dip-10-0-4-150&limit=500&resourceVe
May 31 18:04:28 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:28.943578 911 event.go:209] Unable to write event: 'Patch https://tempest.was.example.com:443/api/v1/namespaces/default/events/ip-10-0-4-150.1533ca87a00f4568: dial tcp: lookup tempest.was.example.com on 10.0.0.2:53: no s
May 31 18:04:29 ip-10-0-4-150 etcd-wrapper[4781]: 2018-05-31 18:04:29.718872 W | pkg/netutil: failed resolving host tempest-etcd0.was.example.com:2380 (lookup tempest-etcd0.was.example.com on 10.0.0.2:53: no such host); retrying in 1s
May 31 18:04:29 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:29.800647 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://tempest.was.example.com:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-0-4-150&limit=500&reso
May 31 18:04:29 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:29.801322 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:452: Failed to list *v1.Service: Get https://tempest.was.example.com:443/api/v1/services?limit=500&resourceVersion=0: dial tcp: lookup tempest.was.
May 31 18:04:29 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:29.803220 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:461: Failed to list *v1.Node: Get https://tempest.was.example.com:443/api/v1/nodes?fieldSelector=metadata.name%3Dip-10-0-4-150&limit=500&resourceVe
May 31 18:04:30 ip-10-0-4-150 kubelet-wrapper[911]: W0531 18:04:30.552529 911 cni.go:171] Unable to update cni config: No networks found in /etc/kubernetes/cni/net.d
May 31 18:04:30 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:30.552891 911 kubelet.go:2130] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
May 31 18:04:30 ip-10-0-4-150 etcd-wrapper[4781]: 2018-05-31 18:04:30.721050 W | pkg/netutil: failed resolving host tempest-etcd0.was.example.com:2380 (lookup tempest-etcd0.was.example.com on 10.0.0.2:53: no such host); retrying in 1s
May 31 18:04:30 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:30.803433 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:452: Failed to list *v1.Service: Get https://tempest.was.example.com:443/api/v1/services?limit=500&resourceVersion=0: dial tcp: lookup tempest.was.
May 31 18:04:30 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:30.804140 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://tempest.was.example.com:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-0-4-150&limit=500&reso
May 31 18:04:30 ip-10-0-4-150 kubelet-wrapper[911]: E0531 18:04:30.806881 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:461: Failed to list *v1.Node: Get https://tempest.was.example.com:443/api/v1/nodes?fieldSelector=metadata.name%3Dip-10-0-4-150&limit=500&resourceVe
May 31 18:04:31 ip-10-0-4-150 etcd-wrapper[4781]: 2018-05-31 18:04:31.649633 E | pkg/netutil: could not resolve host tempest-etcd0.was.example.com:2380
May 31 18:04:31 ip-10-0-4-150 etcd-wrapper[4781]: 2018-05-31 18:04:31.650403 C | etcdmain: failed to resolve https://tempest-etcd0.was.example.com:2380 to match --initial-cluster=etcd0=https://tempest-etcd0.was.example.com:2380 (failed to resolve "https://tempest-etcd0.was.example.com:23
May 31 18:04:31 ip-10-0-4-150 systemd[1]: etcd-member.service: Main process exited, code=exited, status=1/FAILURE
May 31 18:04:31 ip-10-0-4-150 systemd[1]: etcd-member.service: Failed with result 'exit-code'.
May 31 18:04:31 ip-10-0-4-150 systemd[1]: Failed to start etcd (System Application Container).
-- Subject: Unit etcd-member.service has failed


Final thoughts: I'm unable to tell if this was a service that failed to start or a networking error. I did check Route53, and the installer creates the A record (tempest-etcd0.was.example.com), the A record alias (tempest.was.example.com), the SOA (was.example.com), and the NS records correctly. Ports 2379-2380 are also allowed in the security group config, so it seems the networking bits are correct.

Thoughts? Any other data I can provide?

dghubble commented 6 years ago

You can place logs in a gist or wrap in triple ticks (```).

General advice: If bootkube.service can't reach the apiserver, look at some lower-level components: journalctl -u etcd-member and journalctl -u kubelet. Also, in the AWS console, look at the apiserver's NLB and see if it lists the instances as healthy; otherwise bootkube will never be able to reach it.
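A sketch of those checks (the target-group ARN is a placeholder to be read from the AWS console or Terraform state, and the AWS CLI is assumed to be configured for us-west-2):

# on the controller
journalctl -u etcd-member --no-pager | tail -n 50
journalctl -u kubelet --no-pager | tail -n 50

# from any machine with the AWS CLI: are the apiserver NLB targets healthy?
aws elbv2 describe-target-health --target-group-arn <apiserver-target-group-arn>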

Advice for your case: It's definitely a problem that apiserver can't resolve the DNS record for etcd: failed resolving host tempest-etcd0.was.example.com:2380. Can you check that the A record was created, that it points to the controller, and that you can resolve it yourself (ssh to the controller and dig)? Check the etcd logs too. Something is weird here.

Also look at how your Terraform definition may differ from the tutorial or make some assumptions. Typhoon mostly eliminates possible ways of misconfiguring things; failures and needing to debug aren't a normal thing (just for context). I deployed an AWS cluster yesterday, too.

dghubble commented 6 years ago

In regard to the default VPC thing, I'm not familiar with that. Typhoon shouldn't be using the default VPC anywhere. I'd be curious if it is trying to somewhere. Heck, if I could delete my default VPC, I would, but AWS doesn't seem to allow it.

adrian-securosis commented 6 years ago

The route53 info on was.example.com is as follows (sole hosted zone in this environment):

was.example.com.               | SOA | ns-1536.awsdns-00.co.uk. awsdns-hostmaster.ama ...
tempest-etcd0.was.example.com. | A   | 10.0.4.150
tempest.was.example.com.       | A   | ALIAS tempest-apiserver-a8adaa16494a272a.elb.us-west-2.amazonaws.com. (z18d5fsroun65g)

This looks to me like it points to the controller, but I am not an expert here. Also of note, the "Alias Hosted Zone ID: Z18D5FSROUN65G" seems to be associated with both tempest-apiserver and tempest-ingress; as these are public facing, maybe z18d5fsroun65g was created for them as part of the install process for the load balancers. It is not the same as the hosted dns_zone_id = "Z202GOHHUL8IHS", but I think this is as it should be.

dig tempest-etcd0.was.example.com

; <<>> DiG 9.10.2-P4 <<>> tempest-etcd0.was.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 5779
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;tempest-etcd0.was.example.com. IN A

;; AUTHORITY SECTION:
example.com. 32 IN SOA sns.dns.icann.org. noc.dns.icann.org. 2018050804 7200 3600 1209600 3600

;; Query time: 0 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Thu May 31 22:16:14 UTC 2018
;; MSG SIZE rcvd: 115

What is server 10.0.0.2? OK - now we are onto something.

journalctl -u kubelet is showing something very wrong -- it is a very long list of this same error message:

May 31 22:16:58 ip-10-0-4-150 kubelet-wrapper[911]: E0531 22:16:58.819005 911 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:452: Failed to list *v1.Service: Get https://tempest.was.example.com:443/api/v1/services?limit=500&resourceVersion=0: dial tcp: lookup tempest.was.example.com on 10.0.0.2:53: no such host

Which makes sense, as 10.0.0.2 is not associated with any host, public or private, in my environment. Where would it get that IP? The workers are 10.0.13.93 and 10.0.47.10, the controller is 10.0.4.150, and my 'launchpads' for setting up the clusters are 10.0.0.250 and 10.0.0.172. I have not allocated an EIP with this address. Any idea why it would be looking for this IP?

Not sure how to fix this or what would cause it.
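Two checks that may narrow this down: 10.0.0.2 is normally the Amazon-provided DNS resolver for a 10.0.0.0/16 VPC (the VPC base address plus two), so these lookups are going through VPC DNS. A sketch, with the VPC ID as a placeholder and the zone ID taken from this thread:

# confirm the cluster VPC has DNS resolution and hostnames enabled
aws ec2 describe-vpc-attribute --vpc-id <tempest-vpc-id> --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id <tempest-vpc-id> --attribute enableDnsHostnames

# for a private hosted zone, list which VPCs it is associated with
aws route53 get-hosted-zone --id Z202GOHHUL8IHS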

My Terraform template is cut directly from the tutorial page, and I have added/changed the following elements:

region = "us-west-2"
shared_credentials_file = "/home/user/.aws/credentials"
dns_zone = "was.example.com"
dns_zone_id = "Z202GOHHUL8IHS"
ssh_authorized_key
asset_dir
resource "aws_default_vpc" "default" { tags { Name = "Default VPC" } }

Nothing else has been changed.

Thanks!

adrian-securosis commented 6 years ago

I have a successful build of the cluster.

I noticed that the cluster did not associate itself with the private hosted zone, so I did two things:

1. Removed all references to the default VPC.
2. Manually associated the VPC with the hosted zone just after the new VPC was created.

It's possible the default VPC failure was a one-time spurious thing. No idea why the association did not occur on the last couple of builds, but manual association did the trick.
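For reference, the manual association described above can also be scripted with the AWS CLI (a sketch; the VPC ID is a placeholder and the zone ID is the one from this thread):

aws route53 associate-vpc-with-hosted-zone \
  --hosted-zone-id Z202GOHHUL8IHS \
  --vpc VPCRegion=us-west-2,VPCId=<tempest-vpc-id>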

Closing this report.
