kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Apiserver becomes unavailable: kube-proxy logs net/http: TLS handshake timeout, then dial tcp 10.61.78.17:443: i/o timeout, or connect: no route to host #10146

Closed: dmcnaught closed this issue 3 years ago

dmcnaught commented 4 years ago

1. What kops version are you running? The command kops version will display this information.
1.17.2
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.
1.17.13
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
This has been happening for a month or so. Currently the cluster works for about a week before hitting the error and becoming unavailable. I've been able to get the cluster back up by running the command below (a diagnostic sketch for confirming the symptom and re-validating the cluster follows at the end of this report):

kops rolling-update cluster --instance-group-roles master --fail-on-validate-error="false" --cloudonly --force --yes

Thread in Slack: https://kubernetes.slack.com/archives/C3QUFP0QM/p1603730470066300

5. What happened after the commands executed?
N/A
6. What did you expect to happen?
N/A
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

» kops get cluster $NAME -oyaml                                                                                                                                                                 
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  generation: 17
  name: <redacted>
spec:
  additionalPolicies:
    node: |
      [
        {
          "Effect": "Allow",
          "Action": ["ec2:AttachVolume"],
          "Resource": ["*"]
        },
        {
          "Effect": "Allow",
          "Action": ["ec2:DetachVolume"],
          "Resource": ["*"]
        }
      ]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    kubernetes.io/cluster/<redacted>: owned
    kubernetes.io/role/elb: "1"
  cloudProvider: aws
  configBase: s3://<redacted>
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    - instanceGroup: master-us-east-1d
      name: d
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    - instanceGroup: master-us-east-1d
      name: d
    memoryRequest: 100Mi
    name: events
  fileAssets:
  - content: |
      apiVersion: audit.k8s.io/v1
      kind: Policy
      rules:
        - level: None
          users:
            - kops
            - kubelet
            - system:apiserver
            - system:kube-apiserver
            - system:kube-controller-manager
            - system:kube-proxy
            - system:kube-scheduler
            - system:serviceaccount:default:newrelic
            - system:serviceaccount:default:splunk-connect-splunk-kubernetes-metrics
            - system:serviceaccount:default:splunk-connect-splunk-kubernetes-objects
            - system:serviceaccount:deis:deis-builder
            - system:serviceaccount:deis:deis-logger-fluentd
            - system:serviceaccount:deis:deis-monitor-telegraf
            - system:serviceaccount:deis:deis-router
            - system:serviceaccount:deis:deis-workflow-manager
            - system:serviceaccount:kube-system:alb-ingress
            - system:serviceaccount:kube-system:cluster-autoscaler
            - system:serviceaccount:kube-system:cronjob-controller
            - system:serviceaccount:kube-system:daemon-set-controller
            - system:serviceaccount:kube-system:deployment-controller
            - system:serviceaccount:kube-system:dns-controller
            - system:serviceaccount:kube-system:endpoint-controller
            - system:serviceaccount:kube-system:generic-garbage-collector
            - system:serviceaccount:kube-system:heapster
            - system:serviceaccount:kube-system:horizontal-pod-autoscaler
            - system:serviceaccount:kube-system:kube-dns
            - system:serviceaccount:kube-system:kube-dns-autoscaler
            - system:serviceaccount:kube-system:metrics-server
            - system:serviceaccount:kube-system:node-controller
            - system:serviceaccount:kube-system:namespace-controller
            - system:serviceaccount:kube-system:pod-garbage-collector
            - system:serviceaccount:kube-system:resourcequota-controller
            - system:serviceaccount:kube-system:replicaset-controller
            - system:serviceaccount:kube-system:route-controller
            - system:serviceaccount:kube-system:service-controller
            - system:serviceaccount:kube-system:tiller-deploy
            - system:serviceaccount:kube-system:ttl-controller
            - system:serviceaccount:kube-system:weave-net
            - system:serviceaccount:monitoring:kube-state-metrics
            - system:serviceaccount:monitoring:node-exporter
            - system:serviceaccount:monitoring:prometheus-k8s
            - system:serviceaccount:monitoring:prometheus-operator
            - system:serviceaccount:default:splunk-connect-splunk-kubernetes-logging
            - system:unsecured
        # Default level for all other requests.
        - level: Metadata
          omitStages:
            - "RequestReceived"
    name: audit-policy-file
    path: /srv/kubernetes/audit.yaml
    roles:
    - Master
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    auditLogMaxAge: 10
    auditLogMaxBackups: 1
    auditLogMaxSize: 100
    auditLogPath: /var/log/kube-apiserver-dev1-audit.log
    auditPolicyFile: /srv/kubernetes/audit.yaml
    runtimeConfig:
      extensions/v1beta1/daemonsets: "true"
      extensions/v1beta1/deployments: "true"
      extensions/v1beta1/networkpolicies: "true"
      extensions/v1beta1/replicasets: "true"
  kubeControllerManager:
    horizontalPodAutoscalerUseRestClients: true
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
  - <redacted>
  kubernetesVersion: 1.17.13
  masterInternalName: <redacted>
  masterPublicName: <redacted>
  networkCIDR: 10.61.0.0/16
  networking:
    weave:
      mtu: 8912
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 10.5.0.0/16
  subnets:
  - cidr: 10.61.32.0/19
    name: us-east-1b
    type: Private
    zone: us-east-1b
  - cidr: 10.61.64.0/19
    name: us-east-1c
    type: Private
    zone: us-east-1c
  - cidr: 10.61.96.0/19
    name: us-east-1d
    type: Private
    zone: us-east-1d
  - cidr: 10.61.128.0/19
    name: us-east-1e
    type: Private
    zone: us-east-1e
  - cidr: 10.61.0.0/22
    name: utility-us-east-1b
    type: Utility
    zone: us-east-1b
  - cidr: 10.61.4.0/22
    name: utility-us-east-1c
    type: Utility
    zone: us-east-1c
  - cidr: 10.61.8.0/22
    name: utility-us-east-1d
    type: Utility
    zone: us-east-1d
  - cidr: 10.61.12.0/22
    name: utility-us-east-1e
    type: Utility
    zone: us-east-1e
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

8. Please run the commands with the most verbose logging by adding the -v 10 flag. Paste the logs into this report, or put them in a gist and provide the gist link here.

9. Anything else we need to know?
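
For reference, a minimal sketch of how one might confirm the kube-proxy symptom and then re-validate the cluster after the forced rolling update above. It assumes kubectl access, the cluster name in $NAME, and a placeholder kube-proxy pod name; none of these specifics are from the original report:

kubectl -n kube-system get pods -o wide | grep kube-proxy   # find the per-node kube-proxy pod
kubectl -n kube-system logs kube-proxy-<node-name>          # look for TLS handshake timeout / i/o timeout errors
kops validate cluster --name $NAME                          # confirm the control plane answers again after the rolling update
kubectl get nodes                                           # all nodes should report Ready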

dmcnaught commented 4 years ago

Instance groups:

Masters:

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2017-11-03T00:03:39Z"
  generation: 7
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-east-1b
spec:
  image: kope.io/k8s-1.17-debian-stretch-amd64-hvm-ebs-2020-01-17
  machineType: c3.xlarge
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-east-1b

Nodes:

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-01-07T22:32:30Z"
  generation: 10
  labels:
    kops.k8s.io/cluster: <redacted>
  name: c4xlarge-wellpass
spec:
  image: kope.io/k8s-1.17-debian-stretch-amd64-hvm-ebs-2020-01-17
  machineType: c4.xlarge
  maxSize: 30
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: c4xlarge-wellpass
    wellpass: "true"
  role: Node
  subnets:
  - us-east-1b
  - us-east-1c
  - us-east-1d
  - us-east-1e

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

olemarkus commented 3 years ago

Do you still experience this problem?

dmcnaught commented 3 years ago

Adding more memory to the masters seems to be working, thanks @olemarkus
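
For anyone landing here later, a rough sketch of how "more memory on the masters" can be applied with kops. The replacement machine type (r5.xlarge) and the $NAME variable are illustrative assumptions, not what this cluster actually used:

# Edit each master instance group and choose a machine type with more memory,
# e.g. change machineType: c3.xlarge to something like r5.xlarge (example only)
kops edit instancegroup master-us-east-1b --name $NAME
kops edit instancegroup master-us-east-1c --name $NAME
kops edit instancegroup master-us-east-1d --name $NAME

# Apply the change and roll the masters
kops update cluster --name $NAME --yes
kops rolling-update cluster --instance-group-roles master --name $NAME --yes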

fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/kops/issues/10146#issuecomment-812027086):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.