kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Issues with Kubernetes multi-master using kubeadm on premises #67389

Closed · geomarsi closed this issue 6 years ago

geomarsi commented 6 years ago

Following the Kubernetes v1.11 documentation, I managed to set up Kubernetes high availability using kubeadm with stacked control-plane nodes: 3 masters running on-premises on CentOS 7 VMs. With no load balancer available, I used Keepalived to set up a failover virtual IP (10.171.4.12) for the apiserver, as described in the Kubernetes v1.10 documentation. As a result, the "kubeadm-config.yaml" used to bootstrap the control planes had the following header:

apiVersion: kubeadm.k8s.io/v1alpha2  
kind: MasterConfiguration  
kubernetesVersion: v1.11.0  
apiServerCertSANs:  
- "10.171.4.12"  
api:  
    controlPlaneEndpoint: "10.171.4.12:6443"  
etcd:  
  ...
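
For reference, a minimal keepalived.conf sketch for the failover VIP mentioned above (10.171.4.12); the interface name, router ID, priority and password are assumptions, not what my cluster necessarily uses:

# Write a basic VRRP config for the apiserver VIP (run on each master,
# adjusting state/priority per node); every value except the VIP is an assumption.
cat > /etc/keepalived/keepalived.conf <<'EOF'
vrrp_instance K8S_APISERVER {
    state MASTER             # BACKUP on the other two masters
    interface eth0           # assumed NIC name
    virtual_router_id 51
    priority 100             # lower priority on the BACKUP nodes
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass k8s-vip    # assumed shared secret
    }
    virtual_ipaddress {
        10.171.4.12
    }
}
EOF
systemctl enable --now keepalived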

The configuration went fine, apart from the following warning that appeared when bootstrapping all 3 masters:

[endpoint] WARNING: port specified in api.controlPlaneEndpoint overrides api.bindPort in the controlplane address

And this warning when joining the workers:

[WARNING RequiredIPVSKernelModulesAvailable]: the IPVS proxier will not be used, because the following required kernel modules are not loaded: [ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh] or no builtin kernel ipvs support: map[ip_vs:{} ip_vs_rr:{} ip_vs_wrr:{} ip_vs_sh:{} nf_conntrack_ipv4:{}] you can solve this problem with following methods:

  1. Run 'modprobe -- ' to load missing kernel modules;
  2. Provide the missing builtin kernel ipvs support
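
For reference, the modules named in the warning can be preloaded on each node; a minimal sketch for CentOS 7, with the module list taken from the warning above:

# Load the IPVS modules now and keep them loaded across reboots.
for m in ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh nf_conntrack_ipv4; do
    modprobe "$m"
done
cat > /etc/modules-load.d/ipvs.conf <<'EOF'
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack_ipv4
EOF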

Afterwards, basic tests succeed.

But then these issues come up:

I am running Kubernetes v1.11.1, but kubeadm-config.yaml mentions 1.11.0. Is this something I should worry about?

Should I not follow the official documentation and instead go for an alternative such as the one described at https://medium.com/@bambash/ha-kubernetes-cluster-via-kubeadm-b2133360b198?

Important note: after running a couple of labs, I got the same issues with:

-- Nginx controller pod events & logs:

  Normal   Pulled     28m (x38 over 2h)  kubelet, node3.local  Container image "quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.17.1" already present on machine  
  Warning  Unhealthy  7m (x137 over 2h)  kubelet, node3.local  Liveness probe failed: Get http://10.240.3.14:10254/healthz: dial tcp 10.240.3.14:10254: connect: connection refused  
  Warning  BackOff    2m (x502 over 2h)  kubelet, node3.local  Back-off restarting failed container  

nginx version: nginx/1.13.12  
W0809 14:05:46.171066       5 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.  
I0809 14:05:46.171748       5 main.go:191] Creating API client for https://10.250.0.1:443

-- helm command outputs:

# helm install ...  
Error: no available release name found

# helm list  
Error: Get https://10.250.0.1:443/api/v1/namespaces/kube-system/configmaps?labelSelector=OWNER%!D(MISSING)TILLER: dial tcp 10.250.0.1:443: i/o timeout
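
A quick way to reproduce that symptom from inside the cluster (the pod name and image are arbitrary; any HTTP response means the ClusterIP is reachable, while a timeout matches the helm error):

# Try to reach the kubernetes Service VIP (10.250.0.1:443) from a throwaway pod.
kubectl run netcheck --rm -it --restart=Never --image=curlimages/curl -- \
    curl -k -m 5 https://10.250.0.1:443/version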

-- kubernetes service & endpoints:

# kubectl describe svc kubernetes
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.250.0.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.171.4.10:6443,10.171.4.8:6443,10.171.4.9:6443
Session Affinity:  None
Events:            <none>

# kubectl get endpoints --all-namespaces
NAMESPACE       NAME                      ENDPOINTS                                               AGE
default         bc-svc                    10.240.3.27:8080                                        6d
default         kubernetes                10.171.4.10:6443,10.171.4.8:6443,10.171.4.9:6443        7d
ingress-nginx   default-http-backend      10.240.3.24:8080                                        4d
kube-system     kube-controller-manager   <none>                                                  7d
kube-system     kube-dns                  10.240.2.4:53,10.240.2.5:53,10.240.2.4:53 + 1 more...   7d
kube-system     kube-scheduler            <none>                                                  7d
kube-system     tiller-deploy             10.240.3.25:44134                                       5d 
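
So the kubernetes Service has healthy apiserver endpoints; what seems broken is the path from pods to the ClusterIP. A few checks worth running on the affected node (assuming kube-proxy carries its default kubeadm labels):

# Is the ClusterIP programmed by kube-proxy on this node?
iptables-save | grep 10.250.0.1        # iptables mode: expect KUBE-SERVICES rules
ipvsadm -Ln | grep -A3 10.250.0.1      # only relevant if kube-proxy runs in IPVS mode
# Anything suspicious in kube-proxy itself?
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50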
geomarsi commented 6 years ago

@kubernetes/sig-api-machinery @kubernetes/kind-bug

geomarsi commented 6 years ago

/sig architecture /sig contributor-experience-test-failures /sig network /sig testing /wg kubeadm-adoption

neolit123 commented 6 years ago

/remove-wg kubeadm-adoption

geomarsi commented 6 years ago

Problems solved when I switched my pod network from Flannel to Calico. (Tested on Kubernetes 1.11.0; will repeat the tests tomorrow on the latest k8s version, 1.11.2.)
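
A rough outline of the swap, in case it helps others; the manifest file names are placeholders, and the Calico manifest's pod CIDR has to match the cluster's podSubnet:

# Remove the old CNI and apply Calico (use the Calico manifest matching your k8s version).
kubectl delete -f kube-flannel.yml
kubectl apply -f calico.yaml
kubectl -n kube-system get pods -o wide      # wait for calico-node to be Ready on every node
# Pods created under Flannel keep their old addresses, so recreate at least the DNS pods:
kubectl -n kube-system delete pods -l k8s-app=kube-dns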

geomarsi commented 6 years ago

Tests successful with k8s versions 1.11.1 and 1.11.2. No more issues with Calico.

hextrim commented 6 years ago

Hi,

I have the same problem with 1.11.3 and the HA cluster setup detailed in https://kubernetes.io/docs/setup/independent/high-availability/

I have an HAProxy LB:


    bind 192.168.1.30:6443
#    bind 127.0.0.1:443
    mode tcp
    option tcplog
    default_backend k8s-api-backend

backend k8s-api-backend
    mode tcp
    option tcplog
    option tcp-check
    balance roundrobin
    default-server inter 10s downinter 5s rise 2 fall 2 slowstart 60s maxconn 250 maxqueue 256 weight 100
    server ht-dkh-01 192.168.1.21:2380 check
    server ht-dkh-02 192.168.1.22:2380 check
    server ht-dkh-03 192.168.1.23:2380 check
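
Note: if k8s-api-backend is meant to front the kube-apiserver, the backend members would normally be the apiservers' secure port (6443 by default, as in the frontend bind above) rather than the etcd peer port 2380. A sketch reusing the host names and IPs from the config above:

# Intended contents of the apiserver backend, written to a scratch file for reference.
cat > /tmp/k8s-api-backend.cfg <<'EOF'
backend k8s-api-backend
    mode tcp
    option tcp-check
    balance roundrobin
    server ht-dkh-01 192.168.1.21:6443 check
    server ht-dkh-02 192.168.1.22:6443 check
    server ht-dkh-03 192.168.1.23:6443 check
EOF
# Merge this stanza into /etc/haproxy/haproxy.cfg (replacing the existing backend),
# then reload: systemctl reload haproxy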

On my 3rd node I get:
[root@ht-dkh-03 ~]# kubectl exec -n kube-system etcd-${CP0_HOSTNAME} -- etcdctl --ca-file /etc/kubernetes/pki/etcd/ca.crt --cert-file /etc/kubernetes/pki/etcd/peer.crt --key-file /etc/kubernetes/pki/etcd/peer.key --endpoints=https://${CP0_IP}:2379 member add ${CP2_HOSTNAME} https://${CP2_IP}:2380
Unable to connect to the server: x509: certificate is valid for ht-dkh-02, localhost, ht-dkh-02, not ht-dklb-01
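
The names actually present in that peer certificate can be checked directly, to see whether the load balancer's name or IP is missing from the SANs:

# Inspect the Subject Alternative Names of the etcd peer certificate on this node.
openssl x509 -in /etc/kubernetes/pki/etcd/peer.crt -noout -text \
    | grep -A1 'Subject Alternative Name'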

I am really stuck with setting this up. Should I follow a mixture of the 1.10 and 1.11 documentation for this setup?

geomarsi commented 6 years ago

Hello @hextrim, did you first copy the content of /etc/kubernetes/... from Master1 to Master3 before running any kubectl and kubeadm commands on Master3? I can't tell what enhancements were made in 1.11.3, but based on my experience with 1.11.2, it failed with an external HAProxy LB and worked with Keepalived set up on the master nodes themselves, as per the 1.10 documentation.
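
For reference, a sketch of that copy step, based on the files the v1.11 HA guide shares between control-plane nodes (the user and target host are examples):

# Run from Master1; copies the shared CA and service-account material to the new master.
USER=root
TARGET=ht-dkh-03    # example target host
ssh "${USER}@${TARGET}" mkdir -p /etc/kubernetes/pki/etcd
scp /etc/kubernetes/pki/ca.crt /etc/kubernetes/pki/ca.key \
    /etc/kubernetes/pki/sa.key /etc/kubernetes/pki/sa.pub \
    /etc/kubernetes/pki/front-proxy-ca.crt /etc/kubernetes/pki/front-proxy-ca.key \
    "${USER}@${TARGET}:/etc/kubernetes/pki/"
scp /etc/kubernetes/pki/etcd/ca.crt /etc/kubernetes/pki/etcd/ca.key \
    "${USER}@${TARGET}:/etc/kubernetes/pki/etcd/"
scp /etc/kubernetes/admin.conf "${USER}@${TARGET}:/etc/kubernetes/"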

hextrim commented 6 years ago

Hi @geomarsi, I managed to set up "stacked etcd" behind HAProxy on CentOS 7.5.1804 without issues, using kubeadm, kubectl, and kubelet v1.11.3 and following the official documentation here: https://kubernetes.io/docs/setup/independent/high-availability/

The next step is to set up the cluster with "external etcd".

timothysc commented 6 years ago

Closing per previous comment https://github.com/kubernetes/kubernetes/issues/67389#issuecomment-413300366