mbruzek / layer-k8s

A repository for the Kubernetes charm implemented in the reactive framework.

Scheduling of pods in current setup doesn't appear to guarantee N+1 fault tolerance #24

Open · lazypower opened this issue 8 years ago

lazypower commented 8 years ago

Migrated from an email message from Eddy Truyen:

I have an OpenStack environment with 3 Kubernetes units and a Redis service deployed with 4 replicated pods:

ubuntu@juju-openstack-machine-5:~$ kubectl get nodes
NAME            LABELS                                 STATUS    AGE
172.17.13.141   kubernetes.io/hostname=172.17.13.141   Ready     1d
172.17.13.144   kubernetes.io/hostname=172.17.13.144   Ready     1d
172.17.13.145   kubernetes.io/hostname=172.17.13.145   Ready     1d

ubuntu@juju-openstack-machine-5:~$ kubectl get services
NAME          CLUSTER_IP    EXTERNAL_IP   PORT(S)     SELECTOR                            AGE
kubernetes    10.1.0.1      <none>        443/TCP     <none>                              1d
redis-slave   10.1.2.166    <none>        6379/TCP    app=redis,role=slave,tier=backend   1d

ubuntu@juju-openstack-machine-5:~$ kubectl get rc
CONTROLLER    CONTAINER(S)   IMAGE(S)                                 SELECTOR                            REPLICAS   AGE
redis-slave   slave          gcr.io/google_samples/gb-redisslave:v1   app=redis,role=slave,tier=backend   4          1d

First, when I suspended 2 machines in my OpenStack dashboard, the migration of the pods went really slowly. Eventually they were all placed on the remaining machine, where the Kubernetes master is located.

Then, when I resumed the two suspended machines in my OpenStack dashboard, the pods were not redistributed across different machines to ensure fault tolerance of services. I thought the scheduling specification entails that when you create a service and thereafter a replication controller, pods should always be placed on different machines if that is possible. However, this behavior does not occur. Below you find some command output for the redis-slave service, which has 4 replicated pods. It shows that all 4 pods are placed on the same machine. In the attachment you also find the pod description.

ubuntu@juju-openstack-machine-5:~$ kubectl describe pod -l app=redis | grep Node
Node:                           172.17.13.144/172.17.13.144
Node:                           172.17.13.144/172.17.13.144
Node:                           172.17.13.144/172.17.13.144
Node:                           172.17.13.144/172.17.13.144
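The scheduler only places a pod when it is created; it never moves pods that are already running. So one way to force the replicas to spread again (a sketch; I have not verified this on the cluster above) would be to delete the pods and let the replication controller recreate them, at which point the scheduler's spreading priority can place them across all Ready nodes:

# Delete the redis-slave pods; the RC immediately recreates them,
# and the scheduler can now spread the new replicas across the nodes.
kubectl delete pods -l app=redis,role=slave,tier=backend
kubectl describe pod -l app=redis | grep Node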

After this experiment, juju also reported some problems with lost agents (see the attached files juju status.txt and juju debug-log.txt). Running the command 'juju status' a second time then resulted in a deadlock; at least, nothing happened anymore.

Attachments: poddescription.txt, juju status.txt, juju debug-log.txt

eddytruyen commented 8 years ago

I found out that the reported behavior of kubernetes-core also appears in the commercial service Google Container Engine. See the attached file (googlecontainerengine.txt).

Below I describe my findings.

This was my initial desired state:

[centos@public-kubernetes-gw ~]$ kubectl get nodes
NAME                               STATUS    AGE
gke-cluster-1-09071288-node-7mmj   Ready     1m
gke-cluster-1-09071288-node-buym   Ready     1m
gke-cluster-1-09071288-node-demf   Ready     1m
[centos@public-kubernetes-gw ~]$ kubectl run nginx --image nginx --port=80
deployment "nginx" created
[centos@public-kubernetes-gw ~]$ kubectl describe pods -l run | grep Node
Node:           gke-cluster-1-09071288-node-7mmj/10.128.0.2
[centos@public-kubernetes-gw ~]$ kubectl scale deployment nginx --replicas=0
deployment "nginx" scaled
[centos@public-kubernetes-gw ~]$ kubectl expose deployment nginx
service "nginx" exposed
[centos@public-kubernetes-gw ~]$ kubectl scale deployment nginx --replicas=3
deployment "nginx" scaled
[centos@public-kubernetes-gw ~]$ kubectl describe pods -l run | grep Node
Node:           gke-cluster-1-09071288-node-7mmj/10.128.0.2
Node:           gke-cluster-1-09071288-node-demf/10.128.0.4
Node:           gke-cluster-1-09071288-node-buym/10.128.0.3

Then I suspended 2 machines via the Google Cloud dashboard. These machines were automatically restarted by Google Container Engine.

At a certain moment I reached the following state:

[centos@public-kubernetes-gw ~]$ kubectl get nodes
NAME                               STATUS    AGE
gke-cluster-1-09071288-node-7mmj   Ready     1m
gke-cluster-1-09071288-node-buym   Ready     4m
gke-cluster-1-09071288-node-demf   Ready     22m
[centos@public-kubernetes-gw ~]$ kubectl describe pods -l run | grep Node:
Node:           gke-cluster-1-09071288-node-demf/10.128.0.4
Node:           gke-cluster-1-09071288-node-demf/10.128.0.4
Node:           gke-cluster-1-09071288-node-demf/10.128.0.4

Then Google Container Engine's behavior diverged from the kubernetes-core bundle's. In fact, Google Container Engine did something very surprising: it suspended the machine where all 3 replicas were placed:

[centos@public-kubernetes-gw ~]$ kubectl get nodes
NAME                               STATUS     AGE
gke-cluster-1-09071288-node-7mmj   Ready      5m
gke-cluster-1-09071288-node-buym   Ready      8m
gke-cluster-1-09071288-node-demf   NotReady   26m
[centos@public-kubernetes-gw ~]$ kubectl describe pods -l run | grep Node:
Node:           gke-cluster-1-09071288-node-demf/10.128.0.4
Node:           gke-cluster-1-09071288-node-demf/10.128.0.4
Node:           gke-cluster-1-09071288-node-demf/10.128.0.4

At this point the nginx service was unavailable:

[centos@public-kubernetes-gw ~]$ kubectl describe pods -l run | grep IP:
IP:
IP:
IP:

I think Google Container Engine suspended 10.128.0.4 in order to trigger the replication controller (a.k.a. the deployment manager) to redistribute the pods. Indeed, after some time:

[centos@public-kubernetes-gw ~]$ kubectl describe pods -l run | grep IP:
IP:             10.0.1.2
IP:             10.0.1.3
IP:             10.0.0.4
[centos@public-kubernetes-gw ~]$ kubectl describe pods -l run | grep Node:
Node:           gke-cluster-1-09071288-node-7mmj/10.128.0.2
Node:           gke-cluster-1-09071288-node-7mmj/10.128.0.2
Node:           gke-cluster-1-09071288-node-buym/10.128.0.3
[centos@public-kubernetes-gw ~]$ kubectl get nodes
NAME                               STATUS     AGE
gke-cluster-1-09071288-node-7mmj   Ready      10m
gke-cluster-1-09071288-node-buym   Ready      13m
gke-cluster-1-09071288-node-demf   NotReady   31m
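For context, the delay before the pods came back matches the node controller's eviction behavior: pods stay bound to a NotReady node until the eviction timeout expires, after which they are deleted and the deployment recreates them on healthy nodes. A sketch of the relevant setting (a kube-controller-manager flag, default 5m0s; not something Google Container Engine exposes directly):

# kube-controller-manager flag governing how long pods remain bound
# to a NotReady node before being evicted and rescheduled elsewhere.
kube-controller-manager --pod-eviction-timeout=5m0s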

Attachment: googlecontainerengine.txt

eddytruyen commented 8 years ago

So I first thought this was an issue with Kubernetes. In fact, it is not: it is the default behavior of the Kubernetes scheduler, and this default behavior can be customized; e.g., pod placement can be controlled via node selection (see the sketch below).
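For illustration, a minimal node-selection sketch (the disktype=ssd label and the pod below are hypothetical, not taken from the deployments above). First attach a label to one of the nodes:

# Attach a hypothetical label to one of the nodes.
kubectl label nodes gke-cluster-1-09071288-node-7mmj disktype=ssd

Then a pod spec (JSON, matching the policy listing further down) can restrict placement to nodes carrying that label via nodeSelector:

{
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "redis-pinned"},
    "spec": {
        "nodeSelector": {"disktype": "ssd"},
        "containers": [{"name": "redis", "image": "redis"}]
    }
}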

eddytruyen commented 8 years ago

I am getting excited about this problem, so it's time for me to put it aside and focus on something else before I spend too much time on it.

In summary, after a night of sleep, I think there is an issue with Kubernetes in the kubernetes-core bundle after all. I had a look at how the Kubernetes scheduler is designed: it is policy-configurable and extensible.

The default policy of the scheduler corresponds to the policy below. As you can see, the ServiceSpreadingPriority in this policy should definitely change a situation where all pods of a service run on a single node while the other nodes sit idle. So the question is: why is the scheduler never invoked to change that situation in the kubernetes-core bundle?

In the Google Container Engine scenario, they take a far too drastic approach to trigger the scheduler: shutting down the single node so that the scheduler becomes active. It's like trying to resolve a traffic jam on a highway: you hope traffic distributes across secondary routes, and to achieve that you use road signs; you do not shut down the highway. ;)

{
    "kind" : "Policy",
    "version" : "v1",
    "predicates" : [
        {"name" : "PodFitsPorts"},
        {"name" : "PodFitsResources"},
        {"name" : "NoDiskConflict"},
        {"name" : "MatchNodeSelector"},
        {"name" : "HostName"}
    ],
    "priorities" : [
        {"name" : "LeastRequestedPriority", "weight" : 1},
        {"name" : "BalancedResourceAllocation", "weight" : 1},
        {"name" : "ServiceSpreadingPriority", "weight" : 1}
    ]
}
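For reference, a custom policy like the one above is handed to the scheduler through its --policy-config-file flag (a sketch; the file path and master address here are hypothetical):

# Point kube-scheduler at a custom policy file (hypothetical paths).
kube-scheduler \
    --master=http://127.0.0.1:8080 \
    --policy-config-file=/etc/kubernetes/scheduler-policy.json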
lazypower commented 8 years ago

Great investigative work. I haven't had a chance to dive into this, but we will certainly circle back and evaluate your findings.

Thanks a ton for the detailed writeup here :)

eddytruyen commented 8 years ago

This issue has been added as a requirement for the work on a Kubernetes rescheduler.

See: https://github.com/kubernetes/kubernetes/issues/12140

So I think this issue can be closed here.