lazypower opened 8 years ago
I found out that the reported behavior of kubernetes-core also appears in the commercial Google Container Engine service. See the attached RAW file.
Below I describe my findings.
This was my initial desired state:
[centos@public-kubernetes-gw ~]$ kubectl get nodes
NAME STATUS AGE
gke-cluster-1-09071288-node-7mmj Ready 1m
gke-cluster-1-09071288-node-buym Ready 1m
gke-cluster-1-09071288-node-demf Ready 1m
[centos@public-kubernetes-gw ~]$ kubectl run nginx --image nginx --port=80
deployment "nginx" created
[centos@public-kubernetes-gw ~]$ kubectl describe pods -l run | grep Node
Node: gke-cluster-1-09071288-node-7mmj/10.128.0.2
[centos@public-kubernetes-gw ~]$ kubectl scale deployment nginx --replicas=0
deployment "nginx" scaled
[centos@public-kubernetes-gw ~]$ kubectl expose deployment nginx
service "nginx" exposed
[centos@public-kubernetes-gw ~]$ kubectl scale deployment nginx --replicas=3
deployment "nginx" scaled
[centos@public-kubernetes-gw ~]$ kubectl describe pods -l run | grep Node
Node: gke-cluster-1-09071288-node-7mmj/10.128.0.2
Node: gke-cluster-1-09071288-node-demf/10.128.0.4
Node: gke-cluster-1-09071288-node-buym/10.128.0.3
Then I suspended 2 machines via the Google Compute Engine dashboard. These machines were automatically restarted by Google Container Engine.
At a certain moment I reached the following state:
[centos@public-kubernetes-gw ~]$ kubectl get nodes
NAME STATUS AGE
gke-cluster-1-09071288-node-7mmj Ready 1m
gke-cluster-1-09071288-node-buym Ready 4m
gke-cluster-1-09071288-node-demf Ready 22m
[centos@public-kubernetes-gw ~]$ kubectl describe pods -l run | grep Node:
Node: gke-cluster-1-09071288-node-demf/10.128.0.4
Node: gke-cluster-1-09071288-node-demf/10.128.0.4
Node: gke-cluster-1-09071288-node-demf/10.128.0.4
Then Google Container Engine's behavior diverged from that of the kubernetes-core bundle. In fact, Google Container Engine did something very surprising: it suspended the machine on which all 3 replicas were placed:
[centos@public-kubernetes-gw ~]$ kubectl get nodes
NAME STATUS AGE
gke-cluster-1-09071288-node-7mmj Ready 5m
gke-cluster-1-09071288-node-buym Ready 8m
gke-cluster-1-09071288-node-demf NotReady 26m
[centos@public-kubernetes-gw ~]$ kubectl describe pods -l run | grep Node:
Node: gke-cluster-1-09071288-node-demf/10.128.0.4
Node: gke-cluster-1-09071288-node-demf/10.128.0.4
Node: gke-cluster-1-09071288-node-demf/10.128.0.4
At this point the nginx service was unavailable:
[centos@public-kubernetes-gw ~]$ kubectl describe pods -l run | grep IP:
IP:
IP:
IP:
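Another way to confirm the outage at this point, assuming the service was exposed under the name nginx as above, would be to look at the service's endpoints, which stay empty as long as no ready pods back the service:
kubectl get endpoints nginx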
I think Google Container Engine suspended 10.128.0.4 in order to trigger the replication mechanism behind the Deployment to recreate the pods, so that the scheduler could redistribute them (a more compact way to watch this is sketched after the output below). Indeed, after some time:
[centos@public-kubernetes-gw ~]$ kubectl describe pods -l run | grep IP:
IP: 10.0.1.2
IP: 10.0.1.3
IP: 10.0.0.4
[centos@public-kubernetes-gw ~]$ kubectl describe pods -l run | grep Node:
Node: gke-cluster-1-09071288-node-7mmj/10.128.0.2
Node: gke-cluster-1-09071288-node-7mmj/10.128.0.2
Node: gke-cluster-1-09071288-node-buym/10.128.0.3
[centos@public-kubernetes-gw ~]$ kubectl get nodes
NAME STATUS AGE
gke-cluster-1-09071288-node-7mmj Ready 10m
gke-cluster-1-09071288-node-buym Ready 13m
gke-cluster-1-09071288-node-demf NotReady 31m
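As a side note, a more compact way to watch this redistribution, instead of repeatedly grepping the describe output, is the wide pod listing, which includes the node column:
kubectl get pods -l run -o wide -w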
So I first thought this was an issue with Kubernetes. In fact, it is not: it is the default behavior of the Kubernetes scheduler, and this default behavior can be customized, e.g. pod placement can be controlled via node selection (see the sketch below).
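For illustration, a minimal sketch of node selection; the disktype=ssd label is made up for this example:
# Label one of the nodes (label key/value are hypothetical)
kubectl label nodes gke-cluster-1-09071288-node-7mmj disktype=ssd
# Constrain the nginx deployment's pods to nodes carrying that label
kubectl patch deployment nginx -p '{"spec":{"template":{"spec":{"nodeSelector":{"disktype":"ssd"}}}}}'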
I am getting excited about this problem, so it's time for me to put it aside and focus on something else; otherwise I will start spending too much time on it.
In summary, after a night of sleep, I think there is an issue with Kubernetes in the kubernetes-core bundle after all. I had a look at how the Kubernetes scheduler is designed: it is policy-configurable and extensible.
The default policy of the scheduler corresponds to the policy below. As you can see, the priorities in this policy should definitely change a situation where all pods of a service run on a single node while the other nodes are idle. So the question is: why is the scheduler never invoked to change that situation in the kubernetes-core bundle?
In the Google Container Engine scenario, they take an overly drastic approach to triggering the scheduler: shutting down that single node in order to make the scheduler active again. It's like trying to resolve a traffic jam on a highway by hoping traffic distributes across secondary routes: to achieve that you use road signs, you do not shut down the highway. ;)
{
"kind" : "Policy",
"version" : "v1",
"predicates" : [
{"name" : "PodFitsPorts"},
{"name" : "PodFitsResources"},
{"name" : "NoDiskConflict"},
{"name" : "MatchNodeSelector"},
{"name" : "HostName"}
],
"priorities" : [
{"name" : "LeastRequestedPriority", "weight" : 1},
{"name" : "BalancedResourceAllocation", "weight" : 1},
{"name" : "ServiceSpreadingPriority", "weight" : 1}
]
}
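For completeness, a sketch of how such a policy file could be supplied to the scheduler; the file path is an assumption on my side, and how the kubernetes-core charm actually manages scheduler flags may differ:
# Save the policy above as, e.g., /etc/kubernetes/scheduler-policy.json on the
# machine running the scheduler, then point kube-scheduler at it.
kube-scheduler \
  --master=http://127.0.0.1:8080 \
  --policy-config-file=/etc/kubernetes/scheduler-policy.json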
Great investigative work. I haven't had a chance to dive into this, but we will certainly circle back and evaluate your findings.
Thanks a ton for the detailed writeup here :)
This issue has been added as a requirement for the work on a Kubernetes rescheduler.
See: https://github.com/kubernetes/kubernetes/issues/12140
So I think this issue can be closed here.
migrated from an email message from Eddy Truyen
I have an OpenStack environment with 3 Kubernetes units, and a Redis service deployed with 4 replicated pods.
First, when I suspend 2 machines in my OpenStack dashboard, the migration of the pods is really slow. Eventually they were all placed on the remaining machine, where the Kubernetes master is located.
Then, when I resume the two suspended machines in my OpenStack dashboard, the pods are not redistributed across the different machines to ensure fault tolerance of services. I thought the scheduling specification entails that when you create a service and thereafter a replication controller, pods should always be placed on different machines if that is possible. However, this behavior does not occur. Below you find some command output for the redis-slave service, which has 4 replicated pods. It shows that all 4 pods are placed on the same machine. In the attachment you also find the pod description.
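As a sketch of that expectation (the file names and the name=redis-slave label are placeholders for whatever definitions were actually used): the service has to exist before the replication controller's pods are scheduled, so that the scheduler's ServiceSpreadingPriority can spread them, and the resulting placement can then be checked per pod:
kubectl create -f redis-slave-service.yaml      # service first, so spreading can take it into account
kubectl create -f redis-slave-controller.yaml   # then the replication controller
kubectl describe pods -l name=redis-slave | grep Node: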
After this experiment, juju also reported some problems with lost agents (see the attached files juju status.txt and juju debug-log.txt). Then, running the command 'juju status' a second time resulted in a deadlock, or at least nothing happening anymore.
poddescription.txt juju status.txt juju debug-log.txt