cc @kubernetes/sig-scheduling-misc
Just out of curiosity: did you check whether the behavior is the same with the latest stable version (1.5)?
No, I did not. I wanted to confirm first that I am doing the right thing with the affinity rule above; there were no examples so far. I can try with 1.5, but it would be nice to know whether the affinity rules above are sound.
cc/ @kevin-wangzefeng who may have time to look at this before I can
(Sorry, hit close button by accident!)
When converted, the affinity/anti-affinity rules are:
```json
{
  "podAffinity": {
    "preferredDuringSchedulingIgnoredDuringExecution": [{
      "weight": 100,
      "podAffinityTerm": {
        "labelSelector": {
          "matchExpressions": [{
            "key": "pod_label_xyz",
            "operator": "Exists"
          }, {
            "key": "pod_label_xyz",
            "operator": "In",
            "values": ["value-a"]
          }]
        },
        "namespaces": ["sspni-882-frj"],
        "topologyKey": "kubernetes.io/hostname"
      }
    }]
  },
  "podAntiAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [{
      "labelSelector": {
        "matchExpressions": [{
          "key": "pod_label_xyz",
          "operator": "Exists"
        }, {
          "key": "pod_label_xyz",
          "operator": "NotIn",
          "values": ["value-a"]
        }]
      },
      "namespaces": ["sspni-882-frj"],
      "topologyKey": "kubernetes.io/hostname"
    }],
    "preferredDuringSchedulingIgnoredDuringExecution": [{
      "weight": 100,
      "podAffinityTerm": {
        "labelSelector": {
          "matchExpressions": [{
            "key": "pod_label_xyz",
            "operator": "DoesNotExist"
          }]
        },
        "namespaces": ["sspni-882-frj"],
        "topologyKey": "kubernetes.io/hostname"
      }
    }]
  }
}
```
According to your rules, the semantics are:
- affinity:
  - soft: prefer to schedule onto the same node as pods that have the label key "pod_label_xyz" and have the label "pod_label_xyz=value-a"
- anti-affinity:
  - hard: don't schedule onto a node with pods that have the label key "pod_label_xyz" and have the label "pod_label_xyz!=value-a"
  - soft: prefer not to schedule onto a node with pods that don't have the label key "pod_label_xyz"
These are actually a little bit different from the requirement you described. Pods with such rules would just stay together with pods that have the label "pod_label_xyz=value-a".
As for the case you described, let me rephrase here. There are 3 kinds of RCs with pod labels like below:
RC | pod label |
---|---|
RC-a | pod_label_xyz=value-a |
RC-b | pod_label_xyz=value-b |
RC-c | pod_label_xyz=value-c |
And you have 5 nodes, let's say Node1, Node2, Node3, Node4, Node5, and the pods of the RCs are distributed like: Node1: Pod-a1, Node2: Pod-b1, Node3: Pod-c1.
Then, when you try to start a Pod with the affinity/anti-affinity rules shown above, it will be scheduled onto Node1, and if you start even more pods of this kind, they will still all go onto Node1, to satisfy the hard anti-affinity requirement.
Do you expect the Pod to be able to run on Node1, Node4 and Node5? If so, you need to remove the expression "label key pod_label_xyz exists" from the hard anti-affinity rule.
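For concreteness, here is a sketch of what the hard anti-affinity term would look like with that expression removed (same namespace and topologyKey as above; only the `Exists` expression is dropped):

```json
"podAntiAffinity": {
  "requiredDuringSchedulingIgnoredDuringExecution": [{
    "labelSelector": {
      "matchExpressions": [{
        "key": "pod_label_xyz",
        "operator": "NotIn",
        "values": ["value-a"]
      }]
    },
    "namespaces": ["sspni-882-frj"],
    "topologyKey": "kubernetes.io/hostname"
  }]
}
```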
Thank you @kevin-wangzefeng for the above explanation, but this is not the behavior I am seeing in my tests.
Indeed, I do expect the RC-a pod to run on Node1, Node4 and Node5 (as long as Node4 and Node5 do not run any pods with pod_label_xyz!=value-a), and I confirm from my testing that, with the rules shown above, it does so.
My main problem is that when I scale up RC-a, it seems to favour creating the new Pod-a on Node4 and/or Node5, even though Node1 still has enough CPU/memory and, according to the soft affinity & anti-affinity rules, should be preferred.
Please note: the reason I had to add the expression "label key pod_label_xyz exists" to the hard anti-affinity rule is that NOT doing so caused the Pods not to run on any node where another pod was running with no pod_label_xyz label whatsoever. It took me some time to figure this out, but it is actually aligned with the documentation under https://github.com/kubernetes/community/blob/master/contributors/design-proposals/podaffinity.md#anti-affinity:
> Note that this works because "service" NotIn "S" matches pods with no key "service" as well as pods with key "service" and a corresponding value that is not "S."
This means that, if I only had "pod_label_xyz NotIn value-a" in my hard anti-affinity rule, the rule would exclude nodes where pods with "pod_label_xyz!=value-a" are already running, as well as nodes running pods with no pod_label_xyz at all, which is not desired.
At this moment, all the requiredDuringSchedulingIgnoredDuringExecution rules seem to be working as desired with the rule above. The issue is really that the soft preferredDuringSchedulingIgnoredDuringExecution rules seem to be completely ignored.
For the record, what I am trying to do is to have bin-stacking logic for a subset of services labeled with different pools of pod_label_xyz.
Thank you again
Please note: I tried with an even simpler example where I only had a single soft inter-pod affinity:
"scheduler.alpha.kubernetes.io/affinity": "{\"podAffinity\":{\"preferredDuringSchedulingIgnoredDuringExecution\":[ {\"weight\":100,\"podAffinityTerm\":{\"labelSelector\":{\"matchExpressions\":[{\"key\":\"pod_label_xyz\",\"operator\":\"In\",\"values\":[\"value-a\"]}]},\"namespaces\":[\"sspni-882-frj\"],\"topologyKey\":\"kubernetes.io/hostname\"}}]}}"
and I confirm the behavior is the same: when scaling up, K8s ignores the soft affinity and prefers to spread out the pods across several nodes, instead of favoring the one node that already has a pod with label "pod_label_xyz=value-a", even though there are ample resources still available.
Unless I misunderstood how this feature works or mis-set my soft rule (the hard ones work well though), I would say the issue is with K8s 1.4. I will try with 1.5 soon, but please confirm whether my understanding is correct.
thank you
@faresj , here is what should have happened in your case:
RC | pod label | Pods |
---|---|---|
RC-a | pod_label_xyz=value-a | 1 pod on Node1 |
RC-b | pod_label_xyz=value-b | 1 pod on Node2 |
RC-c | pod_label_xyz=value-c | 1 pod on Node3 |
Because of the anti-affinity `pod_label_xyz exists`, Node1, Node2 and Node3 are not included in the priority phase (soft) for the new Pod: each of those nodes has one pod with `pod_label_xyz`.
For your latest case (the new Pod without anti-affinity), was RC-a's pod on Node1 not deleted? If not, the scheduler will also skip Node1, because it checks existing Pods' anti-affinity constraints: the `pod_label_xyz exists` anti-affinity of the pod on Node1 will make Node1 unsuitable for the new pod.
And for your target:

> I am trying to get K8s to cluster together pods of the same service on the same node as much as possible (i.e., only go to the next node if it is not possible to put more on the node where the service already is)

I think @kevin-wangzefeng's suggestion will help: remove the expression "label key pod_label_xyz exists" from the hard anti-affinity rule. BTW, please delete the previous RCs/Pods first.
@faresj , sorry, I just found that I made a mistake in my prior reply:
> These are actually a little bit different from the requirement you described. Pods with such rules would just stay together with pods that have the label "pod_label_xyz=value-a".
The hard anti-affinity rule you gave actually means "don't schedule a pod onto any node that runs pods with label pod_label_xyz!=value-a". Equivalently, the pod can be scheduled onto nodes with pods that have the label "pod_label_xyz=value-a", or pods that don't have the label key "pod_label_xyz" at all.
Besides, one more thing I'd like to mention is that soft affinity/anti-affinity requirements are usually not guaranteed -- there are several priority functions that affect the node ranking.
Starting with some simple cases for comparison should be helpful to figure out whether something is wrong, e.g. the pod distribution without anti-affinity rules vs. with anti-affinity rules. (Please note that there is another default soft anti-affinity that tries to spread pods from the same service across different nodes.)
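As a sketch, one way to observe the resulting distribution in each run (assuming the namespace and label used in this thread):

```sh
# Where did the RC-a pods land? (-o wide shows the NODE column)
kubectl get pods -n sspni-882-frj -l pod_label_xyz=value-a -o wide

# Full picture of the namespace, sorted by node for easy comparison
kubectl get pods -n sspni-882-frj -o wide --sort-by=.spec.nodeName
```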
You may also paste the YAMLs of the RC/Pod, the commands, etc. that you used in the test, in case anything important is missing from your steps.
> The hard anti-affinity rule you gave actually means "don't schedule a pod onto any node that runs pods with label pod_label_xyz!=value-a". Equivalently, the pod can be scheduled onto nodes with pods that have the label "pod_label_xyz=value-a", or pods that don't have the label key "pod_label_xyz" at all.
@kevin-wangzefeng , I think the hard anti-affinity "pod_label_xyz exists" makes the scheduler unable to dispatch a new Pod of RC-a to Node1, and similarly for Node2 & Node3. So the new pods were dispatched to Node4 & Node5.
@k82cn & @kevin-wangzefeng thank you for the replies.
@k82cn, Regarding
> Because of the anti-affinity `pod_label_xyz exists`, Node1, Node2 and Node3 are not included in the priority phase (soft) for the new Pod: each of those nodes has one pod with `pod_label_xyz`.
What I am seeing is that when I increase replicas to 2 in RC-a, a new pod is indeed scheduled on Node1, but when I increase it to 3, 4, 5, etc., it goes to Node4, then Node5, and then Node1 again.
I also confirm that in all my testing I deleted the entire RC and all pods before proceeding. I also confirm that I used:
"scheduler.alpha.kubernetes.io/affinity": "{\"podAffinity\":{\"preferredDuringSchedulingIgnoredDuringExecution\":[ {\"weight\":100,\"podAffinityTerm\":{\"labelSelector\":{\"matchExpressions\":[{\"key\":\"pod_label_xyz\",\"operator\":\"In\",\"values\":[\"value-a\"]}]},\"namespaces\":[\"sspni-882-frj\"],\"topologyKey\":\"kubernetes.io/hostname\"}}]}}"
You can see below the relevant section of the RC-a manifest, with the simplest affinity rule:
```json
{
  "apiVersion": "v1",
  "spec": {
    "template": {
      "metadata": {
        "annotations": {
          "pod.beta.kubernetes.io/init-containers": "[{ \"name\": \"pack-dl-init\", \"image\": \"mobi-registry.nuance.com:5000/build-ncs/packdownload:1.0.0.21\",\"env\":[{ \"name\":\"DESTINATION_PACK_PATH\",\"value\":\"/tmp/deleteme\"},{\"name\":\"PACKS_LIST\",\"value\":\"mrecs3config:s3config.ncs63.r6.127086;mrecdatapack:ssa-evp-PBX.ncs63.r1.UNKNOWN.126007\"},{ \"name\":\"ARTIFACTORY_BASE_URL\",\"value\":\"http://10.10.10.10:8081/artifactory/ncs-packs\"},{ \"name\":\"CLEAN_UNUSED_PACKS\",\"value\":\"True\"},{ \"name\":\"WAIT_INFINITELY_ONCE_DONE\",\"value\":\"False\"},{\"name\":\"ARTIFACTORY_API_SEARCH_URL\",\"value\":\"http://10.10.10.10:8081/artifactory/api/search/artifact?name=\"}], \"volumeMounts\":[{\"name\":\"packs-working-stages\",\"mountPath\":\"/tmp/deleteme\"}], \"command\": [\"/bin/sh\", \"-c\", \"python3 /root/run.py\"]\n} ]",
          "sspni-info-scheduling-explanation-pbss_mtps": "Bin-Stacking: MUST EXCLUDE nodes where a DIFFERENT pod_label_xyz already exists and is running, AND try to pick a node where the same pod_label_xyz is already running if possible and try NOT to pick a node where no pod_label_xyz whatsoever is running.",
          "scheduler.alpha.kubernetes.io/affinity": "{\"podAffinity\":{\"preferredDuringSchedulingIgnoredDuringExecution\":[ {\"weight\":100,\"podAffinityTerm\":{\"labelSelector\":{\"matchExpressions\":[{\"key\":\"pod_label_xyz\",\"operator\":\"In\",\"values\":[\"value-a\"]}]},\"namespaces\":[\"sspni-882-frj\"],\"topologyKey\":\"kubernetes.io/hostname\"}}]}}"
        },
        "labels": {
          "pod_label_xyz": "value-a",
          "name": "value-a",
          "app": "value-a"
        }
      },
      "spec": {
        "nodeSelector": {
          "sspni-882-frj.node_pool": "generic"
        },
        "hostPID": true,
        "volumes": [
          {
            "name": "log-directory",
            "hostPath": {
              "path": "/var/opt/nuance/kubernetes/volumes/sspni-882-frj/logs"
            }
          },
          {
            "name": "packs-working-stages",
            "hostPath": {
              "path": "/var/opt/nuance/kubernetes/volumes/sspni-882-frj/pools-working-stages/value-a/STAGE-A/pbss"
            }
          },
          {
            "name": "memory",
            "hostPath": {
              "path": "/dev/shm/sspni-882-frj/value-a"
            }
          },
          {
            "name": "cache",
            "emptyDir": {
              "medium": "Memory"
            }
          }
        ],
        "containers": [
          {
            "readinessProbe": {
              "timeoutSeconds": 5,
              "initialDelaySeconds": 30,
              "httpGet": {
                "path": "/status/isReady",
                "port": 4700
              }
            },
            "name": "value-a",
            "env": [
              {
                "name": "POD_ID",
                "valueFrom": {
                  "fieldRef": {
                    "fieldPath": "metadata.name"
                  }
                }
              },
              {
                "value": "",
                "name": "APP_CONFIG"
              }
            ],
            "livenessProbe": {
              "timeoutSeconds": 60,
              "initialDelaySeconds": 900,
              "httpGet": {
                "path": "/status/isAlive",
                "port": 4700
              }
            },
            "image": "mobi-registry.nuance.com:5000/build-ncs/speech-server-s3:7.0.000.61",
            "volumeMounts": [
              {
                "readOnly": false,
                "name": "memory",
                "mountPath": "/dev/shm"
              },
              {
                "readOnly": true,
                "name": "packs-working-stages",
                "mountPath": "/var/opt/nuance/ncs/datapacks"
              },
              {
                "readOnly": false,
                "name": "cache",
                "mountPath": "/opt/nuance/ncs/speech-server/var/cache/cache"
              },
              {
                "readOnly": false,
                "name": "log-directory",
                "mountPath": "/var/log/nuance/ncs/logs"
              }
            ],
            "ports": [
              {
                "containerPort": 4700
              },
              {
                "containerPort": 4600
              },
              {
                "containerPort": 4500
              },
              {
                "containerPort": 4400
              },
              {
                "containerPort": 4300
              },
              {
                "protocol": "UDP",
                "containerPort": 4200
              }
            ],
            "args": [
              ""
            ],
            "resources": {
              "requests": {
                "cpu": "1"
              },
              "limits": {
                "cpu": "2"
              }
            }
          }
        ],
        "terminationGracePeriodSeconds": 900
      }
    },
    "selector": {
      "name": "value-a",
      "app": "value-a"
    },
    "replicas": 1
  },
  "kind": "ReplicationController",
  "metadata": {
    "namespace": "sspni-882-frj",
    "annotations": {
      "deployed_pool_current_working_stage_criteria": "mrecs3config:s3config.ncs63.r6.127086;mrecdatapack:ssa-evp-PBX.ncs63.r1.UNKNOWN.126007",
      "deployed_pool_current_working_stage_force_cleanup": "false",
      "deployment_order": "10",
      "deploy_timeout_secs": "613",
      "deployed_pool_directory": "pbss",
      "deployed_pool_unique_id": "value-a",
      "deployed_pool_current_working_stage_id": "STAGE-A"
    },
    "name": "value-a-0e9d46b12dd38537911be29b53ad3b52",
    "labels": {
      "name": "value-a",
      "app": "value-a"
    }
  }
}
```
@kevin-wangzefeng, regarding these previous comments from you:
> Besides, one more thing I'd like to mention is that soft affinity/anti-affinity requirements are usually not guaranteed -- there are several priority functions that affect the node ranking.
> Please note that there is another default soft anti-affinity that tries to spread pods from the same service across different nodes.
I figured that there may be other, more important default ranking rules working against my objective of clustering the pods of a particular service on a node once a single pod of that service is already there. Is there any way to disable/control/override/overpower this in K8s?
Basically, is there any way to get the behavior I desire in K8s (to cluster together services of the same type on a node before expanding to new nodes) using the affinity/anti-affinity rule feature?
Side notes:
Thank you for your help.
Regards
> What I am seeing is that when I increase replicas to 2 in RC-a, a new pod is indeed scheduled on Node1, but when I increase it to 3, 4, 5, etc., it goes to Node4, then Node5, and then Node1 again.
Interesting, let me try your profile to see what happened :).
Hey @k82cn , have you had the chance to look at Fares' profile? We are in the process of upgrading our dev cluster to 1.5.3 to get somewhat closer to the official release. This should reduce the variables in this investigation.
Thanks.
@djsly , I did not get a chance to check it yet. Please try 1.5.3 and let me know if there is anything I can help with then :).
I have the same issue on 1.5.2; my deployment configuration is:
```yaml
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: ingress-nginx
  namespace: default
  labels:
    k8s-addon: ingress-nginx.addons.k8s.io
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: ingress-nginx
        k8s-addon: ingress-nginx.addons.k8s.io
      annotations:
        scheduler.alpha.kubernetes.io/affinity: >
          {
            "podAntiAffinity": {
              "preferredDuringSchedulingIgnoredDuringExecution": [{
                "labelSelector": {
                  "matchExpressions": [
                    { "key": "app", "operator": "In", "values": ["ingress-nginx"] }
                  ]
                },
                "topologyKey": "kubernetes.io/hostname",
                "weight": 1
              }]
            }
          }
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - image: gcr.io/google_containers/nginx-ingress-controller:0.8.3
          name: ingress-nginx
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 80
              protocol: TCP
            - name: https
              containerPort: 443
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /healthz
              port: 10254
              scheme: HTTP
            initialDelaySeconds: 30
            timeoutSeconds: 5
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          args:
            - /nginx-ingress-controller
            - --default-backend-service=$(POD_NAMESPACE)/nginx-default-backend
            - --nginx-configmap=$(POD_NAMESPACE)/ingress-nginx
```
I am trying to spread the nginx ingress containers across hosts; however, after the update, 2 pods were scheduled on the same node.
Update: after changing from `preferred` to `required`, all the pods failed at first with `pod failed to fit in any node; fit failure summary on nodes: MatchInterPodAffinity (3), PodToleratesNodeTaints (1)`.
Then, once every 30-40 seconds, the containers started, each one correctly on a different node. But still, `preferred` should work, because with `required` I can't schedule more pods than the number of nodes.
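One detail worth noting about the annotation above (an observation, not a confirmed fix): the preferred term uses `weight: 1`, the minimum of the allowed 1-100 range, so other priority functions can easily outweigh it; also, in the PodAntiAffinity API, preferred entries nest the term under `podAffinityTerm`, as in the annotations earlier in this thread. A sketch of the same rule with both adjustments:

```json
"podAntiAffinity": {
  "preferredDuringSchedulingIgnoredDuringExecution": [{
    "weight": 100,
    "podAffinityTerm": {
      "labelSelector": {
        "matchExpressions": [
          { "key": "app", "operator": "In", "values": ["ingress-nginx"] }
        ]
      },
      "topologyKey": "kubernetes.io/hostname"
    }
  }]
}
```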
Hi, so an update on this: I FINALLY GOT THE DESIRED BEHAVIOR... :)
I am still using K8s 1.4.8 (will test to confirm on 1.5.3 soon); it seems I needed to put more weight on "InterPodAffinityPriority" vs. "SelectorSpreadPriority", as per https://github.com/kubernetes/community/blob/master/contributors/devel/scheduler.md#modifying-policies
I changed /etc/kubernetes/scheduler to pass:
KUBE_SCHEDULER_ARGS="--leader-elect=true --policy-config-file=/frj_scheduler_policy.json --feature-gates=AllAlpha=true"
where the frj_scheduler_policy.json content is:
```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "NoDiskConflict"},
    {"name": "GeneralPredicates"},
    {"name": "PodToleratesNodeTaints"},
    {"name": "CheckNodeMemoryPressure"},
    {"name": "CheckNodeDiskPressure"},
    {"name": "NoVolumeZoneConflict"},
    {"name": "MatchInterPodAffinity"},
    {"name": "PodFitsHostPorts"},
    {"name": "PodFitsResources"},
    {"name": "MatchNodeSelector"},
    {"name": "HostName"}
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "BalancedResourceAllocation", "weight": 1},
    {"name": "NodePreferAvoidPodsPriority", "weight": 10000},
    {"name": "NodeAffinityPriority", "weight": 1},
    {"name": "TaintTolerationPriority", "weight": 1},
    {"name": "SelectorSpreadPriority", "weight": 1},
    {"name": "InterPodAffinityPriority", "weight": 1500}
  ]
}
```
My tests now show that pods of RC-a are always added to the same node where an RC-a pod is already running, AND only when the CPU capacity is exceeded will the next pod either remain pending (not be scheduled) or get scheduled on a node where no RC-b or RC-c pods are running.
Note: this is with the original affinity/anti-affinity rule:
"scheduler.alpha.kubernetes.io/affinity": "{\"podAffinity\":{\"preferredDuringSchedulingIgnoredDuringExecution\":[{\"weight\":100,\"podAffinityTerm\":{\"labelSelector\":{\"matchExpressions\":[{\"key\":\"pod_label_xyz\",\"operator\":\"Exists\"},{\"key\":\"pod_label_xyz\",\"operator\":\"In\",\"values\":[\"value-a\"]}]},\"namespaces\":[\"sspni-882-frj\"],\"topologyKey\":\"kubernetes.io/hostname\"}}]} , \"podAntiAffinity\":{\"requiredDuringSchedulingIgnoredDuringExecution\":[{\"labelSelector\":{\"matchExpressions\":[{\"key\":\"pod_label_xyz\",\"operator\":\"Exists\"},{\"key\":\"pod_label_xyz\",\"operator\":\"NotIn\",\"values\":[\"value-a\"]}]},\"namespaces\":[\"sspni-882-frj\"],\"topologyKey\":\"kubernetes.io/hostname\"}] , \"preferredDuringSchedulingIgnoredDuringExecution\":[{\"weight\":100,\"podAffinityTerm\":{\"labelSelector\":{\"matchExpressions\":[{\"key\":\"pod_label_xyz\",\"operator\":\"DoesNotExist\"}]},\"namespaces\":[\"sspni-882-frj\"],\"topologyKey\":\"kubernetes.io/hostname\"}}] }}"
This now mimics the same weights attributed to the predicates and priorities in the code (https://github.com/kubernetes/kubernetes/blob/03837fe6075590aea3a71d36a93b14cf9ce8e7b3/plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go), except for the changed weight on "InterPodAffinityPriority", which is not exercised by anyone else for now.
The obvious concern is:
Question: Is there any reason not to have the "InterPodAffinityPriority" weight set to a higher value by default? It seems more appropriate for cases where someone wants to explicitly override the "spread out" policy, and harmless if no affinity is set. Thank you all for the help!
@faresj
> Question: Is there any reason not to have the "InterPodAffinityPriority" weight set to a higher value by default? It seems more appropriate for cases where someone wants to explicitly override the "spread out" policy, and harmless if no affinity is set.
It's just that we don't have much background data as input to set a higher/more appropriate default value for the "InterPodAffinityPriority" weight compared to the other priorities.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with `/remove-lifecycle stale`.
Stale issues rot after an additional 30d of inactivity and eventually close.
Prevent issues from auto-closing with an `/lifecycle frozen` comment.
If this issue is safe to close now please do so with `/close`.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with `/remove-lifecycle rotten`.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with `/close`.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale
Hello, I am experimenting with K8s 1.4 pod affinity/anti-affinity. I am trying to get K8s to cluster together pods of the same service on the same node as much as possible (i.e., only go to the next node if it is not possible to put more on the node where the service already is). To do so, I set up:
1. a hard (requiredDuringScheduling) anti-affinity to exclude running where a different service is already running (pod_label_xyz not in [value-a])
2. a soft (preferredDuringScheduling) affinity to try to run where the same service is running (pod_label_xyz in [value-a]) - weight 100
3. a soft (preferredDuringScheduling) anti-affinity to try not to run where the same service is not already running (pod_label_xyz does not exist) - weight 100
What I see from testing is that, with 5 nodes and 3 services (pod_label_xyz in value-a, value-b, value-c) each having 1 pod created by a replication controller, the first pods get scheduled properly, and when scaling any of them up, the 1st hard rule is respected by K8s... BUT the 2nd and 3rd soft rules (the 3rd actually being redundant with the 2nd) are not respected. I see that when I scale up, K8s tries to push pods to an empty node (not used by any other service) even though there is capacity to schedule more where the service is already running. In fact, if I scale up even more, new pods get created on the original node as well as on the new (previously unused) nodes.
Please advise if I am missing something.
Thank you
Here is the annotation I used: