cc @kubernetes/sig-scheduling-misc
Just out of curiosity: did you check whether the behavior is the same with the latest stable version (1.5)?
No, I did not. I wanted to confirm first that I am doing the right thing with the affinity rule above; there were no examples so far. I can try with 1.5, but it would be nice to know whether the affinity rules above are sound.
cc/ @kevin-wangzefeng who may have time to look at this before I can
(Sorry, hit close button by accident!)
When converted, the affinity/anti-affinity rules are:
```json
{
  "podAffinity": {
    "preferredDuringSchedulingIgnoredDuringExecution": [{
      "weight": 100,
      "podAffinityTerm": {
        "labelSelector": {
          "matchExpressions": [{
            "key": "pod_label_xyz",
            "operator": "Exists"
          }, {
            "key": "pod_label_xyz",
            "operator": "In",
            "values": ["value-a"]
          }]
        },
        "namespaces": ["sspni-882-frj"],
        "topologyKey": "kubernetes.io/hostname"
      }
    }]
  },
  "podAntiAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [{
      "labelSelector": {
        "matchExpressions": [{
          "key": "pod_label_xyz",
          "operator": "Exists"
        }, {
          "key": "pod_label_xyz",
          "operator": "NotIn",
          "values": ["value-a"]
        }]
      },
      "namespaces": ["sspni-882-frj"],
      "topologyKey": "kubernetes.io/hostname"
    }],
    "preferredDuringSchedulingIgnoredDuringExecution": [{
      "weight": 100,
      "podAffinityTerm": {
        "labelSelector": {
          "matchExpressions": [{
            "key": "pod_label_xyz",
            "operator": "DoesNotExist"
          }]
        },
        "namespaces": ["sspni-882-frj"],
        "topologyKey": "kubernetes.io/hostname"
      }
    }]
  }
}
```
According to your rules, the semantics are:
- affinity:
  - soft: prefer to schedule onto the same node as pods that have the label key "pod_label_xyz" and have the label "pod_label_xyz=value-a"
- anti-affinity:
  - hard: don't schedule onto a node with pods that have the label key "pod_label_xyz" and have the label "pod_label_xyz!=value-a"
  - soft: prefer not to schedule onto a node with pods that don't have the label key "pod_label_xyz"
These are actually a little bit different from the requirement you described. Pods with such rules would just stay together with pods that have the label "pod_label_xyz=value-a".
As for the case you described, let me rephrase here. There are 3 kinds of RCs with pod labels like below:
RC | pod label |
---|---|
RC-a | pod_label_xyz=value-a |
RC-b | pod_label_xyz=value-b |
RC-c | pod_label_xyz=value-c |
And you have 5 nodes, let's say Node1, Node2, Node3, Node4, Node5, and the pods of the RCs are distributed like: Node1: Pod-a1, Node2: Pod-b1, Node3: Pod-c1.
Then, when you try to start a Pod with the affinity/anti-affinity rules shown above, it will be scheduled onto Node1, and if you start even more pods of this kind, they will still all go onto Node1, to satisfy the hard anti-affinity requirement.
Do you expect the Pod to be able to run on Node1, Node4 and Node5? If so, you need to remove the expression "label key pod_label_xyz exists" from the hard anti-affinity rule.
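For concreteness, here is a sketch of what the hard anti-affinity term would look like with that expression removed (same namespace and topologyKey as above; only the `Exists` expression is dropped):

```json
"podAntiAffinity": {
  "requiredDuringSchedulingIgnoredDuringExecution": [{
    "labelSelector": {
      "matchExpressions": [{
        "key": "pod_label_xyz",
        "operator": "NotIn",
        "values": ["value-a"]
      }]
    },
    "namespaces": ["sspni-882-frj"],
    "topologyKey": "kubernetes.io/hostname"
  }]
}
```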
Thank you @kevin-wangzefeng for the above explanation, but this is not the behavior I am seeing in my tests.
Indeed, I do expect the RC-a pod to run on Node1, Node4 and Node5 (as long as Node4 and Node5 do not run any pods with pod_label_xyz!=value-a), and I confirm from my testing that, with the rules shown above, it does so.
My main problem is that when I scale up RC-a, it seems to favour creating the new Pod-a on Node4 and/or Node5, even though Node1 still has enough CPU/memory and, according to the soft affinity & anti-affinity rules, should be preferred.
Please note: the reason I had to add the expression "label key pod_label_xyz exists" to the hard anti-affinity rule is that NOT doing so caused the Pods not to run on any node where another pod was running with no pod_label_xyz label whatsoever. It took me some time to figure this out, but it is actually aligned with the documentation under https://github.com/kubernetes/community/blob/master/contributors/design-proposals/podaffinity.md#anti-affinity:
> Note that this works because "service" NotIn "S" matches pods with no key "service" as well as pods with key "service" and a corresponding value that is not "S."
This means that, if I only had "pod_label_xyz NotIn value-a" in my hard anti-affinity rule, the rule would exclude nodes where pods with "pod_label_xyz!=value-a" are already running, as well as nodes running pods with no pod_label_xyz at all, which is not desired.
At this moment, all the requiredDuringSchedulingIgnoredDuringExecution rules seem to be working as desired with the rule above. The issue is really that the soft preferredDuringSchedulingIgnoredDuringExecution rules seem to be completely ignored.
For the record, what I am trying to do is to have bin-stacking logic for a subset of services labeled with different pools of pod_label_xyz.
Thank you again
Please note: I tried with an even simpler example where I only had a single soft inter-pod affinity:
"scheduler.alpha.kubernetes.io/affinity": "{\"podAffinity\":{\"preferredDuringSchedulingIgnoredDuringExecution\":[ {\"weight\":100,\"podAffinityTerm\":{\"labelSelector\":{\"matchExpressions\":[{\"key\":\"pod_label_xyz\",\"operator\":\"In\",\"values\":[\"value-a\"]}]},\"namespaces\":[\"sspni-882-frj\"],\"topologyKey\":\"kubernetes.io/hostname\"}}]}}"
and I confirm the behavior is the same: when scaling up, K8s ignores the soft affinity and prefers to spread out the pods across several nodes, instead of favoring the one node that already has a pod with label "pod_label_xyz=value-a", even though there are ample resources still available.
Unless I misunderstood how this feature works or mis-set my soft rule (the hard ones work well though), I would say the issue is with K8s 1.4. I will try with 1.5 soon, but please confirm whether my understanding is correct.
thank you
@faresj , here is what should have happened in your case:
RC | pod label | Pods |
---|---|---|
RC-a | pod_label_xyz=value-a | 1 pod on Node1 |
RC-b | pod_label_xyz=value-b | 1 pod on Node2 |
RC-c | pod_label_xyz=value-c | 1 pod on Node3 |
Because of the anti-affinity `pod_label_xyz exists`, Node1, Node2 and Node3 are not included in the priority phase (soft) for the new Pod: each of those nodes has one pod with `pod_label_xyz`.
For your latest case (the new Pod without anti-affinity), was RC-a's pod on Node1 not deleted? If not, the scheduler will also skip Node1, because it checks existing Pods' anti-affinity constraints: the `pod_label_xyz exists` anti-affinity of the pod on Node1 will make Node1 unsuitable for the new pod.
And for your target:

> I am trying to get K8s to cluster together pods of the same service on the same node as much as possible (i.e., only go to the next node if it is not possible to put more on the node where the service already is)

I think @kevin-wangzefeng's suggestion will help: remove the expression "label key pod_label_xyz exists" from the hard anti-affinity rule. BTW, please delete the previous RCs/Pods first.
@faresj , sorry, I just found that I made a mistake in my prior reply:
> These are actually a little bit different from the requirement you described. Pods with such rules would just stay together with pods that have the label "pod_label_xyz=value-a".
The hard anti-affinity rule you gave actually means "don't schedule a pod onto any node that runs pods with label pod_label_xyz!=value-a". Equivalently, the pod can be scheduled onto nodes with pods that have the label "pod_label_xyz=value-a", or pods that don't have the label key "pod_label_xyz" at all.
Besides, one more thing I'd like to mention is that soft affinity/anti-affinity requirements are usually not guaranteed -- there are several priority functions that affect the node ranking.
Starting with some simple cases for comparison should be helpful to figure out whether something is wrong, e.g. the pod distribution without anti-affinity rules vs. with anti-affinity rules. (Please note that there is another default soft anti-affinity that tries to spread pods from the same service across different nodes.)
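As a sketch, one way to observe the resulting distribution in each run (assuming the namespace and label used in this thread):

```sh
# Where did the RC-a pods land? (-o wide shows the NODE column)
kubectl get pods -n sspni-882-frj -l pod_label_xyz=value-a -o wide

# Full picture of the namespace, sorted by node for easy comparison
kubectl get pods -n sspni-882-frj -o wide --sort-by=.spec.nodeName
```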
You may also paste the YAMLs of the RC/Pod, the commands, etc. that you used in the test, in case anything important is missing from your steps.
> The hard anti-affinity rule you gave actually means "don't schedule a pod onto any node that runs pods with label pod_label_xyz!=value-a". Equivalently, the pod can be scheduled onto nodes with pods that have the label "pod_label_xyz=value-a", or pods that don't have the label key "pod_label_xyz" at all.
@kevin-wangzefeng , I think the hard anti-affinity "pod_label_xyz exists" makes the scheduler unable to dispatch a new Pod of RC-a to Node1, and similarly for Node2 & Node3. So the new pods were dispatched to Node4 & Node5.
@k82cn & @kevin-wangzefeng thank you for the replies.
@k82cn, Regarding
> Because of the anti-affinity `pod_label_xyz exists`, Node1, Node2 and Node3 are not included in the priority phase (soft) for the new Pod: each of those nodes has one pod with `pod_label_xyz`.
What I am seeing is that when I increase replicas to 2 in RC-a, a new pod is indeed scheduled on Node1, but when I increase it to 3, 4, 5, etc., it goes to Node4, then Node5, and then Node1 again.
I also confirm that in all my testing I deleted the entire RC and all pods before proceeding. I also confirm that I used:
"scheduler.alpha.kubernetes.io/affinity": "{\"podAffinity\":{\"preferredDuringSchedulingIgnoredDuringExecution\":[ {\"weight\":100,\"podAffinityTerm\":{\"labelSelector\":{\"matchExpressions\":[{\"key\":\"pod_label_xyz\",\"operator\":\"In\",\"values\":[\"value-a\"]}]},\"namespaces\":[\"sspni-882-frj\"],\"topologyKey\":\"kubernetes.io/hostname\"}}]}}"
You can see below the relevant section of the RC-a manifest, with the simplest affinity rule:
```json
{
  "apiVersion": "v1",
  "spec": {
    "template": {
      "metadata": {
        "annotations": {
          "pod.beta.kubernetes.io/init-containers": "[{ \"name\": \"pack-dl-init\", \"image\": \"mobi-registry.nuance.com:5000/build-ncs/packdownload:1.0.0.21\",\"env\":[{ \"name\":\"DESTINATION_PACK_PATH\",\"value\":\"/tmp/deleteme\"},{\"name\":\"PACKS_LIST\",\"value\":\"mrecs3config:s3config.ncs63.r6.127086;mrecdatapack:ssa-evp-PBX.ncs63.r1.UNKNOWN.126007\"},{ \"name\":\"ARTIFACTORY_BASE_URL\",\"value\":\"http://10.10.10.10:8081/artifactory/ncs-packs\"},{ \"name\":\"CLEAN_UNUSED_PACKS\",\"value\":\"True\"},{ \"name\":\"WAIT_INFINITELY_ONCE_DONE\",\"value\":\"False\"},{\"name\":\"ARTIFACTORY_API_SEARCH_URL\",\"value\":\"http://10.10.10.10:8081/artifactory/api/search/artifact?name=\"}], \"volumeMounts\":[{\"name\":\"packs-working-stages\",\"mountPath\":\"/tmp/deleteme\"}], \"command\": [\"/bin/sh\", \"-c\", \"python3 /root/run.py\"]\n} ]",
          "sspni-info-scheduling-explanation-pbss_mtps": "Bin-Stacking: MUST EXCLUDE nodes where a DIFFERENT pod_label_xyz already exists and is running, AND try to pick a node where the same pod_label_xyz is already running if possible and try NOT to pick a node where no pod_label_xyz whatsoever is running.",
          "scheduler.alpha.kubernetes.io/affinity": "{\"podAffinity\":{\"preferredDuringSchedulingIgnoredDuringExecution\":[ {\"weight\":100,\"podAffinityTerm\":{\"labelSelector\":{\"matchExpressions\":[{\"key\":\"pod_label_xyz\",\"operator\":\"In\",\"values\":[\"value-a\"]}]},\"namespaces\":[\"sspni-882-frj\"],\"topologyKey\":\"kubernetes.io/hostname\"}}]}}"
        },
        "labels": {
          "pod_label_xyz": "value-a",
          "name": "value-a",
          "app": "value-a"
        }
      },
      "spec": {
        "nodeSelector": {
          "sspni-882-frj.node_pool": "generic"
        },
        "hostPID": true,
        "volumes": [
          {
            "name": "log-directory",
            "hostPath": {
              "path": "/var/opt/nuance/kubernetes/volumes/sspni-882-frj/logs"
            }
          },
          {
            "name": "packs-working-stages",
            "hostPath": {
              "path": "/var/opt/nuance/kubernetes/volumes/sspni-882-frj/pools-working-stages/value-a/STAGE-A/pbss"
            }
          },
          {
            "name": "memory",
            "hostPath": {
              "path": "/dev/shm/sspni-882-frj/value-a"
            }
          },
          {
            "name": "cache",
            "emptyDir": {
              "medium": "Memory"
            }
          }
        ],
        "containers": [
          {
            "readinessProbe": {
              "timeoutSeconds": 5,
              "initialDelaySeconds": 30,
              "httpGet": {
                "path": "/status/isReady",
                "port": 4700
              }
            },
            "name": "value-a",
            "env": [
              {
                "name": "POD_ID",
                "valueFrom": {
                  "fieldRef": {
                    "fieldPath": "metadata.name"
                  }
                }
              },
              {
                "value": "",
                "name": "APP_CONFIG"
              }
            ],
            "livenessProbe": {
              "timeoutSeconds": 60,
              "initialDelaySeconds": 900,
              "httpGet": {
                "path": "/status/isAlive",
                "port": 4700
              }
            },
            "image": "mobi-registry.nuance.com:5000/build-ncs/speech-server-s3:7.0.000.61",
            "volumeMounts": [
              {
                "readOnly": false,
                "name": "memory",
                "mountPath": "/dev/shm"
              },
              {
                "readOnly": true,
                "name": "packs-working-stages",
                "mountPath": "/var/opt/nuance/ncs/datapacks"
              },
              {
                "readOnly": false,
                "name": "cache",
                "mountPath": "/opt/nuance/ncs/speech-server/var/cache/cache"
              },
              {
                "readOnly": false,
                "name": "log-directory",
                "mountPath": "/var/log/nuance/ncs/logs"
              }
            ],
            "ports": [
              {
                "containerPort": 4700
              },
              {
                "containerPort": 4600
              },
              {
                "containerPort": 4500
              },
              {
                "containerPort": 4400
              },
              {
                "containerPort": 4300
              },
              {
                "protocol": "UDP",
                "containerPort": 4200
              }
            ],
            "args": [
              ""
            ],
            "resources": {
              "requests": {
                "cpu": "1"
              },
              "limits": {
                "cpu": "2"
              }
            }
          }
        ],
        "terminationGracePeriodSeconds": 900
      }
    },
    "selector": {
      "name": "value-a",
      "app": "value-a"
    },
    "replicas": 1
  },
  "kind": "ReplicationController",
  "metadata": {
    "namespace": "sspni-882-frj",
    "annotations": {
      "deployed_pool_current_working_stage_criteria": "mrecs3config:s3config.ncs63.r6.127086;mrecdatapack:ssa-evp-PBX.ncs63.r1.UNKNOWN.126007",
      "deployed_pool_current_working_stage_force_cleanup": "false",
      "deployment_order": "10",
      "deploy_timeout_secs": "613",
      "deployed_pool_directory": "pbss",
      "deployed_pool_unique_id": "value-a",
      "deployed_pool_current_working_stage_id": "STAGE-A"
    },
    "name": "value-a-0e9d46b12dd38537911be29b53ad3b52",
    "labels": {
      "name": "value-a",
      "app": "value-a"
    }
  }
}
```
@kevin-wangzefeng, regarding these previous comments from you:
> Besides, one more thing I'd like to mention is that soft affinity/anti-affinity requirements are usually not guaranteed -- there are several priority functions that affect the node ranking.
> Please note that there is another default soft anti-affinity that tries to spread pods from the same service across different nodes.
I figured that there may be other, more important default ranking rules working against my objective of clustering the pods of a particular service on a node once a single pod of that service is already there. Is there any way to disable/control/override/overpower this in K8s?
Basically, is there any way to get the behavior I desire in K8s (to cluster together services of the same type on a node before expanding to new nodes) using the affinity/anti-affinity rule feature?
Side notes:
Thank you for your help.
Regards
> What I am seeing is that when I increase replicas to 2 in RC-a, a new pod is indeed scheduled on Node1, but when I increase it to 3, 4, 5, etc., it goes to Node4, then Node5, and then Node1 again.
Interesting, let me try your profile to see what happened :).
Hey @k82cn , have you had the chance to look at Fares' profile? We are in the process of upgrading our dev cluster to 1.5.3 to get somewhat closer to the official release. This should reduce the variables in this investigation.
Thanks.
@djsly , I did not get a chance to check it yet. Please try 1.5.3 and let me know if there is anything I can help with then :).
I have the same issue on 1.5.2; my deployment configuration is:
```yaml
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: ingress-nginx
  namespace: default
  labels:
    k8s-addon: ingress-nginx.addons.k8s.io
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: ingress-nginx
        k8s-addon: ingress-nginx.addons.k8s.io
      annotations:
        scheduler.alpha.kubernetes.io/affinity: >
          {
            "podAntiAffinity": {
              "preferredDuringSchedulingIgnoredDuringExecution": [{
                "labelSelector": {
                  "matchExpressions": [
                    { "key": "app", "operator": "In", "values": ["ingress-nginx"] }
                  ]
                },
                "topologyKey": "kubernetes.io/hostname",
                "weight": 1
              }]
            }
          }
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - image: gcr.io/google_containers/nginx-ingress-controller:0.8.3
          name: ingress-nginx
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 80
              protocol: TCP
            - name: https
              containerPort: 443
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /healthz
              port: 10254
              scheme: HTTP
            initialDelaySeconds: 30
            timeoutSeconds: 5
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          args:
            - /nginx-ingress-controller
            - --default-backend-service=$(POD_NAMESPACE)/nginx-default-backend
            - --nginx-configmap=$(POD_NAMESPACE)/ingress-nginx
```
I am trying to spread the nginx ingress containers across hosts; however, after the update, 2 pods were scheduled on the same node.
Update: after changing from `preferred` to `required`, all the pods failed at first with `pod failed to fit in any node; fit failure summary on nodes: MatchInterPodAffinity (3), PodToleratesNodeTaints (1)`.
Then, once every 30-40 seconds, the containers started, each one correctly on a different node. But still, `preferred` should work, because with `required` I can't schedule more pods than the number of nodes.
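One detail worth noting about the annotation above (an observation, not a confirmed fix): the preferred term uses `weight: 1`, the minimum of the allowed 1-100 range, so other priority functions can easily outweigh it; also, in the PodAntiAffinity API, preferred entries nest the term under `podAffinityTerm`, as in the annotations earlier in this thread. A sketch of the same rule with both adjustments:

```json
"podAntiAffinity": {
  "preferredDuringSchedulingIgnoredDuringExecution": [{
    "weight": 100,
    "podAffinityTerm": {
      "labelSelector": {
        "matchExpressions": [
          { "key": "app", "operator": "In", "values": ["ingress-nginx"] }
        ]
      },
      "topologyKey": "kubernetes.io/hostname"
    }
  }]
}
```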
Hi, so an update on this: I FINALLY GOT THE DESIRED BEHAVIOR... :)
I am still using K8s 1.4.8 (will test to confirm on 1.5.3 soon); it seems I needed to put more weight on "InterPodAffinityPriority" vs. "SelectorSpreadPriority", as per https://github.com/kubernetes/community/blob/master/contributors/devel/scheduler.md#modifying-policies
I changed /etc/kubernetes/scheduler to pass:
KUBE_SCHEDULER_ARGS="--leader-elect=true --policy-config-file=/frj_scheduler_policy.json --feature-gates=AllAlpha=true"
where the frj_scheduler_policy.json content is:
```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "NoDiskConflict"},
    {"name": "GeneralPredicates"},
    {"name": "PodToleratesNodeTaints"},
    {"name": "CheckNodeMemoryPressure"},
    {"name": "CheckNodeDiskPressure"},
    {"name": "NoVolumeZoneConflict"},
    {"name": "MatchInterPodAffinity"},
    {"name": "PodFitsHostPorts"},
    {"name": "PodFitsResources"},
    {"name": "MatchNodeSelector"},
    {"name": "HostName"}
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "BalancedResourceAllocation", "weight": 1},
    {"name": "NodePreferAvoidPodsPriority", "weight": 10000},
    {"name": "NodeAffinityPriority", "weight": 1},
    {"name": "TaintTolerationPriority", "weight": 1},
    {"name": "SelectorSpreadPriority", "weight": 1},
    {"name": "InterPodAffinityPriority", "weight": 1500}
  ]
}
```
My tests now show that pods of RC-a are always added to the same node where an RC-a pod is already running, AND only when the CPU capacity is exceeded will the next pod either remain pending (not be scheduled) or get scheduled on a node where no RC-b or RC-c pods are running.
Note: this is with the original affinity/anti-affinity rule:
"scheduler.alpha.kubernetes.io/affinity": "{\"podAffinity\":{\"preferredDuringSchedulingIgnoredDuringExecution\":[{\"weight\":100,\"podAffinityTerm\":{\"labelSelector\":{\"matchExpressions\":[{\"key\":\"pod_label_xyz\",\"operator\":\"Exists\"},{\"key\":\"pod_label_xyz\",\"operator\":\"In\",\"values\":[\"value-a\"]}]},\"namespaces\":[\"sspni-882-frj\"],\"topologyKey\":\"kubernetes.io/hostname\"}}]} , \"podAntiAffinity\":{\"requiredDuringSchedulingIgnoredDuringExecution\":[{\"labelSelector\":{\"matchExpressions\":[{\"key\":\"pod_label_xyz\",\"operator\":\"Exists\"},{\"key\":\"pod_label_xyz\",\"operator\":\"NotIn\",\"values\":[\"value-a\"]}]},\"namespaces\":[\"sspni-882-frj\"],\"topologyKey\":\"kubernetes.io/hostname\"}] , \"preferredDuringSchedulingIgnoredDuringExecution\":[{\"weight\":100,\"podAffinityTerm\":{\"labelSelector\":{\"matchExpressions\":[{\"key\":\"pod_label_xyz\",\"operator\":\"DoesNotExist\"}]},\"namespaces\":[\"sspni-882-frj\"],\"topologyKey\":\"kubernetes.io/hostname\"}}] }}"
This now mimics the same weights attributed to the predicates and priorities in the code (https://github.com/kubernetes/kubernetes/blob/03837fe6075590aea3a71d36a93b14cf9ce8e7b3/plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go), except for the changed weight on "InterPodAffinityPriority", which is not exercised by anyone else for now.
The obvious concern is:
Question: Is there any reason not to have the "InterPodAffinityPriority" weight set to a higher value by default? It seems more appropriate for cases where someone wants to explicitly override the "spread out" policy, and harmless if no affinity is set. Thank you all for the help!
@faresj
> Question: Is there any reason not to have the "InterPodAffinityPriority" weight set to a higher value by default? It seems more appropriate for cases where someone wants to explicitly override the "spread out" policy, and harmless if no affinity is set.
It's just that we don't have much background data as input to set a higher/more appropriate default value for the "InterPodAffinityPriority" weight compared to the other priorities.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with `/remove-lifecycle stale`.
Stale issues rot after an additional 30d of inactivity and eventually close.
Prevent issues from auto-closing with an `/lifecycle frozen` comment.
If this issue is safe to close now please do so with `/close`.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with `/remove-lifecycle rotten`.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with `/close`.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale
Hello, I am experimenting with K8s 1.4 pod affinity/anti-affinity. I am trying to get K8s to cluster together pods of the same service on the same node as much as possible (i.e., only go to the next node if it is not possible to put more on the node where the service already is). To do so, I set up:
1. a hard (requiredDuringScheduling) anti-affinity to exclude running where a different service is already running (pod_label_xyz not in [value-a])
2. a soft (preferredDuringScheduling) affinity to try to run where the same service is running (pod_label_xyz in [value-a]) - weight 100
3. a soft (preferredDuringScheduling) anti-affinity to try not to run where the same service is not already running (pod_label_xyz does not exist) - weight 100
What I see from testing is that, with 5 nodes and 3 services (pod_label_xyz in value-a, value-b, value-c) each having 1 pod created by a replication controller, the first pods get scheduled properly, and when scaling any of them up, the 1st hard rule is respected by K8s... BUT the 2nd and 3rd soft rules (the 3rd actually being redundant with the 2nd) are not respected. I see that when I scale up, K8s tries to push pods to an empty node (not used by any other service) even though there is capacity to schedule more where the service is already running. In fact, if I scale up even more, new pods get created on the original node as well as on the new (previously unused) nodes.
Please advise if I am missing something.
Thank you
Here is the annotation I used: