jupyterhub / zero-to-jupyterhub-k8s

Helm Chart & Documentation for deploying JupyterHub on Kubernetes
https://zero-to-jupyterhub.readthedocs.io

Making user-placeholders not block scheduling of user pods with image locality preference #1414

Open consideRatio opened 5 years ago

consideRatio commented 5 years ago

@yuvipanda suggested some optimizations relating to the user-placeholder pods. I like these ideas, so I'll elaborate on them here. To understand the optimizations, we first need to understand the issues we currently have; we can do that in a thought experiment!

Thought experiment

  1. An autoscaling k8s cluster has nodes that can fit 10 users and there are 5 user placeholder pods. Currently the first and only node has 7 users and 3 placeholders, and 2 placeholders are pending.

  2. A second node is added as there were pending pods, and is now starting up slowly.

  3. The first new user arrives, and a placeholder is pre-empted on the only available node to make room for the new user. If the user had been able to schedule without pre-empting a placeholder pod, the pre-emption would not have happened.

  4. The second node becomes ready for pod scheduling, and is starting to pull images thanks to the continuous pre-puller daemonset pod that quickly scheduled there. Currently 8 users and 2 placeholders are on the first node, and the second node has 3 placeholders on it.

  5. The second new user arrives, and is scheduled on the most resource-utilized node that still has room, which is now the second node. This node has not yet pulled all the images though, so the user needs to wait. The second node now has 1 user and 3 placeholders on it.

    Optimization opportunity (OO#1): Force scheduling of user pods on nodes with image locality.

  6. The image pre-pulling on the second node by the continuous pre-puller daemonset pod completes.

  7. The third new user arrives, and is scheduled on the second node that now has 2 users and 3 placeholders on it.

    Optimization Opportunity (OO#2): Force scheduling on the most truly busy node where user-placeholder pods shouldn't count. Auto scaling down a node will become easier.

Analysis

  1. The user-placeholder pods help trigger auto scaling of nodes ahead of time by ending up in a pending state after getting pre-empted by user pods filling up the available nodes. Well done user-placeholder pods, and thanks k8s for the pod priority mechanism! We scaled up ahead of time! (A sketch of such priority classes follows after this list.)
  2. When the new node has become available for scheduling, a user pod will no longer pre-empt user-placeholder pods in order to schedule, as it can simply schedule on the new node. The kube-scheduler will not evict unless it is required to in order to schedule the pod. The problem is that this node still isn't ready for the user pods, as they want their images to already be pulled on it.
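A minimal sketch of the kind of priority classes that make this pre-emption possible; the names and values here are illustrative, not the chart's actual resources:

# Sketch only: two priority classes where user pods outrank placeholder pods,
# so a pending user pod can pre-empt a placeholder.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: user-pods             # hypothetical name
value: 0
globalDefault: false
description: "Priority for real user pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: user-placeholders     # hypothetical name
value: -10
globalDefault: false
description: "Lower priority, so placeholders can be pre-empted by user pods"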

Solution ideas

Required node labels

We could indirectly maintain pre-emption of placeholder pods until image locality is available, by making sure ourselves to set node labels that the user pods require.

The reason we need to hack this is that image locality cannot be enforced; it can only improve the scheduler's score of a node when ranking potential nodes to schedule on. The node would need to be filtered out of consideration entirely for the scheduler to consider pre-empting lower priority pods like the placeholder pods.

To hack around this, we could set a hard affinity for the user pod to nodes that have a custom label set on them once the images are available. We could then make the pre-puller daemonset pods maintain this label, for example by removing it in the first init-container and adding it back in the last init-container.
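A minimal sketch of what such a required node affinity on the user pod could look like, assuming a hypothetical hub.jupyter.org/images-pulled label maintained by the pre-puller (nothing in the chart sets this label today):

# Sketch only: the label name is hypothetical.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: hub.jupyter.org/images-pulled
          operator: In
          values:
          - "true"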

Solution discussion

This solution would solve the image locality optimization, but fail with the busy node optimization.

Timed rescheduling of user-placeholder pods

We could somehow systematically reschedule the user-placeholder pods to the least busy user node. Such rescheduling could be done externally to the placeholders, or by the placeholder pods themselves.

user-placeholder pod external rescheduling logic

For example, with a cronjob strategy we could deschedule all the user-placeholder pods at the same time. Another example would be a strategy that deschedules pods after a minute of lifetime.
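A minimal sketch of the cronjob variant, assuming a hypothetical service account that is allowed to delete pods in the namespace (the schedule, image, and names are placeholders):

# Sketch only: deletes all user-placeholder pods on a schedule so they get rescheduled.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: placeholder-descheduler
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: placeholder-descheduler   # hypothetical, needs pod delete permission
          restartPolicy: OnFailure
          containers:
          - name: deschedule
            image: bitnami/kubectl:latest               # any image with kubectl works
            command:
            - kubectl
            - delete
            - pod
            - --selector=component=user-placeholder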

user-placeholder pod internal rescheduling logic

They could also deschedule themselves internally; not by crashing, though, as that only restarts the container, but by speaking with the Kubernetes API server and deleting themselves after a configurable amount of time. For this to work, we would attach an RBAC ServiceAccount, bound through a RoleBinding to a Role that gives permission to delete pods in the namespace. We would use the k8s Downward API to expose the pod's own name to the logic in the container, which could then use the k8s go client to ask for self destruction.
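A minimal sketch of the RBAC and Downward API pieces this would require (names are illustrative):

# Sketch only: a Role allowed to delete pods in the namespace, bound to the
# service account used by the placeholder pods.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: self-destruct
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: self-destruct
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: self-destruct
subjects:
- kind: ServiceAccount
  name: user-placeholder    # hypothetical service account for the placeholder pods
---
# And in the placeholder pod spec, the Downward API exposes the pod's own name:
#   env:
#   - name: POD_NAME
#     valueFrom:
#       fieldRef:
#         fieldPath: metadata.name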

Affinities to improve further

We can improve this solution strategy of rescheduling user-placeholder pods further by setting a soft anti-affinity of the user pods for the user-placeholder pods and vice versa. Like this we would avoid the situation where a user pod ends up on a node that was only considered busier because of its placeholders, even though there was a less busy node with only real users on it.

It is important to note that these affinities are either met or unmet: for a soft pod anti-affinity, having five user-placeholder pods on a node counts as being just as bad as having one. This means that if we, for example, rescheduled pods one at a time, we could end up in situations where these affinities fail to help.
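A minimal sketch of such a soft anti-affinity on the user pods towards the placeholder pods (the weight is illustrative; the component labels match what the chart uses):

# Sketch only: a preferred (soft) pod anti-affinity nudging user pods away
# from nodes that run user-placeholder pods.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values:
            - user-placeholder
        topologyKey: kubernetes.io/hostname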

Improve further again - let placeholder pods use the default scheduler

Currently, we are scheduling the user-placeholder pods with the user-scheduler, which packs them together, but we don't want this. We should instead use the default scheduler, which will try to distribute workloads evenly across the nodes through various scoring mechanisms.

This change should be done no matter what, I think.

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/d25bcd9e38aca2734f90e51c2bc9a0bf9d90ca3e/jupyterhub/templates/scheduling/user-placeholder/statefulset.yaml#L33-L35
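The change would essentially be to stop setting schedulerName on the placeholder statefulset (the lines linked above), so the pods fall back to the default scheduler; roughly along these lines:

# Sketch only: either omit schedulerName entirely, or set the default explicitly.
spec:
  template:
    spec:
      schedulerName: default-scheduler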

Solution discussion

This would be both an image locality optimization and an optimization to schedule on the busier nodes in order to increase the likelihood of being able to scale down a node. It would also be very plausible to do.

Conclusions

The internal configurable timed pod deletion of user-placeholder pods seems like the best idea to me. I also like the idea of making a generic self-destruct binary. If there is no such go binary along with a docker image already, I'd like to make one as a standalone micro open source project!

Retrospective - What is the desired dynamics and the desired outcomes?

We want to figure out what the simplest rule set is that reaches as far as possible towards the desired dynamics, which were understood to lead to the desired outcomes. To figure that out, we first need to clarify the desired outcomes.

Desired outcomes

Ideal rule set

I think this single rule would lead to the desired outcomes:

Currently, this isn't the case, because the user-scheduler will consider the resource requests of user-placeholder pods, and also won't consider pre-empting a user-placeholder pod when it can schedule without doing so.

Desired decisions during various relevant cluster states

Assume we can move around user placeholder pods, but not real user pods once they are scheduled. How would we move around the placeholder pods, and how would we schedule the real user pods?

  1. The cluster has a low average resource utilization and could scale down if pods moved around properly.
  2. The cluster could not fit all pods on one less node, and one node has pre-pullers that haven't finished pulling yet.

Work to be done

yuvipanda commented 5 years ago

I've missed these amazingly thought out and researched issues when I've been out, @consideRatio <3

The self-destruct needs to happen in only the following case:

  1. There is a new, empty node
  2. We are not on it

We need to also try and make sure our self destruct would actually schedule us onto the new node, which might not always happen. Otherwise we'll end up with pod churn, which can cause problems. I deleted 60 pods a second for an hour and discovered that nodes just do not like that and fail!

This makes me think we need the destruction to happen in a program with a global view of all pods and nodes rather than with a local view of itself. I wrote up a jq + bash script that does that, although it was far too aggressive. My hope is that we can use this descheduler strategy: https://github.com/kubernetes-sigs/descheduler#removepodsviolatinginterpodantiaffinity. It won't remove user pods since they are standalone, but should remove the statefulset pods. I'll check if it only cares about hard or soft affinities.
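For reference, a minimal sketch of what enabling that strategy in a descheduler policy would look like, based on the strategy format in the descheduler README (verify against the descheduler version used):

# Sketch only: descheduler policy enabling the strategy linked above.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: true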

yuvipanda commented 5 years ago

Based on https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/strategies/pod_antiaffinity.go#L100 the descheduler doesn't care about soft anti-affinity, so it isn't useful for us.

betatim commented 5 years ago

Could the image pre-puller pod cordon the node or otherwise taint it, so that user pods and placeholder pods can't schedule on it? Then when it exits/finishes pulling, it uncordons the node, which becomes available for user pods to schedule on.

Another thought I had: could user pods have a required during scheduling anti-affinity to image puller pods? Would that prevent them from scheduling on a node with an active puller or also from scheduling on a node with a completed puller?

consideRatio commented 5 years ago

@yuvipanda I didn't understand the "we are not on it" idea. Do you mean these points to be a rule set to indicate that we should trigger rescheduling of user-placeholder pods? I think a suitable trigger rule could be:

  1. There is a new node
  2. The node has become possible to schedule on for our pre-puller daemonset pods (they have the same set of hard affinities as the user pods and user placeholder pods).

I don't want to tunnel vision about having this logic in place yet though, I want to make sure I'm happy about what kind of distribution of real user pods and user placeholder pods we would like to see in various scenarios, and only then how to foster this.

Hmmm... If we do want a trigger to reschedule user-placeholder pods, it could perhaps be sent out just before creating a real user pod, or similar? Note that higher priority pods will be scheduled before lower priority pods if both are considered in the same cycle. We could also attempt to monitor pending pods, but the thing is that we need to react before the scheduling starts, which is also triggered by pending pods ^^. Also note that we would need to reschedule all the user-placeholder pods, not only some, because we don't know which specific pod will be blocking the spot where the real user pod wants to go...

@betatim

Could the image pre-puller pod cordon the node or otherwise taint it, so that user pods and placeholder pods can't schedule on it? Then when it exits/finishes pulling, it uncordons the node, which becomes available for user pods to schedule on.

This system is very much like how GPU nodes are handled: they are not made available until they have got their GPU devices attached and drivers installed. This requires a daemonset and communication with the API server using cluster-wide privileges. I think it is too crude an approach.

Another thought I had: could user pods have a required during scheduling anti-affinity to image puller pods? Would that prevent them from scheduling on a node with an active puller or also from scheduling on a node with a completed puller?

The puller pods all come from a daemonset, so they will be around at all times and not only during pulling. The pulling is done by init containers; once those complete, the pod just enters its main container, which is the pause container that simply sleeps. The closest practical idea like this in my mind is what I described under the "Required node labels" header in the original post. It only addresses one of the two relevant optimizations though, the image locality one, and doesn't contribute to reducing the scale-down wait times.
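To illustrate the pattern described above, a minimal sketch of a continuous pre-puller daemonset pod (names and images are placeholders, not the chart's actual templates):

# Sketch only: each init container exists just to force the node to pull an
# image; the main container is the pause container, which only sleeps.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: continuous-image-puller
spec:
  selector:
    matchLabels:
      component: continuous-image-puller
  template:
    metadata:
      labels:
        component: continuous-image-puller
    spec:
      initContainers:
      - name: pull-singleuser-image
        image: my-user-image:latest          # placeholder image name
        command: ["/bin/sh", "-c", "echo pulled"]
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1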

yuvipanda commented 5 years ago

This system is very much like how GPU nodes are handled: they are not made available until they have got their GPU devices attached and drivers installed. This requires a daemonset and communication with the API server using cluster-wide privileges. I think it is too crude an approach.

I think this is actually very relevant. Let's take the idea of 'Ready' from a Kubernetes node. For our purposes, there are three kinds of readiness:

  1. Ready to receive any pods. This is what Kubernetes currently counts as 'Ready'
  2. Ready to receive user placeholder pods. This is the same as (1)
  3. Ready to receive user pods. This is not something we capture yet - it should be true after the images have been pulled, and not before.

"pod ready++" (https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0007-pod-ready%2B%2B.md) is a feature that acknowledges this need and makes it a first class feature for pods. Nothing like it exists for nodes yet though.

yuvipanda commented 5 years ago

More discussion about exactly our needs in https://github.com/kubernetes/kubernetes/issues/75890

yuvipanda commented 5 years ago

And a possible implementation coming in a future k8s version :) https://github.com/kubernetes/enhancements/pull/1003

consideRatio commented 5 years ago

@yuvipanda nice find!

It is quite a crude approach and a lot of machinery still. The crude parts are in my mind:

I'd like it more if the only thing that was done was to label the node in some way and add a hard affinity to that label for our real user pods. This could make everything work well even if multiple hubs run in the same cluster etc. The implementation could be done within one or two init containers in the pre-puller pods, which run first and last for example.

yuvipanda commented 5 years ago

I agree! I'm currently running the following in a loop:

from kubernetes import client, config
config.load_kube_config()

v1 = client.CoreV1Api()
namespace = 'datahub-prod'

attractor_label = 'hub.jupyter.org/attract-placeholders'

def label_newest_nodes():
    # Get all user-purpose nodes, sorted newest first
    nodes = sorted(v1.list_node(label_selector='hub.jupyter.org/node-purpose=user').items, key=lambda n: n.metadata.creation_timestamp, reverse=True)

    labeling_event = False

    for i, node in enumerate(nodes):
        if i == 0:
            # First node, ensure it has our attractor label
            if attractor_label not in node.metadata.labels:
                # Our youngest node doesn't have this label!
                node.metadata.labels[attractor_label] = 'true'
                v1.patch_node(node.metadata.name, node)
                print(f'Adding label to {node.metadata.name}')
                labeling_event = True
        else:
            if attractor_label in node.metadata.labels:
                # Setting the value to None removes the label when patching
                node.metadata.labels[attractor_label] = None
                v1.patch_node(node.metadata.name, node)
                print(f'Removing label from {node.metadata.name}')
                labeling_event = True

    if labeling_event:
        print('deleting pods')
        v1.delete_collection_namespaced_pod(namespace, label_selector='component=user-placeholder')

The statefulset for placeholder pods has the following:

      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: hub.jupyter.org/attract-placeholders
                operator: In
                values:
                - "true"
            weight: 100
          - preference:
              matchExpressions:
              - key: hub.jupyter.org/node-purpose
                operator: In
                values:
                - user
            weight: 100
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: component
                  operator: In
                  values:
                  - singleuser-server
              topologyKey: kubernetes.io/hostname
            weight: 100

We set a label on the newest node, and then kill all the placeholder pods every time we change the label. I should make it so it only kills placeholder pods not on the newest node...

Let's see how this goes!

consideRatio commented 5 years ago

Dear pink friend

I write to you in order to process tough questions, confident you will look at it thoroughly.

[image]

Mental simulation of dynamics with required node labels

Assume we maintain a label on nodes lacking the latest images to be pulled. What would happen in general? Generally, I conclude that user-placeholder pods are only scheduled when there is room for both them and the user pods.

Event 1: We scale up because user-placeholder pods were evicted, perhaps even all of them, and real user pods went pending.

Event 2: The node becomes ready for pods, but not yet for user pods.

The pending user-placeholder pods will schedule there!

Event 3: A user pod leaves the already user ready node.

A pending real user pod will take its place if there are any, or a user-placeholder pod if there are no real user pods. Excellent!

Event 4: User pods drop away and we end up with low cluster utilization

Undesired outcomes

Let U denote a user pod, P a placeholder pod, and + an empty spot on a node. Also assume the nodes only fit five spots.

Could have been an issue, but wasn't...

Ways to influence the dynamics

When do we have certain issues btw?

Current thinking summarized

Making the real user pods require the dynamically managed images-are-pulled label is fine as long as there are user placeholder pods that can still trigger scale up; then we solve a lot of issues there in the only way I see as reasonable.

Then the issue that remains is that user pods may schedule on a seemingly busy node because it has lots of user placeholder pods on it but no actual users... Below I imagine some states that I think are plausible, and which would make various affinities less effective at avoiding the wrong scheduling behavior:

1: PPP++, UU+++ --- It is simply correct to schedule U on the right node, but it is harder to tell where P should schedule, as the left node cannot scale down unless it is below 50% resource utilization. At the same time, this would only be an issue if the placeholder pods make up more than 50% of a node.
2: PPPU+, UU+++ --- We should schedule U on the right node again, and actually in all the examples below as well, because the right side has the most U's.
2: PPP++, PU+++
2: PPPU+, PUU++

Hmmm... So....

consideRatio commented 5 years ago

@yuvipanda I think your script would struggle to resolve the situation alone. If you end up with a new node that is schedulable at all, the first pending pods to schedule would be the user pods, as they have higher priority. So, I figure the essence is that you cannot make the user-placeholder pods schedule first on the fresh node unless you also disallow the real user pods from scheduling there, using a label that you require as a hard affinity for the real users but not for the placeholders.

Oh hmm... Ah, but I guess the real users would choose not to schedule there in the first place; they would schedule on the most resource-utilized node, and THEN the user placeholder pods would schedule on the attracted node...

Nevermind, complicated, may work!

consideRatio commented 5 years ago

I'm on a train of thoughts, but also on rails...

Four affinity rules to consider

What are essential affinities that should carry the most weight? How does the weighting work btw? Hmmm...

U to U affinity is the greatest and always most important. U anti P, and P anti U, are good for maintaining separation but less important than the U to U affinity. P to P... Hmmmm... Is the dislike of U the most important one? I think so, as it could otherwise block future U's from scheduling next to their own. So this can help slightly but should be weighted the least.
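A sketch of how such a weighting could be expressed on the user pods, with illustrative weights reflecting the ordering above (the component labels match what the chart uses):

# Sketch only: soft affinities on a user pod; the weights are illustrative.
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                      # U to U: most important
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values:
            - singleuser-server
        topologyKey: kubernetes.io/hostname
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50                       # U anti P: helps separation, weighted less
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values:
            - user-placeholder
        topologyKey: kubernetes.io/hostname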

helpful tech

yuvipanda commented 5 years ago

I'm deploying https://github.com/berkeley-dsep-infra/datahub/pull/1050 now, will keep you posted on how it goes! It isn't a long term solution, just a fix for now.

consideRatio commented 5 years ago

Reference

yuvipanda commented 4 years ago

I've now removed my workaround in https://github.com/berkeley-dsep-infra/datahub/pull/1657, since it was complex and difficult to use.