kcp-dev / kcp

Kubernetes-like control planes for form-factors and use-cases beyond Kubernetes and container workloads.
https://kcp.io
Apache License 2.0

Prototype 3: Transparent Multi-Cluster Cordon/Drain End-User Demo #415

Closed by pweil- 2 years ago

pweil- commented 2 years ago

Demo Objective

User has a multi-cluster placeable application that can move transparently

Demo Steps

  1. User creates a stateless web application, which is assigned to a physical cluster
  2. Physical cluster admin wants to perform maintenance on the cluster, but limit workload disruption
  3. Admin marks the cluster as cordoned (Unschedulable: true) -- new workloads are not assigned to the cluster
  4. Admin marks the cluster as drained/draining (EvictAfter: $now) -- existing workloads are rescheduled to another cluster, with some observed downtime (see the sketch after these steps)
  5. With no workloads now scheduled on the cluster, the admin is free to operate on it, upgrade it, uninstall syncer, delete the cluster, etc.
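
For illustration, here is a minimal sketch of what steps 3 and 4 might look like from the admin's side. The Unschedulable/EvictAfter field names come from the steps above; the resource name (cluster) and exact field paths are assumptions, since the API shape is still being worked out.

```sh
# Hypothetical sketch only: the "cluster" resource name and field paths are
# assumptions; Unschedulable/EvictAfter come from the demo steps above.

# Step 3: cordon the physical cluster so no new workloads are assigned to it.
kubectl patch cluster us-east1 --type=merge \
  -p '{"spec":{"unschedulable":true}}'

# Step 4: drain -- existing workloads should be rescheduled elsewhere after this time.
kubectl patch cluster us-east1 --type=merge \
  -p "{\"spec\":{\"evictAfter\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}}"
```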

Action Items

Nice to have

pweil- commented 2 years ago

Notes from discussion:

User logs into the cluster with kubectl kcp login and some OIDC authn (e.g. GitHub)

if there is an existing kubectl login plugin we should just use it
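
If we do reuse an existing OIDC login plugin, one possible shape for the login step is sketched below, assuming the int128/kubelogin plugin (installed as kubectl oidc-login); the issuer URL and client ID are placeholders.

```sh
# Sketch only: assumes the int128/kubelogin plugin ("kubectl oidc-login") is
# installed; the issuer URL and client ID below are placeholders.
kubectl oidc-login setup \
  --oidc-issuer-url=https://accounts.example.com \
  --oidc-client-id=kcp-demo
```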

marun commented 2 years ago

User decides to migrate the app between Locations for Reasons:
Option 1: Admin is upgrading the cluster with downtime
Option 2: Cluster is out of capacity
Option 3: Cluster hard fails

Do these specific options matter for the purposes of the demo, vs just forcing a move from one cluster to another? AFAIK today we aren't even targeting allowing a 'User' (vs an admin) to choose a cluster to deploy to, let alone migrate between - the workload is just deployed to available compute (a la the 'Transparent' adjective of TCM).

Client has a multi-location security problem (needs a cert to access some external resource), THUS needs the https://github.com/kcp-dev/kcp/issues/416

Any context on why this is part of this flow?

kylape commented 2 years ago

Client has a multi-location security problem (needs a cert to access some external resource), THUS needs the https://github.com/kcp-dev/kcp/issues/416

Any context on why this is part of this flow?

It's a transition to the next demo afaict.

marun commented 2 years ago

Client-facing traffic sees no interruption (?)

@jmprusi Thoughts as to what will be required to demo this? I know this can work with a cloud lb, but will that also be required for ci or is there a lighter-weight way to ensure this flow is tested once supported?

kylape commented 2 years ago

Do these specific options matter for the purposes of the demo, vs just forcing a move from one cluster to another? AFAIK today we aren't even targeting allowing a 'User' (vs an admin) to choose a cluster to deploy to, let alone migrate between - the workload is just deployed to available compute (a la the 'Transparent' adjective of TCM).

I believe the reason is to create a narrative for the demo. It will be used as justification for an admin to remove the cluster from the user's workspace to force a migration.

marun commented 2 years ago

I believe the reason is to create a narrative for the demo. It will be used as justification for an admin to remove the cluster from the user's workspace to force a migration.

I guess I don't really get the point of forcing a specific narrative before we have actual implementation. Maybe it's supposed to be motivating, but it feels forced to me.

pweil- commented 2 years ago

Any context on why this is part of this flow?

Here is the community call where these were defined if it helps to review. https://www.youtube.com/watch?v=_9ilcimFyec

kylape commented 2 years ago

I don't think we need to decide on a specific migration reason at this point. Just that it means we need to demo workload migration by removing the cluster running the demo workload.

stevekuznetsov commented 2 years ago

"I need access to CUDA compute, so I move to a different data center" :)

jmprusi commented 2 years ago

Client-facing traffic sees no interruption (?)

@jmprusi Thoughts as to what will be required to demo this? I know this can work with a cloud lb, but will that also be required for ci or is there a lighter-weight way to ensure this flow is tested once supported?

So... let me braindump (sorry) here:

this gets really tricky when we add long-running connections, WebSockets or so... also a more advanced scenario would be to use the cluster gateways information to understand when the traffic has fully switched and then take down the workloads etc..

imjasonh commented 2 years ago

User decides to migrate the app between Locations for Reasons

If the reason is "the cluster is deleted", then this is effectively not demonstrating anything different from prototype 2 AFAIK. (That may be fine, for scoping down this prototype)

If the reason is "my pcluster scheduling constraints changed", then we need to design and implement scheduling constraints, which feels like a heavy lift. Same for designing and enforcing capacity as a scheduling constraint.

Maybe the best compromise is having the demo say "I've decided to manually move my app to europe-west to be closer to customers", which demos manual cordoning, eviction, etc., and lets ingress react with a more graceful, slower cutover. This also means we don't have to design/implement triggering that slow move automatically in response to TBD scheduling constraint changes, and can leave that for P4+.

It also means a future demo of "my app automatically detects it would get lower end-user latency by moving to europe-west and triggers that itself" is a natural automation of a previous demo milestone, should we ever get to that point.

marun commented 2 years ago

afaik the very concept of Locations is a topic of discussion. Maybe that should be the target for P3 - defining the mechanics of a cluster-abstraction concept (i.e. Location) at the workspace level that allows admins to hide the details of physical cluster association?

I'm still not clear what the Location abstraction implies wrt associating a given kcp namespace with compute capacity. I've been party to discussion suggesting that a given workspace could 'inherit' Locations from other workspaces and that a workspace would define a default Location for scheduling purposes.

It's less clear to me how a user would indicate their intent to prefer one location over another - is this 'scheduling constraints'? Without scheduling constraints, what mechanism would a user have for switching from one Location to another to satisfy P3? Or when we say 'user' do we really mean an 'administrator' that would have permission to remove a Location such that a namespace associated with it would be forced to be scheduled to another Location?

marun commented 2 years ago

this gets really tricky when we add long-running connections, WebSockets or so... also a more advanced scenario would be to use the cluster gateways information to understand when the traffic has fully switched and then take down the workloads etc..

What's the best way to demo this then? In terms of a stateless app it could be as simple as nginx serving hello world... But again, what endpoint will we be targeting for the demo that will ensure seamless handover from the application on one cluster to the application on another?

imjasonh commented 2 years ago

I'm still not clear what the Location abstraction implies wrt associating a given kcp namespace with compute capacity.

Same, I think it's under-designed so far. We don't really have a plan for limiting what the syncer can sync to a cluster, or bubbling up "syncer doesn't have capacity" to kcp. We would bubble up "workloads on the pcluster are unschedulable", whether that's pcluster-wide resource exhaustion or pcluster namespace quota limits. But so far we don't have anything that would limit a workspace's footprint on a pcluster, or even really where that enforcement happens (pcluster-scheduling-time? syncing time?)

It's less clear to me how a user would indicate their intent to prefer one location over another - is this 'scheduling constraints'?

Also under-designed at this time. At a high level users should say "put this workload where there's CUDA resources", or even more simply "put this workload in any N of M locations", but the language for that is still TBD. Clayton's talked about reusing node scheduling hints for pcluster scheduling, but I'm not convinced that's a good idea.

Without scheduling constraints, what mechanism would a user have for switching from one Location to another to satisfy P3? Or when we say 'user' do we really mean an 'administrator' that would have permission to remove a Location such that a namespace associated with it would be forced to be scheduled to another Location?

In the absence of a constraint language and automatic enforcement mechanism, we can at least demo "manually cordon us-east (by annotating it)", instead of P2's "forcibly unplug us-east", which would demo a more graceful rescheduling that allows the Ingress to move over without downtime.
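
As a rough illustration, the "manually cordon us-east (by annotating it)" step might look like the sketch below; the annotation key is purely hypothetical, and whether cordoning ends up as an annotation or a first-class spec field is still open.

```sh
# Hypothetical annotation key -- the real cordon mechanism is still TBD.
kubectl annotate cluster us-east demo.kcp.dev/cordon=true
# ...or, if cordoning becomes a spec field instead:
kubectl patch cluster us-east --type=merge -p '{"spec":{"unschedulable":true}}'
```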

What's the best way to demo this then? In terms of a stateless app it could be as simple as nginx serving hello world... But again, what endpoint will we be targeting for the demo that will ensure seamless handover from the application on one cluster to the application on another?

This could be demoed with a job pinging demo.example.com/hello every 100ms that doesn't see any 5XX errors, while in another window we see the deployment shift replicas from Location A to B. WebSockets are harder, so let's just ignore them for now.
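
A rough sketch of that probing job as a shell loop; demo.example.com/hello is the placeholder endpoint from above, and in the demo this would run in one window while the replicas shift in another.

```sh
# Hit the demo endpoint every ~100ms and report any 5XX responses.
# demo.example.com/hello is a placeholder endpoint, as above.
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' http://demo.example.com/hello)
  if [ "$code" -ge 500 ]; then
    echo "$(date -u +%H:%M:%S) got HTTP $code"
  fi
  sleep 0.1
done
```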

marun commented 2 years ago

This could be demoed with a job pinging demo.example.com/hello every 100ms that doesn't see any 5XX errors, while in another window we see the deployment shift replicas from Location A to B. WebSockets are harder, so let's just ignore them for now.

How is this going to work across multiple clusters? How do we enable transparent switching between applications in multiple clusters, except with some kind of intermediary (e.g. a proxy)?

To be clear, I'm looking for a way to validate this in CI as a precondition for having this be demoable, but a CI-testable option would likely work equally well for the demo.

imjasonh commented 2 years ago

How is this going to work across multiple clusters? How do we enable transparent switching between applications in multiple clusters, except with some kind of intermediary (e.g. a proxy)?

The steps for that are roughly what @jmprusi describes in https://github.com/kcp-dev/kcp/issues/415#issuecomment-1033967052

The pcluster being cordoned triggers the scheduler to duplicate the workload on some other cluster, including service+ingress, and the previous cluster's ingress proxies to the new one until some cutover.

It's quite a bit slower than just pulling the plug on the old cluster -- and might be slow enough that it means we can't practically cover it in CI -- but that's the price of zero downtime. I think we could even punt on total zero downtime if it's ~1s or something, and still demonstrate a more graceful reschedule than what P2 does today.

marun commented 2 years ago

The pcluster being cordoned triggers the scheduler to duplicate the workload on some other cluster, including service+ingress, and the previous cluster's ingress proxies to the new one until some cutover.

It's quite a bit slower than just pulling the plug on the old cluster -- and might be slow enough that it means we can't practically cover it in CI -- but that's the price of zero downtime. I think we could even punt on total zero downtime if it's ~1s or something, and still demonstrate a more graceful reschedule than what P2 does today.

I'm more than a little surprised that local proxying would be an end-goal here, or that it would be a reasonable way of ensuring zero-downtime.

@smarterclayton Maybe you can chime in as to your expectations?

imjasonh commented 2 years ago

I'm more than a little surprised that local proxying would be an end-goal here, or that it would be a reasonable way of ensuring zero-downtime.

I don't think local proxying is the end-goal at all, just a step along the path that's achievable in the immediate timeline.

pweil- commented 2 years ago

Right, let's keep in mind that we are exploring concepts that allow us to show a compelling vision of the value of something like KCP and enable others to poke at it for their use cases. We have to balance that need with what we think the long term engineering solutions may be.

davidfestal commented 2 years ago

Just a comment about:

  1. User lands in a default workspace.

Does this mean that the previous step kubectl kcp login <url> would transparently perform the equivalent of a kubectl kcp create workspace <default workspace name> --use, so that the user is directly inside a personal workspace?

For now, no workspace is created or listed by default for a user. I created and linked this follow-up issue https://github.com/kcp-dev/kcp/issues/488 to discuss this in more detail.
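
To make the two flows being compared here concrete, a small sketch follows; the commands are the ones discussed in this thread (not necessarily implemented yet), and the URL and workspace name are placeholders.

```sh
# Today: login and workspace creation are two explicit steps.
kubectl kcp login https://kcp.example.com
kubectl kcp create workspace my-workspace --use

# The open question: should the first command transparently perform the second,
# so the user "lands in" a personal default workspace right after login?
```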

sttts commented 2 years ago

cc @s-urbaniak for the authn aspect of this story and also to sync our view on "User lands in a default workspace". What does "land in" mean after a login command?

imjasonh commented 2 years ago

Demo Steps

  1. User creates a stateless web application, which is assigned to a PCluster (as in P2)
  2. PCluster admin wants to perform maintenance on the cluster, but limit workload disruption
  3. Admin marks the cluster as cordoned (Unschedulable: true) -- new workloads are not assigned to the cluster
  4. Admin marks the cluster as drained/draining (Evict: true) -- existing workloads are rescheduled to another cluster, with some observed downtime
  5. With no workloads now scheduled on the cluster, the admin is free to operate on it, upgrade it, uninstall syncer, delete the cluster, etc.

@ncdc @robszumski

chirino commented 2 years ago

This issue title seems kind of related to 2.1/2.2 in the transparent multi-cluster use case doc: https://docs.google.com/document/d/1LeYMt4I1No1W-tj6LCPuXE7kTSD-uggC8IRI6DVpHOM/edit?hl=en&forcehl=1#heading=h.kmn31tiyv4vs

Should we also be able to do a simpler demo where something like a LogicalClusterPolicy is updated to run a stateless app in multiple clusters? Seems like being able to run an app redundantly is the first step you need before you can move it without downtime.

ncdc commented 2 years ago

@chirino this has been scoped down to the updated set of demo steps now seen in the issue description. The net new features here are the cordoning and draining of a physical cluster.

Cluster placement/scheduling policies will come later, via separate issue(s).

imjasonh commented 2 years ago

This is done except for including it in the demo script.