jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

External Cluster Environments #1244

Open Shrinjay opened 1 year ago

Shrinjay commented 1 year ago

Problem Statement

This pull request addresses issue #1235 and enables multi-cluster operation for Jupyter Enterprise Gateway. To reiterate the problem: Enterprise Gateway currently launches kernels only on the cluster where it is itself running. This is limiting when we want a kernel to access resources on a remote cluster without running Enterprise Gateway on that cluster, a situation that often arises in the interest of security and of isolating internal services from production services. While the Kubernetes client supports connecting to a remote cluster and launching/managing resources there, Enterprise Gateway does not yet surface that capability.
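As a minimal sketch (not Enterprise Gateway's actual code), the official `kubernetes` Python client already distinguishes the two modes; the helper name and parameter below are illustrative:

```python
from kubernetes import client, config

def make_core_api(kubeconfig_path=None):
    """Build a CoreV1Api bound to either the local cluster or a remote one."""
    if kubeconfig_path:
        # Remote cluster: the client targets whatever the kubeconfig points to.
        config.load_kube_config(config_file=kubeconfig_path)
    else:
        # In-cluster: use the pod's service account (EG's current behavior).
        config.load_incluster_config()
    return client.CoreV1Api()
```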

Feature Description

The changes in this PR implement the ability for users to provide a kubeconfig file that Jupyter Enterprise Gateway then uses to launch kernels on the remote cluster the kubeconfig points to.

Operator Instructions:

  1. Ensure your two clusters have interconnected networks. Pods in the two clusters must be able to communicate with each other over pod IP alone.
  2. Provide a kubeconfig file for the remote cluster in the config/ subdirectory of the etc/kubernetes/helm/enterprise-gateway chart (a quick connectivity check is sketched after this list).
  3. Set externalCluster.enabled to true.

Implementation Details

When the operator enables external cluster operation and installs/upgrades the helm chart, the following steps occur:

When the enterprise gateway pod starts:

When a kernel is launched:

Assuming the network interconnection is set up correctly, the kernel pod should now launch and be able to communicate with Enterprise Gateway.
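To make the launch step concrete, here is a hedged sketch of how the gateway side could resolve a remote kernel pod's address once the pod is scheduled, assuming `remote_api` was built from the operator-provided kubeconfig; the function and names are illustrative, not the PR's actual identifiers:

```python
from kubernetes import client

def kernel_pod_ip(remote_api: client.CoreV1Api, namespace: str, name: str) -> str:
    """Return the kernel pod's IP; with interconnected pod networks,
    this address is directly routable from the Enterprise Gateway pod."""
    pod = remote_api.read_namespaced_pod(name=name, namespace=namespace)
    if not pod.status.pod_ip:
        raise RuntimeError(f"Pod {namespace}/{name} has no IP yet; poll again.")
    return pod.status.pod_ip
```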

Some notes on the implementation

welcome[bot] commented 1 year ago

Thanks for submitting your first pull request! You are awesome! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

Shrinjay commented 1 year ago

Seems it was mentioned in another PR that the python interrupt test failures are a red herring, is that correct?

kevin-bates commented 1 year ago

Hi @Shrinjay - thank you for providing this pull request. I hope to be able to start reviewing this sometime this afternoon or tomorrow (PST).

Seems it was mentioned in another PR that the python interrupt test failures are a red herring, is that correct?

That is correct.

Shrinjay commented 1 year ago

Hello @kevin-bates ! Happy to hear, looking forward to the review :)

lresende commented 1 year ago

Are these merely to access resources on a different cluster in general? Or do we have the intention to enable mapping userA to clusterA and userB to clusterB? And how is that done?

kevin-bates commented 1 year ago

Good question @lresende. Reading the (excellent!) description, I believe this is more of a one-time thing: EG is either managing kernel pods within the cluster in which it resides, or managing kernel pods within an external cluster to which it has access but in which it does not reside. If that statement is correct, it might be better to update the title (and feature name) to something like External Cluster Support, or similar, since the current title can imply support for multiple, simultaneous clusters.

@Shrinjay - thoughts?

Shrinjay commented 1 year ago

@kevin-bates I absolutely agree. I didn't think about it like that, but calling it External Cluster and switching the helm charts to use the prefix externalCluster makes sense.

Shrinjay commented 1 year ago

Updated!

kevin-bates commented 1 year ago

Hi @Shrinjay - gentle ping - could you please address the review comments?

Shrinjay commented 1 year ago

@kevin-bates My deepest apologies, I was working on this as part of an initiative at work, but I got pulled off to work on something else and this completely slipped my mind. I'm back on this now, and my contract is coming to an end, so I'll be addressing these comments and looking to get this PR done within the upcoming weeks.

Shrinjay commented 1 year ago

@kevin-bates Made some updates based on your suggestions; however, your comment on testing got me thinking, and I can't figure out a great way to test this repeatably. Currently we just test this manually; the only success criterion is that a kernel is able to launch, since everything after that is the same as if it were running on a local cluster. I can't really tell whether the process proxies and such are covered by the itests as well. If they are, we could replicate the process-proxy test harness for this. Any guidance?

kevin-bates commented 1 year ago

Hi @Shrinjay - thanks for the update.

Regarding testing, we have no tests for Kubernetes process proxies. The integration test we have uses the enterprise-gateway-demo image to test process proxies YarnClusterProcessProxy and DistributedProcessProxy (in loopback mode). I'm building a mock layer in Gateway Provisioners to try to improve this, but, for now, we can keep things as is and rely on manual testing.
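Until that mock layer lands, a unit-level test could stub the Kubernetes client itself; the sketch below uses a hypothetical `launch_kernel_pod` helper (not an actual EG function) purely to show the pattern:

```python
from unittest.mock import MagicMock
from kubernetes import client

def launch_kernel_pod(api, namespace, manifest):
    """Hypothetical stand-in for the process-proxy launch path."""
    return api.create_namespaced_pod(namespace=namespace, body=manifest)

def test_launch_targets_requested_namespace():
    api = MagicMock(spec=client.CoreV1Api)
    manifest = {"metadata": {"name": "kernel-abc"}}
    launch_kernel_pod(api, "kernel-ns", manifest)
    api.create_namespaced_pod.assert_called_once_with(
        namespace="kernel-ns", body=manifest
    )
```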

My biggest areas of concern are the behaviors around BYO Namespace (where the client pre-creates the namespace and references that name in KERNEL_NAMESPACE in the start-kernel request) and Shared Namespace (where the kernels reside in the EG namespace). The latter doesn't really apply since namespaces can't span clusters, but it would be good to understand the behavior, especially since this is the default in Gateway Provisioners. However, if the other cluster happens to also define a namespace with the same name, what happens? I think we should probably raise an exception in this case.
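A hedged sketch of that exception-raising behavior, assuming a CoreV1Api client bound to the external cluster; the helper name and message are illustrative:

```python
from kubernetes import client
from kubernetes.client.rest import ApiException

def ensure_namespace_absent(api: client.CoreV1Api, namespace: str) -> None:
    """Refuse to reuse a same-named namespace on the external cluster."""
    try:
        api.read_namespace(name=namespace)
    except ApiException as e:
        if e.status == 404:
            return  # not present on the external cluster; safe to proceed
        raise
    raise RuntimeError(
        f"Namespace '{namespace}' already exists on the external cluster; "
        "refusing to treat it as shared."
    )
```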

I will try to spend some time reviewing this PR in the next few days.