jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

External Cluster Environments #1244

Open Shrinjay opened 1 year ago

Shrinjay commented 1 year ago

Problem Statement

This pull request addresses issue #1235 and enables multi-cluster operation for Jupyter Enterprise Gateway. To reiterate the problem: Enterprise Gateway currently launches kernels only on the cluster where it is itself running. This is limiting when we want a kernel to access resources on a remote cluster without running Enterprise Gateway on that cluster, a situation that often arises in the interest of security and of isolating internal services from production services. While the Kubernetes client supports connecting to a remote cluster and launching/managing resources there, Enterprise Gateway does not yet surface that capability.
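As a minimal sketch (not Enterprise Gateway's actual code), the official `kubernetes` Python client already distinguishes the two modes; the helper name and parameter below are illustrative:

```python
from kubernetes import client, config

def make_core_api(kubeconfig_path=None):
    """Build a CoreV1Api bound to either the local cluster or a remote one."""
    if kubeconfig_path:
        # Remote cluster: the client targets whatever the kubeconfig points to.
        config.load_kube_config(config_file=kubeconfig_path)
    else:
        # In-cluster: use the pod's service account (EG's current behavior).
        config.load_incluster_config()
    return client.CoreV1Api()
```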

Feature Description

The changes in this PR implement the ability for users to provide a kubeconfig file that Jupyter Enterprise Gateway then uses to launch kernels on the remote cluster the kubeconfig points to.

Operator Instructions:

  1. Ensure your two clusters have interconnected networks. Pods in the two clusters must be able to communicate with each other over pod IP alone.
  2. Provide a kubeconfig file for the remote cluster in the config/ subdirectory of the etc/kubernetes/helm/enterprise-gateway chart (a quick connectivity check is sketched after this list).
  3. Set externalCluster.enabled to true.

Implementation Details

When the operator enables external cluster operation and installs/upgrades the helm chart, the following steps occur:

When the enterprise gateway pod starts:

When a kernel is launched:

Assuming the network interconnection is set up correctly, the kernel pod should now launch and be able to communicate with Enterprise Gateway.
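To make the launch step concrete, here is a hedged sketch of how the gateway side could resolve a remote kernel pod's address once the pod is scheduled, assuming `remote_api` was built from the operator-provided kubeconfig; the function and names are illustrative, not the PR's actual identifiers:

```python
from kubernetes import client

def kernel_pod_ip(remote_api: client.CoreV1Api, namespace: str, name: str) -> str:
    """Return the kernel pod's IP; with interconnected pod networks,
    this address is directly routable from the Enterprise Gateway pod."""
    pod = remote_api.read_namespaced_pod(name=name, namespace=namespace)
    if not pod.status.pod_ip:
        raise RuntimeError(f"Pod {namespace}/{name} has no IP yet; poll again.")
    return pod.status.pod_ip
```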

Some notes on the implementation

welcome[bot] commented 1 year ago

Thanks for submitting your first pull request! You are awesome! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

Shrinjay commented 1 year ago

Seems it was mentioned in another PR that the python interrupt test failures are a red herring, is that correct?

kevin-bates commented 1 year ago

Hi @Shrinjay - thank you for providing this pull request. I hope to be able to start reviewing this sometime this afternoon or tomorrow (PST).

Seems it was mentioned in another PR that the python interrupt test failures are a red herring, is that correct?

That is correct.

Shrinjay commented 1 year ago

Hello @kevin-bates ! Happy to hear, looking forward to the review :)

lresende commented 1 year ago

Are these merely to access resources on a different cluster in general? Or do we have the intention to enable mapping userA to clusterA and userB to clusterB? And how is that done?

kevin-bates commented 1 year ago

Good question @lresende. Reading the (excellent!) description, I believe this is more of a one-time thing: EG is either managing kernel pods within the cluster in which it resides, or managing kernel pods within an external cluster to which it has access but in which it does not reside. If that statement is correct, it might be better to update the title (and feature name) to something like External Cluster Support, or similar, since the current title can imply support for multiple, simultaneous clusters.

@Shrinjay - thoughts?

Shrinjay commented 1 year ago

@kevin-bates I absolutely agree. I didn't think about it like that, but calling it External Cluster and switching the helm charts to use the prefix externalCluster makes sense.

Shrinjay commented 1 year ago

Updated!

kevin-bates commented 1 year ago

Hi @Shrinjay - gentle ping - could you please address the review comments?

Shrinjay commented 1 year ago

@kevin-bates My deepest apologies, I was working on this as part of an initiative at work, but I got pulled off to work on something else and this completely slipped my mind. I'm back on this now, and my contract is coming to an end, so I'll be addressing these comments and looking to get this PR done within the upcoming weeks.

Shrinjay commented 1 year ago

@kevin-bates Made some updates based on your suggestions; however, your comment on testing got me thinking, and I can't figure out a great way to test this repeatably. Currently we just test this manually; the only success criterion is that a kernel is able to launch, since everything after that is the same as if it were running on a local cluster. I can't really tell whether the process proxies and such are covered by the itests as well. If they are, we could replicate the process-proxy test harness for this. Any guidance?

kevin-bates commented 1 year ago

Hi @Shrinjay - thanks for the update.

Regarding testing, we have no tests for Kubernetes process proxies. The integration test we have uses the enterprise-gateway-demo image to test process proxies YarnClusterProcessProxy and DistributedProcessProxy (in loopback mode). I'm building a mock layer in Gateway Provisioners to try to improve this, but, for now, we can keep things as is and rely on manual testing.
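Until that mock layer lands, a unit-level test could stub the Kubernetes client itself; the sketch below uses a hypothetical `launch_kernel_pod` helper (not an actual EG function) purely to show the pattern:

```python
from unittest.mock import MagicMock
from kubernetes import client

def launch_kernel_pod(api, namespace, manifest):
    """Hypothetical stand-in for the process-proxy launch path."""
    return api.create_namespaced_pod(namespace=namespace, body=manifest)

def test_launch_targets_requested_namespace():
    api = MagicMock(spec=client.CoreV1Api)
    manifest = {"metadata": {"name": "kernel-abc"}}
    launch_kernel_pod(api, "kernel-ns", manifest)
    api.create_namespaced_pod.assert_called_once_with(
        namespace="kernel-ns", body=manifest
    )
```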

My biggest areas of concern are the behaviors around BYO Namespace (where the client pre-creates the namespace and references that name in KERNEL_NAMESPACE in the start-kernel request) and Shared Namespace (where the kernels reside in the EG namespace). The latter doesn't really apply since namespaces can't span clusters, but it would be good to understand the behavior, especially since this is the default in Gateway Provisioners. However, if the other cluster happens to also define a namespace with the same name, what happens? I think we should probably raise an exception in this case.
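A hedged sketch of that exception-raising behavior, assuming a CoreV1Api client bound to the external cluster; the helper name and message are illustrative:

```python
from kubernetes import client
from kubernetes.client.rest import ApiException

def ensure_namespace_absent(api: client.CoreV1Api, namespace: str) -> None:
    """Refuse to reuse a same-named namespace on the external cluster."""
    try:
        api.read_namespace(name=namespace)
    except ApiException as e:
        if e.status == 404:
            return  # not present on the external cluster; safe to proceed
        raise
    raise RuntimeError(
        f"Namespace '{namespace}' already exists on the external cluster; "
        "refusing to treat it as shared."
    )
```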

I will try to spend some time reviewing this PR in the next few days.