jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

[Feature] Support for configure magic on Spark Python Kubernetes Kernels (WIP) #1105

Open rahul26goyal opened 2 years ago

rahul26goyal commented 2 years ago

Problem Statement

With JEG running on a remote machine and handling the kernel life cycle, notebook users can no longer change kernel specs / properties locally to update the configuration with which the Spark kernel comes up. There are various use cases where users want to experiment with different Spark configurations to arrive at the final configs that best suit their workload. These configs might also vary from one notebook to another, depending on the workload the notebook is running. JEG is also used as a multi-tenant service where each user might want to tweak the kernel for their own scenario. Thus, users need to be able to update the kernel / Spark properties at runtime from the notebook.

Feature Description

The changes proposed in this PR add support for the well-known magic %%configure -f {}, which allows notebook users to change Spark properties at runtime without having to create / update any kernel spec file. This would allow users to change Spark driver and executor resources (such as cores and memory), enable / disable Spark configuration options, etc.

Example: The snippet below can be copied into a notebook cell to update the various Spark properties associated with the current kernel.

%%configure -f 
{
  "driverMemory": "3G",
  "driverCores" : "2",
  "executorMemory" : "3G",
  "executorCores" : "2",
  "numExecutors" : 5,
  "conf" : {
      "spark.kubernetes.driver.label.test": "test-label"
  }
}
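To make the effect of such a cell more concrete, here is a minimal sketch of how a payload like the one above could be flattened into a `--conf` string suitable for KERNEL_EXTRA_SPARK_OPTS. The function name `to_spark_opts` and the key mapping are assumptions for illustration and are not taken from this PR; the Spark property names on the right-hand side are standard Spark settings.

```python
import shlex

# Hypothetical mapping from %%configure keys to Spark properties.
# The mapping is an assumption; the PR does not spell it out.
KEY_TO_SPARK_PROP = {
    "driverMemory": "spark.driver.memory",
    "driverCores": "spark.driver.cores",
    "executorMemory": "spark.executor.memory",
    "executorCores": "spark.executor.cores",
    "numExecutors": "spark.executor.instances",
}


def to_spark_opts(payload: dict) -> str:
    """Flatten a %%configure-style payload into '--conf key=value' options."""
    props = {}
    for key, value in payload.items():
        if key == "conf":
            continue  # free-form properties are merged in below
        if key not in KEY_TO_SPARK_PROP:
            raise ValueError(f"unsupported configure key: {key}")
        props[KEY_TO_SPARK_PROP[key]] = str(value)
    # Entries under "conf" are passed through as-is and override the shortcuts.
    props.update({k: str(v) for k, v in payload.get("conf", {}).items()})
    return " ".join("--conf " + shlex.quote(f"{k}={v}") for k, v in props.items())


if __name__ == "__main__":
    example = {
        "driverMemory": "3G",
        "numExecutors": 5,
        "conf": {"spark.kubernetes.driver.label.test": "test-label"},
    }
    print(to_spark_opts(example))
```

Rejecting unknown keys, rather than silently dropping them, is one possible design choice here: it surfaces typos in the cell instead of restarting the kernel with a partially applied configuration.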

Implementation Details

The high-level changes are:

  1. I have introduced a new API on JEG, POST api/configure/<kernel_id>, which accepts a payload similar to the create kernel API. This API currently supports updating the ["KERNEL_EXTRA_SPARK_OPTS", "KERNEL_LAUNCH_TIMEOUT"] env variables.
  2. The above API tries to restart the same kernel with the updated configuration. This is done because we want to keep the kernel_id the same and give a smooth end-user experience.
  3. Once the old kernel goes away and a replacement comes up, we also need to refresh the ZMQ sockets to establish the connection with the new kernel so that existing active websocket connections from notebook / JupyterLab UI clients continue to work. Hooks are introduced to handle this.
  4. Further, in order to complete the usual Jupyter kernel messaging handshake, we fire the missing ZMQ messages from JEG to the websocket clients. Example: to mark the completion of the current cell, we need to send the execute_reply message, and to mark the kernel idle, we need to send the kernel status=idle message, etc. These messages are pre-generated on the kernel and sent to JEG when making the API call to refresh the kernel.
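As a rough illustration of steps 1 and 2, a client-side call to the proposed endpoint might be built as follows. This is only a sketch: the body shape `{"env": {...}}` is assumed from the "similar to the create kernel API" wording, and the helper name `build_configure_request` and the gateway URL are hypothetical. The request is constructed but not sent.

```python
import json
import urllib.request

# Env variables the description says the API currently supports.
ALLOWED_ENV = ("KERNEL_EXTRA_SPARK_OPTS", "KERNEL_LAUNCH_TIMEOUT")


def build_configure_request(gateway_url: str, kernel_id: str, env: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to api/configure/<kernel_id>."""
    rejected = sorted(set(env) - set(ALLOWED_ENV))
    if rejected:
        raise ValueError(f"env variables not supported by configure: {rejected}")
    # Assumed body shape, modeled on the create kernel payload.
    body = json.dumps({"env": env}).encode("utf-8")
    return urllib.request.Request(
        url=f"{gateway_url.rstrip('/')}/api/configure/{kernel_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_configure_request(
        "http://localhost:8888",  # hypothetical gateway address
        "0000-kernel-id",         # placeholder kernel id
        {"KERNEL_EXTRA_SPARK_OPTS": "--conf spark.driver.memory=3G"},
    )
    print(req.get_method(), req.full_url)
```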

I will update more details about the changes and add some diagrams.
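Until then, the handshake messages mentioned in step 4 can be sketched roughly as below. The message layout (header, parent_header, metadata, content) and the `execute_reply` / `status` content fields follow the Jupyter messaging protocol; the helper names `make_message` and `completion_messages`, the "kernel" username, and the idea of returning both messages as a list are assumptions for illustration, not the PR's actual code.

```python
import uuid
from datetime import datetime, timezone


def make_message(msg_type: str, content: dict, parent_header: dict, session: str) -> dict:
    """Build a Jupyter-protocol message dict (unsigned, not yet serialized)."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "session": session,
            "username": "kernel",  # assumed placeholder
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": msg_type,
            "version": "5.3",
        },
        # Ties the message to the execute_request of the %%configure cell.
        "parent_header": parent_header,
        "metadata": {},
        "content": content,
    }


def completion_messages(parent_header: dict, session: str, execution_count: int) -> list:
    """Messages a client expects before it treats the cell as finished."""
    reply = make_message(
        "execute_reply",
        {"status": "ok", "execution_count": execution_count,
         "user_expressions": {}, "payload": []},
        parent_header,
        session,
    )
    idle = make_message("status", {"execution_state": "idle"}, parent_header, session)
    return [reply, idle]


if __name__ == "__main__":
    parent = {"msg_id": "abc123", "msg_type": "execute_request"}
    for msg in completion_messages(parent, session="s1", execution_count=4):
        print(msg["header"]["msg_type"], msg["content"])
```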

Testing

Note

Opening this PR for some early feedback and discussion on the changes.

rahul26goyal commented 2 years ago

@kevin-bates : I need help deciding the right terminology for the operation we are performing on the kernel using this new configure API:

  1. Do we call it "refreshing kernel" or "re-configuring kernel"?
  2. Do we change the API to api/refresh/<kernel_id> and call this the "refreshing kernel" operation?

We need to use this term in both logs and response messages. Please give this some thought.

kevin-bates commented 2 years ago

> @kevin-bates : I need help deciding the right terminology for the operation we are performing on the kernel using this new configure API:
>
>   1. Do we call it "refreshing kernel" or "re-configuring kernel"?
>   2. Do we change the API to api/refresh/<kernel_id> and call this the "refreshing kernel" operation?
>
> We need to use this term in both logs and response messages. Please give this some thought.

I guess refresh seems a little easier to understand than reconfigure. Does this imply the magic name would change to %%refresh and does that conflict with existing magics? I think having the terminology match the magic name would be helpful.

I would also like to see the endpoint be under api/kernels rather than a sibling to api/kernels. Do you agree? If not, could you please help me understand why not? Is adding an endpoint under api/kernels violating some kind of convention?
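For comparison, the two layouts can be written as route patterns. This is only a sketch: the nested path `api/kernels/<kernel_id>/configure` is a hypothetical spelling of the suggestion above (nothing in this thread fixes the exact path), and the kernel id pattern is an assumption modeled on a UUID-style id.

```python
import re

# Assumed UUID-style kernel id pattern.
KERNEL_ID = r"(?P<kernel_id>\w+-\w+-\w+-\w+-\w+)"

# Layout currently proposed in the PR: a sibling of api/kernels.
SIBLING_ROUTE = re.compile(rf"^/api/configure/{KERNEL_ID}$")

# Hypothetical nested layout, placing the action under the kernel resource.
NESTED_ROUTE = re.compile(rf"^/api/kernels/{KERNEL_ID}/configure$")


def extract_kernel_id(path: str):
    """Return the kernel id if the path matches either layout, else None."""
    for route in (SIBLING_ROUTE, NESTED_ROUTE):
        match = route.match(path)
        if match:
            return match.group("kernel_id")
    return None


if __name__ == "__main__":
    kid = "5e897b4f-9b8a-4c1e-8d2a-0a1b2c3d4e5f"
    print(extract_kernel_id(f"/api/configure/{kid}"))
    print(extract_kernel_id(f"/api/kernels/{kid}/configure"))
```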

kevin-bates commented 2 years ago

Hi @rahul26goyal - what is the status of this PR since it's been about 6 weeks since its last update and it seems there are a few things still to work out?