jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Decoupling notebooks from computation? #509

Closed: ziedbouf closed this 5 years ago

ziedbouf commented 5 years ago

This is not an issue per se, but more of an ideation around the Jupyter stack. For the past few weeks I have been using Databricks, and I like how they decouple notebooks from the compute infrastructure.

Also, I read an article published by Airbnb about the challenges they found scaling their analytics infrastructure. It seems they had to deal with the same issue of decoupling notebooks from computation in order to streamline their data analytics operations.

The Jupyter stack in general helps solve the same problem, but the cost of integrating its components together seems to be the main barrier.

I would like to know if there are any initiatives in this direction. In case we want a similar stack, how would we combine these Jupyter projects (JupyterHub + Jupyter Enterprise Gateway + JupyterLab) to get a comparable infrastructure?

kevin-bates commented 5 years ago

I briefly looked at the Airbnb slides, and what they're doing with RedSpot is pretty much what we're trying to address with our Kubernetes solution.

As you know, Kernel Gateway and Enterprise Gateway introduce a bring-your-own-notebook model by decoupling the notebook files from the kernel computation. Enterprise Gateway takes this one step further by decoupling the kernels from a specific server, i.e., it introduces the capability of running individual kernels in individual containers.

In our Kubernetes solution, each kernelspec is seeded with a kernel-pod.yaml file that can be tuned according to the configuration.
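For illustration, here is a trimmed sketch of what such a kernel pod definition can look like. The field values and template variables below are illustrative only; the actual file shipped with Enterprise Gateway is a template whose values (pod name, namespace, image, kernel id) are substituted by the launcher at kernel-start time, so check the kernelspecs in your installation for the real contents.

```yaml
# Illustrative kernel-pod.yaml sketch; values below are placeholders that
# EG's launcher would normally fill in per kernel.
apiVersion: v1
kind: Pod
metadata:
  name: my-kernel-pod            # unique, per-kernel name substituted at launch
  namespace: default             # or a dedicated per-kernel namespace
  labels:
    app: enterprise-gateway
    component: kernel
spec:
  restartPolicy: Never
  containers:
    - name: kernel
      image: elyra/kernel-py:VERSION   # kernel image; tune per kernelspec
      env:
        - name: KERNEL_ID
          value: "<kernel-id>"         # injected by the launcher
      resources:                       # per-kernel resource tuning happens here
        limits:
          cpu: "1"
          memory: 2Gi
```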

With the existing Jupyter stack, you can have JupyterHub serve as your authenticator and Notebook launch framework, such that the kernels launched on behalf of those Notebooks reside in a managed cluster (Kubernetes, Docker Swarm, YARN).

Because Enterprise Gateway decouples the kernels from the launching server, you have the added capability of not constraining the Notebook container resources - since the resources used by each kernel are in their own container, not the container of the Notebook server. This enables data scientists and analysts to launch multiple notebooks from a given server simultaneously w/o consuming all the resources on the Notebook server.
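As a concrete (hypothetical) illustration of that wiring, a JupyterHub deployment can point each spawned single-user Notebook server at a gateway by setting the gateway URL in the spawner's environment. The exact option and variable names depend on the notebook version (newer servers read `JUPYTER_GATEWAY_URL` via the built-in gateway client; older deployments used the NB2KG extension), and the service address below is only an example:

```python
# jupyterhub_config.py (sketch): point spawned notebook servers at an
# Enterprise Gateway instance so kernels run remotely, not in the user's pod.
# JUPYTER_GATEWAY_URL is read by the notebook server's gateway client; the
# URL shown here is a placeholder for your EG service address.
c.KubeSpawner.environment = {
    "JUPYTER_GATEWAY_URL": "http://enterprise-gateway.enterprise-gateway:8888",
}
```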

Regarding Deep Learning model support, we're looking into Notebook scheduling where a Notebook is submitted after the analyst has confirmed its content. @lresende can lend more insight to that if you're interested.

I hope this helps. Let's continue this discussion since this is exactly the kind of feedback we want to hear. Thank you.

ziedbouf commented 5 years ago

That's what I would like to achieve with the Jupyter stack. However, what I miss when using the Jupyter stack are the following:

- Scheduling notebooks on demand and, if possible, building a DAG for an end-to-end ETL/model execution using the Jupyter stack (possibly integrating notebooks with Airflow).
- Spinning up clusters on demand, with the ability to attach notebooks to any running cluster (Databricks is limited to Spark clusters, but the same concept could be generalized to attach to any cluster connected through Jupyter Enterprise Gateway, e.g. Spark, TensorFlow, or Dask/Ray).
- Databricks smartly uses spot instance capabilities on AWS to decrease the cost of infrastructure.

@kevin-bates how do you think we can achieve these features using the Jupyter stack?

kevin-bates commented 5 years ago

@lresende can speak more about scheduling notebooks and ETL/model execution.

Your second bullet sounds like you want the notebook running on the newly spun up cluster. However, the title of this topic implies a separation of the notebook from the compute resources (i.e., the notebook's kernel). I will assume you really want to talk about the latter.

EG should not get into the business of physically spinning up clusters itself. However, its pluggable proxy framework does not preclude anyone from writing a process proxy that manually spins up a cluster. Also keep in mind that various frameworks, like k8s, may "spin up a cluster" just by virtue of the parameters used in your existing kernel-pod.yaml file. In fact, the Spark-on-Kubernetes implementation will invoke a number of executor pods based on your parameters, so in that sense, it's spinning up a set of pods.
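To make that concrete, a custom process proxy is just a Python class that the kernelspec points at. The sketch below is hypothetical: the module, class, and method names follow EG's process-proxy layout but should be verified against the EG version in use, and the provisioning step is only a placeholder.

```python
# Hypothetical process proxy that provisions (or resizes) compute before
# delegating to the normal container-based kernel launch. Names should be
# checked against enterprise_gateway.services.processproxies; newer EG
# releases make launch_process a coroutine.
from enterprise_gateway.services.processproxies.k8s import KubernetesProcessProxy


class ClusterProvisioningProcessProxy(KubernetesProcessProxy):
    """Spin up the target cluster, then launch the kernel as usual."""

    def launch_process(self, kernel_cmd, **kwargs):
        self._ensure_cluster(kwargs)  # placeholder provisioning step
        return super().launch_process(kernel_cmd, **kwargs)

    def _ensure_cluster(self, kwargs):
        # Placeholder: call a cloud or cluster-manager API here to create or
        # scale the environment before the kernel pod is requested.
        pass
```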

If the Jupyter stack were going to contain code that explicitly spins up clusters on behalf of requests, I would expect that kind of thing to be done via JupyterHub, and that may even be possible today.

You should also note that there's a pending Notebook PR that could disrupt some of this. However, the goal of that PR is to introduce kernel providers that essentially perform their own management operations. We will be working to ensure the pluggable proxy framework continues to work in that redesign by making the underlying process launcher/management abstract as well. I want to point you to that PR because you might have some interest in where that's heading.

lresende commented 5 years ago

> That's what I would like to achieve with the Jupyter stack. However, what I miss when using the Jupyter stack are the following:

> Scheduling notebooks on demand and, if possible, building a DAG for an end-to-end ETL/model execution using the Jupyter stack (possibly integrating notebooks with Airflow).

Indeed, a direction we are seeing a few folks go. I have started building a simple scheduler to run notebooks remotely using EG, but I have seen other, more sophisticated schedulers coming up, such as paperboy, and will try to get that integrated with EG.
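As one illustration of the building block a scheduler would wrap, papermill can execute a parameterized notebook programmatically; an Airflow task, cron job, or paperboy job could call something like the snippet below. Note this is not EG-specific: by default the kernel runs locally, and routing it through Enterprise Gateway to a remote cluster is a separate integration step that is not shown here. File names and parameters are made up for the example.

```python
# Minimal illustration of programmatic notebook execution, the kind of call a
# scheduler (cron, Airflow, paperboy, ...) would wrap. The notebook names and
# parameters are placeholders.
import papermill as pm

pm.execute_notebook(
    "etl_step.ipynb",            # input notebook (hypothetical)
    "etl_step-output.ipynb",     # executed copy with outputs
    parameters={"run_date": "2019-01-01"},
)
```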

> Spinning up clusters on demand, with the ability to attach notebooks to any running cluster (Databricks is limited to Spark clusters, but the same concept could be generalized to attach to any cluster connected through Jupyter Enterprise Gateway, e.g. Spark, TensorFlow, or Dask/Ray).

We do this today in a Kubernetes environment, where you can start with JupyterHub to bring up your Notebook environment and use EG to enable kernels to be started as independent pods. We have also created a notion of one-click submit to run notebooks on remote environments (see Scaling Interactive Workloads across Kubernetes Cluster).

> Databricks smartly uses spot instance capabilities on AWS to decrease the cost of infrastructure.

We don't have plans to manage cluster hardware/resources, but this can easily be done on a private/public cloud container environment.

ziedbouf commented 5 years ago

Closed this by mistake. For the second point, I think it would be great to have a clear idea of how to integrate third-party components for managing their clusters.

@lresende thanks for sharing the paperboy project, I am excited to give it a try.

lresende commented 5 years ago

For the second point, what I mean is the ability to proxy through Jupyter Enterprise Gateway to any running cluster.

Yes, today you can do that for existing clusters: you can define kernelspecs for multiple Spark/YARN, Kubernetes or Conductor (and other) clusters and start notebooks that will launch kernels on these environments. What we don't do today, and mostly don't plan to do, is add support for managing cluster/environment creation the way cloud vendors do.
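To illustrate, targeting a particular cluster is done per kernelspec: each kernel.json carries a process_proxy stanza naming the proxy class for that environment. The snippet below is a trimmed, illustrative example only; the launch-script path and argv placeholders are schematic, and the exact contents should be taken from the kernelspecs shipped with EG.

```json
{
  "display_name": "Python on Kubernetes",
  "language": "python",
  "metadata": {
    "process_proxy": {
      "class_name": "enterprise_gateway.services.processproxies.k8s.KubernetesProcessProxy"
    }
  },
  "argv": [
    "python",
    "/path/to/launch_kubernetes.py",
    "--RemoteProcessProxy.kernel-id", "{kernel_id}",
    "--RemoteProcessProxy.response-address", "{response_address}"
  ]
}
```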

kevin-bates commented 5 years ago

Great discussion. Does anyone have a reason to keep this open?

ziedbouf commented 5 years ago

From my end I think it's fine, we can close. I am reading more on the proposed solution by @lresende, and I am exploring the codebase to see how this can be used.