jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Exchange user specific files between EG kernel and Notebook container, and restrict access of files between Notebook containers. #676

Open ArindamHalder7 opened 5 years ago

ArindamHalder7 commented 5 years ago

Hi @lresende

I am creating a new issue in continuation of the discussion in #670.

Here is my scenario.

  1. I am running EG 1.2 and Spark 2.3.1.
  2. I want to run multiple Notebook containers on different systems/VMs, each connecting to EG over websockets.
  3. The EG kernels run on the host, not in Docker.
  4. Each user's Notebook container holds input files that are used by the program the user writes. The user will also generate output files that should be accessible only in that user's own Notebook container.

How can a user (Notebook container) work with their own input and output files when the EG kernels (cluster or non-cluster) are configured on a different system/VM? Also, how can I restrict access to user-specific files between multiple users in this scenario?

One more query: if a Notebook container installs a new Python package with `conda install` or `pip install`, will that package get installed on all the nodes of the clustered EG? I have not checked this yet, but I plan to test it later.

Let me know if you need more information on this.

kevin-bates commented 5 years ago

There needs to be a way to confine the kernel's "reach" to its own user.

In on-prem environments, this is typically accomplished via permissions, where Kerberos is used to perform impersonation via the KERNEL_USERNAME value. These tend to be Hadoop/YARN environments, so HDFS is the preferred mechanism for making files available to the kernels.
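For illustration, here's a minimal sketch of what that looks like from inside a kernel, assuming KERNEL_USERNAME is flowed into the kernel's environment, the `hdfs` CLI is on the node, and each user has an HDFS home directory (all paths and filenames here are hypothetical):

```python
# Sketch only: per-user file access from a kernel in a YARN/Kerberos env.
import os
import subprocess

user = os.environ.get("KERNEL_USERNAME", "jovyan")
input_path = f"/user/{user}/inputs/data.csv"      # hypothetical per-user input
output_path = f"/user/{user}/outputs/result.csv"  # hypothetical per-user output

# Pull the user's input file into the kernel's working directory.
subprocess.run(["hdfs", "dfs", "-get", input_path, "data.csv"], check=True)

# ... user code processes data.csv and writes result.csv ...

# Push the result back to the user's HDFS home. HDFS permissions on
# /user/<name> (plus Kerberos impersonation) are what keep other users out.
subprocess.run(["hdfs", "dfs", "-put", "-f", "result.csv", output_path], check=True)
```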

In container-based environments, this isolation is accomplished via containerized kernels coupled with user-specific mounts. In this case, you'd want to run EG in Docker (or Docker Swarm), and each kernel container would mount user-specific volumes that are also in use by that user's notebook container. This would likely require modifying the docker launcher to include the user-specific mounts.
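To make that concrete, here's a minimal sketch (not EG's actual launcher code) of adding a per-user volume when starting a kernel container with the Docker SDK for Python; the host directory layout and image tag are assumptions:

```python
# Sketch only: launching a kernel container with a per-user mount.
import os
import docker

client = docker.from_env()
kernel_username = os.environ.get("KERNEL_USERNAME", "jovyan")

# Hypothetical host directory that is also mounted into the same user's
# notebook container, so both sides see the same files.
user_dir = f"/mnt/user-data/{kernel_username}"

container = client.containers.run(
    image="elyra/kernel-py:1.2.0",  # example kernel image
    detach=True,
    environment={"KERNEL_USERNAME": kernel_username},
    # Only this user's directory is mounted, so other users' files
    # are simply not visible inside the kernel container.
    volumes={user_dir: {"bind": "/home/jovyan/work", "mode": "rw"}},
)
```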

The configuration you describe is kind of a hybrid approach. Since you want to use Spark, I recommend Kubernetes for complete containerization, since it supports Spark 2.4. If Kubernetes is not an option, then you might try YARN with Kerberos, although I'm not knowledgeable enough to tell you whether HDFS can be accessed from docker containers for the file sharing you need.
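On Kubernetes, the same per-user isolation could come from mounting a per-user PersistentVolumeClaim into the kernel pod. A minimal sketch with the official kubernetes Python client, assuming a PVC named `nb-<username>` already exists and is also mounted by that user's notebook pod (the claim name, image, and namespace are hypothetical):

```python
# Sketch only: a kernel pod that mounts one user's PVC and nothing else.
import os
from kubernetes import client, config

config.load_incluster_config()
user = os.environ.get("KERNEL_USERNAME", "jovyan")

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name=f"kernel-{user}"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="kernel",
            image="elyra/kernel-py:1.2.0",  # example kernel image
            volume_mounts=[client.V1VolumeMount(
                name="user-data", mount_path="/home/jovyan/work")],
        )],
        volumes=[client.V1Volume(
            name="user-data",
            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                claim_name=f"nb-{user}"))],  # hypothetical per-user claim
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="enterprise-gateway", body=pod)
```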

cc: @lresende @akchinSTC