jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

YARN as a resource manager #738

Open leinad87 opened 4 years ago

leinad87 commented 4 years ago

How should I configure Jupyter EG to run a Jupyter notebook on a YARN cluster? The idea is to be able to generate notebooks with limited resources, for example 5 cores and 8 GB, and let YARN select the best node in the cluster to run each notebook.

kevin-bates commented 4 years ago

Hi @leinad87 - thank you for your question.

What you really want is parameterized kernels, but we're not "there" yet since that requires coordination across the entire jupyter stack. In lieu of parameterized kernels, you'd need to create different kernelspecs for each combination of parameters.

There are two ways EG integrates with YARN: via Spark and via Dask (Python only).

For Spark, you'd configure your kernelspec (kernel.json) to include the appropriate Spark/YARN parameters in the SPARK_OPTS env entry, which gets passed to spark-submit.
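For instance, a kernelspec targeting the "5 cores / 8 GB" notebook from the original question could carry the driver limits in SPARK_OPTS. This is only a sketch modeled on the bundled spark_python_yarn_cluster spec; the SPARK_HOME path and the exact set of options are illustrative and will differ per installation:

```json
{
  "env": {
    "SPARK_HOME": "/usr/hdp/current/spark2-client",
    "SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --driver-memory 8g --driver-cores 5 --conf spark.yarn.submit.waitAppCompletion=false",
    "LAUNCH_OPTS": ""
  }
}
```

Until parameterized kernels exist, each resource combination (5 cores / 8 GB, 2 cores / 4 GB, etc.) needs its own kernelspec along these lines.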

For Dask, you'd configure your kernelspec (kernel.json) to include the appropriate Dask/YARN parameters, presumably in the DASK_OPTS env entry.

leinad87 commented 4 years ago

Thank you @kevin-bates. I tried the Dask option for Python deployments; however, it can't communicate back to the client. I'm using JupyterHub + Jupyter Enterprise Gateway.

Temporary solution: use PySpark on YARN in cluster mode, with the driver running in the cluster. Once the Spark context is initialized, stop it to free the executors' resources.

kevin-bates commented 4 years ago

Just to be clear, you're using jupyter hub + notebook server with the latter configured to hit a single enterprise gateway - is that correct?

> Once the Spark context is initialized, stop it to free the executors' resources.

I don't understand what this accomplishes. Could you please elaborate on what you're trying to do and how your 'temporary solution' is a solution at all?

leinad87 commented 4 years ago

I'm running JupyterHub for authentication purposes and as a single entry point for the users. JupyterHub uses the Gateway through the spawner command: `c.Spawner.cmd = ['jupyterhub-singleuser', '--gateway-url=http://127.0.0.1:29128', '--GatewayClient.http_user={username}']`. The Gateway should then deploy the notebooks on top of YARN. The idea is that YARN administers cluster resources for both PySpark and Python jobs.
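For reference, here is roughly how that spawner configuration looks in jupyterhub_config.py (a sketch of the setup described above; the Gateway host/port are just the values quoted here):

```python
# jupyterhub_config.py (sketch): spawn single-user servers that send all
# kernel traffic to a single Enterprise Gateway instead of running kernels locally.
c.Spawner.cmd = [
    'jupyterhub-singleuser',
    '--gateway-url=http://127.0.0.1:29128',   # Enterprise Gateway endpoint
    '--GatewayClient.http_user={username}',   # forward the hub user to EG
]
```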

Tricky part: the Python kernel is actually the PySpark environment, but once the Spark context is initialized, I close it to free resources.
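To illustrate the workaround (a sketch; it assumes the kernel was launched through one of EG's Spark-on-YARN kernelspecs, so the launcher has already created the `spark` session in the kernel's namespace):

```python
# First cell of the notebook in the workaround described above.
# The kernel itself was submitted to YARN via spark-submit in cluster mode;
# stopping the pre-created SparkSession releases the executors' resources
# while the Python kernel keeps running inside the YARN application.
spark.stop()
```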

kevin-bates commented 4 years ago

I see. So you're using Spark/YARN to "distribute" the kernel into the cluster, but don't really need Spark, so you close the Spark context and continue running the kernel in YARN - interesting.

Yeah, it seems like you really want to launch using dask-yarn via the dask_python_yarn_remote kernelspec.

Have you tried looking into the application logs to see what kinds of issues (if any) are being reported?

kevin-bates commented 4 years ago

Sorry, I should have mentioned this previously. If you don't need a Spark context, set your initialization mode to none.
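For reference, in EG's bundled Spark-on-YARN kernelspecs that setting is passed on the launcher's command line in kernel.json, roughly as below. This is a sketch; the run.sh path and the rest of the argv vary by release:

```json
{
  "argv": [
    "/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh",
    "--RemoteProcessProxy.kernel-id", "{kernel_id}",
    "--RemoteProcessProxy.response-address", "{response_address}",
    "--RemoteProcessProxy.spark-context-initialization-mode", "none"
  ]
}
```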

leinad87 commented 4 years ago

Thank you @kevin-bates for your support! The problem with the initialization mode set to none is that the application status in YARN stays in ACCEPTED and, after a while, the application is killed.

kevin-bates commented 4 years ago

Oh yeah, that would be true. That slipped my mind. Sorry about that.

What you really want is the Dask kernelspec. Have you spent time troubleshooting that? It doesn't get a lot of use, and I wonder if some recent changes broke something that is easily fixed.

leinad87 commented 4 years ago

Not yet; I'm stuck with impersonation and JupyterHub because the native Python kernel runs as root.