jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Docker image for sparkmagic #744

Closed praveen-kanamarlapudi closed 2 years ago

praveen-kanamarlapudi commented 5 years ago

Description

Hi,

We are using enterprise kernel gateway with kubernetes cluster. Is there a docker image for sparkmagic?

Environment

lresende commented 5 years ago

You should be able to use the IPython + Spark support to have a secure connection to Yarn/Spark without requiring sparkmagic.

More details in our YARN Cluster mode documentation.

praveen-kanamarlapudi commented 5 years ago

Thanks for the details @lresende .

Is it possible to set up two different kernel gateway endpoints for different kernels, e.g., Spark-related kernels in YARN cluster mode and all others on Kubernetes, within a single Jupyter deployment?

lresende commented 5 years ago

Having the same Jupyter instance connect to two gateways is not possible, as the --gateway-url is a global parameter.

Having said that, why would you need two gateway endpoints? You can cover this with one EG configured with different kernelspecs: one kernelspec would launch kernels in Spark on Kubernetes, while selecting a vanilla Python kernel would simply create a new pod for that kernel.

Trying to integrate Kubernetes and non-Kubernetes components causes a lot of headaches around network access and communications, which is why we don't currently support this kind of hybrid environment integration.

Would that work for your scenario?
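The single-EG, multiple-kernelspec setup described above could be sketched as follows. The directory names and display names below mirror the style of the kernelspecs EG ships for Kubernetes, but treat the details here as illustrative assumptions rather than the exact distributed specs.

```python
import json

# Hypothetical kernelspec definitions serving both workloads from a
# single EG deployment. Both use the Kubernetes process proxy; the
# Spark-on-Kubernetes spec would additionally carry Spark launch args.
kernelspecs = {
    "spark_python_kubernetes": {
        "display_name": "Spark - Python (Kubernetes Mode)",
        "language": "python",
        "metadata": {
            "process_proxy": {
                "class_name": "enterprise_gateway.services.processproxies.k8s.KubernetesProcessProxy"
            }
        },
    },
    "python_kubernetes": {
        "display_name": "Python on Kubernetes",
        "language": "python",
        "metadata": {
            "process_proxy": {
                "class_name": "enterprise_gateway.services.processproxies.k8s.KubernetesProcessProxy"
            }
        },
    },
}

# Each spec would live at <kernelspec-path>/<name>/kernel.json.
for name, spec in kernelspecs.items():
    print(name, "->", spec["display_name"])
```

With both specs installed under EG's kernelspec path, the notebook user simply picks one kernel or the other from the same gateway endpoint.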

praveen-kanamarlapudi commented 5 years ago

Is it possible to have spark on Kubernetes but access data from existing yarn clusters?

kevin-bates commented 5 years ago

You'd have to make the data available to the kernel pod(s) - probably via some kind of persistent volume.

Seems like you'd be better off just having two EG installations - one on your YARN cluster, another in your k8s cluster. Do the kernels of those two environments need to share data?

praveen-kanamarlapudi commented 5 years ago

Yes, kernels in both environments need to share data.

kevin-bates commented 5 years ago

Ok. I think you'll need to research filesystems that have a presence in both of these domains (Kubernetes and YARN). For example, I believe HDFS is the de facto remote filesystem for YARN; are there persistent volume examples for HDFS in Kubernetes?

I would recommend starting with an approach that already works in one environment (probably YARN), then look into the other. Remember that in K8s, each kernelspec area has a pod template that can be customized with things like PVs. Some portion of that template will need to be parameterized with things like a user identifier, and those parameters can be conveyed for a kernel via environment variables prefixed with KERNEL_, which flow from the client, through EG, to the kernel launch (and into the image). They're then available as template parameters when instantiating the pod template.
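The KERNEL_-prefixed parameter flow described above could look roughly like this. The template text, placeholder names, and PVC naming scheme are assumptions for illustration; only the KERNEL_ prefix convention comes from EG itself.

```python
from string import Template

# Hypothetical pod template with per-user placeholders, standing in for
# the kernelspec's customizable pod template.
pod_template = Template("""\
apiVersion: v1
kind: Pod
metadata:
  name: kernel-$kernel_username
spec:
  volumes:
    - name: shared-data
      persistentVolumeClaim:
        claimName: pvc-$kernel_username
""")

# Values like these arrive from the client, flow through EG, and reach
# the kernel launch as environment variables.
client_env = {"KERNEL_USERNAME": "alice", "KERNEL_ID": "abc-123"}

# Lower-case the KERNEL_-prefixed variables so they can fill the
# template placeholders when the pod spec is instantiated.
params = {k.lower(): v for k, v in client_env.items() if k.startswith("KERNEL_")}
rendered = pod_template.substitute(params)
print(rendered)
```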

praveen-kanamarlapudi commented 5 years ago

Thank you @kevin-bates for the info.

I am trying to run some Spark code (to do ETL) and feed the data to a TensorFlow CPU/GPU kernel.

I think searching for a persistent volume that can connect to both systems might help. I will try this option (I am not sure whether it's simple to add a new persistent volume to existing YARN clusters).

I am also trying to see whether a PySpark kernel can coexist with the k8s cluster and access data from existing Spark clusters. Is it easy to integrate sparkmagic with EG?
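The ETL-to-training handoff described above could be sketched as below: the Spark kernel writes prepared data to a mount point visible from both environments, and the TensorFlow kernel reads it back. The temporary directory here is a stand-in for whatever shared filesystem (HDFS-backed PV, NFS, etc.) ends up being chosen.

```python
import json
import pathlib
import tempfile

# Stand-in for a shared mount such as /mnt/shared visible to both
# the ETL kernel and the training kernel.
shared_mount = pathlib.Path(tempfile.mkdtemp())

# "ETL" side: persist features produced by the Spark job.
features = [{"x": 1.0, "label": 0}, {"x": 2.0, "label": 1}]
(shared_mount / "features.json").write_text(json.dumps(features))

# "Training" side: the other kernel loads the same file from its mount.
loaded = json.loads((shared_mount / "features.json").read_text())
print(f"loaded {len(loaded)} rows")
```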

kevin-bates commented 5 years ago

@praveenkanamarlapudi - I don't have any experience with sparkmagic. If it essentially behaves like a kernel, you'd probably need to implement a kernel launcher for it if it needs to run in the cluster rather than on the gateway node. However, my (basic) understanding is that Livy performs the remoting for you, so I'm not sure how useful EG would be.

Persistent volume terminology is Kubernetes-specific. You must already be using some kind of data storage on YARN - which I previously assumed would be HDFS. I wouldn't try to introduce a new mechanism into a working solution; rather, see if that working mechanism can be utilized in the environment that requires a solution (presumably k8s).

cc: @lresende @akchinSTC for any Livy/sparkmagic knowledge/insights

kevin-bates commented 2 years ago

@praveenkanamarlapudi - it might be interesting to discuss your sparkmagic/Livy needs with @rahul26goyal. Rahul and I have talked about these aspects, and it seems like Livy and EG are trying to solve similar issues in different ways.

I'm going to close this issue for now (we can revisit that decision as necessary), but wanted to connect you with Rahul in case you two can collaborate on this. Thanks for opening the issue.