oap-project / raydp

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.
Apache License 2.0

Building Docker image to run Spark on Ray with RAPIDS #330

Open chenya-zhang opened 1 year ago

chenya-zhang commented 1 year ago

Hey there!

We are trying to experiment with Spark on Ray plus RAPIDS, but we are not sure whether Spark on Ray can support this case.

Here is the example Dockerfile for spark-rapids k8s setup: https://nvidia.github.io/spark-rapids/docs/get-started/Dockerfile.cuda

In that Dockerfile, we find the following commands that copy items from spark/:

COPY spark/jars /opt/spark/jars
COPY spark/bin /opt/spark/bin
COPY spark/sbin /opt/spark/sbin
COPY spark/kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY spark/examples /opt/spark/examples
COPY spark/kubernetes/tests /opt/spark/tests
COPY spark/data /opt/spark/data

After running pip install raydp-nightly, we can find a pyspark/ directory. It contains the following:

/anaconda3/lib/python3.7/site-packages/pyspark# ls
__init__.py      context.py                java_gateway.py  rdd.py             statcounter.py
__pycache__      daemon.py                 join.py          rddsampler.py      status.py
_globals.py      data                      licenses         resource           storagelevel.py
_typing.pyi      examples                  ml               resultiterable.py  streaming
accumulators.py  files.py                  mllib            sbin               taskcontext.py
bin              find_spark_home.py        pandas           serializers.py     traceback_utils.py
broadcast.py     install.py                profiler.py      shell.py           util.py
cloudpickle      instrumentation_utils.py  py.typed         shuffle.py         version.py
conf.py          jars                      python           sql                worker.py

In this case: 1) will there be concerns if we instead COPY pyspark/jars /opt/pyspark/jars, or set SPARK_HOME to the existing .../pyspark installed by RayDP? 2) There is no kubernetes/dockerfiles/spark/entrypoint.sh or kubernetes/tests under pyspark/; I think they may not be required if we are able to launch Spark with RayDP on k8s.
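
For reference, here is a quick way to locate the Spark home that ships with the pip-installed pyspark (a minimal sketch using pyspark's bundled find_spark_home.py helper shown in the listing above; the printed path is only an example and depends on the environment):

# Minimal sketch: locate the Spark home bundled with the pip-installed pyspark,
# so SPARK_HOME (or the Dockerfile COPY paths) could point at it.
from pyspark.find_spark_home import _find_spark_home  # internal helper, see find_spark_home.py above

spark_home = _find_spark_home()
print(spark_home)  # e.g. .../site-packages/pyspark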

Any suggestions or pointers would be very helpful, thanks!

carsonwang commented 1 year ago

I suggest starting with a Ray image, installing RayDP and PySpark in it, and making sure Spark on Ray works on your K8s cluster first. Then you can try running with other Spark plugins. If the plugin works by setting Spark configurations to include the jar, you can set them in RayDP when you init Spark, e.g. raydp.init_spark(..., configs={"key": "value"}).
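
For example, based on the spark-rapids getting-started docs it could look roughly like the snippet below (an untested sketch; the jar path, <version> placeholder, and executor sizes are assumptions you would need to adapt to your image):

import ray
import raydp

ray.init(address="auto")

# Untested sketch: the config keys follow the spark-rapids docs; the jar location
# and <version> placeholder must match whatever is actually baked into the image.
spark = raydp.init_spark(
    app_name="raydp-rapids-test",
    num_executors=2,
    executor_cores=4,
    executor_memory="8GB",
    configs={
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        "spark.jars": "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-<version>.jar",
    },
)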

However, we have no experience with spark-rapids. If it doesn't work, you may also want to check with the rapids team.

sameerz commented 1 year ago

Related discussion in https://github.com/NVIDIA/spark-rapids/discussions/8062