project-codeflare / codeflare

Simplifying the definition and execution, scaling and deployment of pipelines on the cloud.
https://codeflare.dev
Apache License 2.0
218 stars 35 forks source link

Support better integration between Ray and Spark in passing ObjectRef without actually moving data #32

Open klwuibm opened 3 years ago

klwuibm commented 3 years ago

Overview

As a Codeflare user, I want to use Ray and Spark alternately to execute my end-to-end ML jobs. Some steps might be executed more efficiently using Ray, while others using Spark. The plasma store in Ray seems to provide an efficient way to share ObjectRef between Ray and Spark. Currently, RayDP project supports from Spark to Ray in some limited way, by running Spark as a Ray actor. However, ObjectRef cannot be shared easily in both directions, Spark-2-Ray and Ray-2-Spark.

Acceptance Criteria

Questions

Assumptions

Reference

[Reference] I have opened an issue on the RayDP repo: https://github.com/oap-project/raydp/issues/164

raghukiran1224 commented 3 years ago

@klwuibm Suggest that you fill in the rest of the issue template? :)

raghukiran1224 commented 3 years ago

Thanks @klwuibm !

klwuibm commented 3 years ago

This feature can be supported via the Ray Datasets (currently on alpha with some missing methods, such as ray.data.from_spark() and ds.to_spark()). For example, to exchange data from Spark to Pandas, one can do ds = ray.data.from_spark() followed by pdf = ds.to_pandas(). Similarly, from Pandas to Spark, one can do ds = ray.data.from_pandas() followed by sdf = ds.to_spark().