oap-project / raydp

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.
Apache License 2.0
290 stars 66 forks

Performance benchmark on RayDP vs. Spark #340

Open chenya-zhang opened 1 year ago

chenya-zhang commented 1 year ago

Hi there,

In the talk "RayDP: Build Large-scale End-to-end Data Analytics and AI Pipelines Using Spark and Ray" https://youtu.be/ELSrR1Geqg4?t=819, @carsonwang mentioned that RayDP would have better performance.

We are curious which types of queries / workflows you ran and what your analysis of the performance differences is.

Thanks a lot!

carsonwang commented 1 year ago

Hi @chenya-zhang, there is a plan to integrate RayDP with Gluten, which offloads SQL operations to a native engine such as Velox. For TPC-H-like or TPC-DS-like benchmarks, we observed more than a 2x speedup. You can find more details in the Gluten project: https://github.com/oap-project/gluten.

We are also running RayDP + XGBoost on Ray workflows and have observed a performance advantage over running XGBoost on Spark. We will share more once the data is ready to publish.

rishabh-dream11 commented 3 months ago

Hi @carsonwang, can you please share the performance benchmark numbers for Ray + XGBoost vs. XGBoost on Spark?

rishabh-dream11 commented 3 months ago

@carsonwang Did the plan to integrate RayDP with Gluten materialize?