intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0
18 stars 4 forks source link

[RayOnSpark] How to use a spark dataframe on ray? #12

Closed Bowen0729 closed 2 years ago

Bowen0729 commented 2 years ago

I start a ray cluster on spark and read a table via spark, but how can I use this dataframe using ray? can ray read spark dataframe directly?

hkvision commented 2 years ago

Hi @Bowen0729

We have done some implementation to put Spark dataframe into ray's object store based on locality (see some of the code here: https://github.com/intel-analytics/analytics-zoo/blob/master/pyzoo/zoo/orca/data/ray_xshards.py#L356). And we use this implementation in our Estimators for distributed PyTorch and TensorFlow training based on ray, which directly accepts spark dataframes. You may take a look or you can tell us what you want to do after reading spark dataframe into ray and we will try to help.

Bowen0729 commented 2 years ago

Thank you for your reply.

My point is that I want to start a ray cluster on bigdata cluster, and use some ray api, such as ray_df.to_dask, and then I can transform the spark dataframe to dask dataframe. Seems a little bit weird, becuase it is not the goal of the RayOnSpark, but I can't find other ways at present。

Bowen0729 commented 2 years ago

raydp can take care of my needs, and I try to start ray cluster on yarn via skein, but there are some issues because of the ray cannot able to read jars on hdfs, see https://github.com/oap-project/raydp/issues/199