Closed: Bowen0729 closed this issue 2 years ago
Hi @Bowen0729
We have implemented functionality to put Spark DataFrames into Ray's object store based on locality (see some of the code here: https://github.com/intel-analytics/analytics-zoo/blob/master/pyzoo/zoo/orca/data/ray_xshards.py#L356). We use this implementation in our Estimators for distributed PyTorch and TensorFlow training on Ray, which accept Spark DataFrames directly. You may take a look, or you can tell us what you want to do after reading a Spark DataFrame into Ray, and we will try to help.
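To make the idea concrete, here is a minimal, driver-side sketch of putting Spark DataFrame partitions into Ray's object store. It is not the locality-aware implementation in ray_xshards (which keeps partitions distributed and pushes them to Ray processes co-located with the Spark executors); it simply collects each partition as a pandas DataFrame on the driver and calls ray.put, so it is only suitable for small data, and the HDFS path is a placeholder:

```python
import pandas as pd
import ray
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
ray.init(address="auto", ignore_reinit_error=True)  # connect to the running Ray cluster

df = spark.read.parquet("hdfs:///path/to/table")  # placeholder path

# Collect each Spark partition as a pandas DataFrame on the driver.
# (The real ray_xshards implementation keeps this step distributed and locality-aware.)
columns = df.columns
partitions = df.rdd.mapPartitions(
    lambda rows: [pd.DataFrame([r.asDict() for r in rows], columns=columns)]
).collect()

# Put each partition into Ray's object store; downstream Ray tasks or actors
# can then consume the partitions through these object refs.
refs = [ray.put(p) for p in partitions]
print(ray.get(refs[0]).head())
```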
Thank you for your reply.
My point is that I want to start a Ray cluster on a big data cluster and use some Ray APIs, such as ray_df.to_dask, so that I can transform the Spark DataFrame into a Dask DataFrame. It seems a little weird because that is not the goal of RayOnSpark, but I can't find another way at present.
RayDP can take care of my needs, and I tried to start a Ray cluster on YARN via Skein, but there are some issues because Ray cannot read jars on HDFS, see https://github.com/oap-project/raydp/issues/199
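What I want to end up with via RayDP, once the jar issue is resolved, is roughly the following (a sketch, assuming a Ray version that ships Ray Datasets with ray.data.from_spark and that Dask is installed; the table path and resource sizes are placeholders):

```python
import ray
import raydp

ray.init(address="auto")  # connect to the Ray cluster started on YARN

# Start Spark on top of Ray via RayDP; resource sizes are placeholders.
spark = raydp.init_spark(
    app_name="spark_to_dask",
    num_executors=2,
    executor_cores=2,
    executor_memory="4g",
)

sdf = spark.read.parquet("hdfs:///path/to/table")  # placeholder path

# Spark DataFrame -> Ray Dataset -> Dask DataFrame.
ray_ds = ray.data.from_spark(sdf)
dask_df = ray_ds.to_dask()
print(dask_df.head())

raydp.stop_spark()
```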
I started a Ray cluster on Spark and read a table via Spark, but how can I use this DataFrame with Ray? Can Ray read a Spark DataFrame directly?
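For context, my setup is RayOnSpark started through the Orca context, roughly like this (a sketch; the keyword names follow the Analytics Zoo Orca API I was reading and may differ between versions, and the table name is a placeholder):

```python
from zoo.orca import init_orca_context, stop_orca_context
from pyspark.sql import SparkSession

# Start Spark on YARN and launch Ray on top of it (RayOnSpark).
# Resource numbers are placeholders; keyword names are assumptions
# based on the Analytics Zoo Orca API of that time.
sc = init_orca_context(
    cluster_mode="yarn-client",
    cores=4,
    memory="8g",
    num_nodes=2,
    init_ray_on_spark=True,
)

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("my_db.my_table")  # placeholder table name

# ... this is the DataFrame I would like to hand over to Ray ...

stop_orca_context()
```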