ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.11k stars 5.6k forks source link

[Feature] [Xlang] Arrow zerocopy deserialization #20241

Closed kira-lin closed 2 years ago

kira-lin commented 2 years ago

Search before asking

Description

Ray dataset uses Arrow as data format. In python, Arrow data is serialized by pickle5, which supports zerocopy deserialization. But workers in other languages cannot read it. If users want to pass Arrow data across language, they would have to use bytes format, with which zerocopy deserialization is not possible.

Arrow IPC format supports zerocopy deserialization, and is language-agnostic. Using this format can enable xlang zerocopy deserialization of Arrow data.

Use case

In RayDP, spark dataframes can be converted to a ray dataset. It is first converted to Arrow format in java, and then written into ray object store. Currently, we have to use bytes format when writing to ray, and that incurs a copy in deserialization in python. With this feature, copy is no longer needed and read performance is better.

Related issues

No response

Are you willing to submit a PR?

jovany-wang commented 2 years ago

Nice proposal.

stale[bot] commented 2 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 2 years ago

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!