ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Data] ArrowVariableShapedTensorArray with LargeListArray #46434

Open vipese-idoven opened 1 month ago

vipese-idoven commented 1 month ago

Description

The current implementation only allows creating ArrowVariableShapedTensorArray objects with at most (2^31)-1 elements, because it uses PyArrow's ListArray (ray.air.util.tensor_extensions.arrow, L812), which uses 32-bit offsets for indexing. As a result, storing data such as long time series, which can contain more elements than a 32-bit offset can address, causes an overflow.
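To make the limit concrete, here is a minimal sketch in plain Python (no Ray or PyArrow calls) that checks whether a batch of tensor shapes still fits under 32-bit list offsets. The helper names `fits_in_int32_offsets` and `max_duration_seconds` are hypothetical and only illustrate the arithmetic. The sketch assumes the offsets are cumulative over all flattened tensors in one array, as they are for an Arrow list.

```python
from math import prod

INT32_MAX = 2**31 - 1  # largest offset a 32-bit ListArray can address
INT64_MAX = 2**63 - 1  # largest offset a 64-bit LargeListArray can address


def fits_in_int32_offsets(shapes) -> bool:
    """Return True if the total element count of all tensors fits in int32 offsets.

    Hypothetical helper: list offsets are cumulative, so the limit applies to
    the sum of all flattened tensor sizes stored in a single array.
    """
    total = sum(prod(shape) for shape in shapes)
    return total <= INT32_MAX


def max_duration_seconds(sample_rate_hz: int, channels: int = 1,
                         max_offset: int = INT32_MAX) -> float:
    """Longest recording whose flattened samples stay within max_offset."""
    return max_offset / (sample_rate_hz * channels)


# Two short stereo clips fit easily; a single 2**31-sample signal does not.
short_ok = fits_in_int32_offsets([(2, 48_000), (2, 96_000)])
long_ok = fits_in_int32_offsets([(2**31,)])
mono_48k_limit = max_duration_seconds(48_000)
```

Under these assumptions, a single array of mono audio at 48 kHz reaches the 32-bit limit after about 12.4 hours of samples in total, and multi-channel or higher-rate signals reach it proportionally sooner.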

Replacing ListArray with PyArrow's LargeListArray would allow storing arrays with up to (2^63)-1 elements. (Note: this would also require changing OFFSET_DTYPE in L722.)

Use case

The goal is to be able to store long time series in Arrow format (such as long audio recordings, or audio with high sampling rates).

anyscalesam commented 1 month ago

@vipese-idoven what is your use case for this; are you looking for batch processing on those audio files?

vipese-idoven commented 1 month ago

> @vipese-idoven what is your use case for this; are you looking for batch processing on those audio files?

I've used Ray Data for batch processing, and the resulting signals can be very long. I was hoping to store the pre-processed data in Arrow format for later segmentation and classification, to avoid pre-processing it again (or doing it on the fly).

scottjlee commented 1 month ago

There is a WIP PR from an external contributor, but it had to be reverted due to some failing release tests.

vipese-idoven commented 1 month ago

> There is a WIP PR from an external contributor, but had to be reverted due to some failing release tests.

Awesome! Happy to help if need be

anyscalesam commented 1 month ago

@vipese-idoven lovely - can you take a look at the PR and failing release test? @terraflops1048576 is the PR author so please connect with him as required.

vipese-idoven commented 1 month ago

Will do! @terraflops1048576 any chance we can connect offline and discuss this?

terraflops1048576 commented 2 weeks ago

Sorry, I haven't kept up with this project. I don't think there's much to discuss here -- though I'm open to answering questions about the codebase as I understand it (but you probably will get more accurate answers from the Anyscale people here).

The basic problem is that changing the ArrowTensorArray to use Arrow LargeLists is a relatively trivial matter -- just changing the type to pyarrow.large_list and changing a constant here or there. However, this breaks the release tests because the release tests use data stored in the old format, which is incompatible with the new format.

So a "proper" fix has to include the above changes along with a way to automatically convert or otherwise parse the old format. One way to do this might be to add an ArrowLargeTensorArray, use it by default everywhere, and write a bunch of conversion code, but this seems like it would make the code very messy.