ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.14k stars 5.8k forks

Ray Data #48043

Open tmbdev opened 1 month ago

tmbdev commented 1 month ago

Description

Ray Data currently supports gzip compression for shards.

It would be nice if it also supported lz4. While lz4 gives lower compression ratios than gzip, it is several times faster at compressing and decompressing text.

Use case

I'm trying to maximize speed for I/O of very large text datasets.

Superskyyy commented 1 month ago

This would benefit from a generic interface for plugging in compression algorithms. And how about Zstd?

pcmoritz commented 1 month ago

Where is it missing at the moment? If you look at https://docs.ray.io/en/latest/data/loading-data.html#handling-compressed-files and https://arrow.apache.org/docs/python/generated/pyarrow.CompressedInputStream.html, it has both lz4 and zstd there :)
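
For reference, a minimal sketch of what those two pages describe: the Ray Data docs pass `arrow_open_stream_args={"compression": ...}` to the read call, and the codec names are the ones listed in the pyarrow `CompressedInputStream` docs. The helper names and the example path are hypothetical, and I haven't benchmarked this on large text shards.

```python
# Codec names as listed in the pyarrow CompressedInputStream documentation.
ARROW_STREAM_CODECS = {"bz2", "brotli", "gzip", "lz4", "zstd"}


def compressed_read_args(codec: str) -> dict:
    """Build keyword arguments for a Ray Data read call on compressed files.

    Hypothetical helper: it only validates the codec name and wraps it in
    the `arrow_open_stream_args` dict that the Ray Data docs show for gzip.
    """
    if codec not in ARROW_STREAM_CODECS:
        raise ValueError(f"unsupported codec: {codec!r}")
    return {"arrow_open_stream_args": {"compression": codec}}


def read_compressed_text(path: str, codec: str = "lz4"):
    """Read compressed text shards into a Ray Dataset (one row per line)."""
    import ray  # imported lazily so the helper above works without Ray

    return ray.data.read_text(path, **compressed_read_args(codec))


# Usage (assumed path, not a real dataset):
# ds = read_compressed_text("s3://my-bucket/shards/part-0000.txt.lz4")
```

Swapping `"lz4"` for `"zstd"` should cover the Zstd question as well, since both go through the same Arrow stream option.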