ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.14k stars 5.8k forks source link

[Data] Repartition dataset by key(s) #42228

Open bdewilde opened 10 months ago

bdewilde commented 10 months ago

Description

For some use cases, it would be nice to be able to repartition a dataset by column(s) values, specified as key(s), such that all the rows in a given block would have the same values. This would echo the API for, say, Dataset.sort() or Dataset.groupby().

Use case

Just as repartitioning by n (number of partitions) is recommended as a way to control the splitting of files written to disk, so too would repartitioning by key(s). When combined with a custom filename provider, it would be in the same ballpark as writing data to disk in partitioned directories, as discussed here: https://github.com/ray-project/ray/issues/24879

It's possible to do ^ via Dataset.groupby().map_groups(), but it's clumsy, since map groups requires returned batches and the creation of a new dataset, while writing to disk does not.

wingkitlee0 commented 10 months ago

I am currently working on something similar, but focusing on reading partitioned dataset

Previously I also asked this #41378 : in my case, I would like to read a partitioned dataset, apply map_batch(.., batch_size=None) to each file, and save them (keeping the partition as how they were read). So technically no sort involved.