[Data] Repartition dataset by key(s)

Description

For some use cases, it would be nice to be able to repartition a dataset by column(s) values, specified as key(s), such that all the rows in a given block would have the same values. This would echo the API for, say, Dataset.sort() or Dataset.groupby().

Use case

Just as repartitioning by n (number of partitions) is recommended as a way to control the splitting of files written to disk, so too would repartitioning by key(s). When combined with a custom filename provider, it would be in the same ballpark as writing data to disk in partitioned directories, as discussed here: https://github.com/ray-project/ray/issues/24879

It's possible to do ^ via Dataset.groupby().map_groups(), but it's clumsy, since map groups requires returned batches and the creation of a new dataset, while writing to disk does not.

ray-project / ray

[Data] Repartition dataset by key(s) #42228

Description

Use case