Open bdewilde opened 10 months ago
I am currently working on something similar, but focusing on reading partitioned dataset
Previously I also asked this #41378 : in my case, I would like to read a partitioned dataset, apply map_batch(.., batch_size=None) to each file, and save them (keeping the partition as how they were read). So technically no sort involved.
Description
For some use cases, it would be nice to be able to repartition a dataset by column(s) values, specified as key(s), such that all the rows in a given block would have the same values. This would echo the API for, say,
Dataset.sort()
orDataset.groupby()
.Use case
Just as repartitioning by
n
(number of partitions) is recommended as a way to control the splitting of files written to disk, so too would repartitioning by key(s). When combined with a custom filename provider, it would be in the same ballpark as writing data to disk in partitioned directories, as discussed here: https://github.com/ray-project/ray/issues/24879It's possible to do ^ via
Dataset.groupby().map_groups()
, but it's clumsy, since map groups requires returned batches and the creation of a new dataset, while writing to disk does not.