Open terraflops1048576 opened 1 month ago
Is writing a partitioned dataset the primary use case? If so, this may be related to #42228 and #42288.
If I understand your example correctly, writing a dataset into groups (or partitions) would look like:

```python
for ds_by_group in ds.split_by_key("group_key"):
    ds_by_group.write_parquet("target")
```
Afaik, the Ray Data write_* APIs are blocking in the current design.
In #42288, I proposed a solution by aligning the blocks with keys. In that case,

```python
ds.repartition_by_key("group_key").write_parquet("target")
```

will write N_group files (one file per group).
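To make the intended semantics concrete, here is a plain-Python sketch of what `repartition_by_key` would do (the function name and the list-of-dicts stand-in for a Dataset are hypothetical; this is not the Ray Data API): rows are partitioned so that each output block holds exactly one key value, so one file per group falls out naturally.

```python
from collections import defaultdict

def repartition_by_key(rows, key):
    """Partition rows so each output block contains exactly one key value.

    Plain-Python sketch of the proposed repartition_by_key semantics;
    in Ray Data each block would then be written as its own file.
    """
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key]].append(row)
    return list(blocks.values())

rows = [
    {"group_key": "g1", "x": 1},
    {"group_key": "g2", "x": 2},
    {"group_key": "g1", "x": 3},
]
blocks = repartition_by_key(rows, "group_key")
# One block (and hence one output file) per distinct key.
```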
Well, I would also like the ability to treat each group as a separate Dataset, for example applying map_batches to each, but yes, ultimately the use case I'm targeting right now is writing.
To add to this, I am looking for a similar feature for general data processing (not necessarily model training): streaming the data to each node by group (single or multiple keys). So, imagine I have groups g1, g2, g3, g4, g5 (each holding all bank data of a given user). I want to process each user independently (or even send some groups together efficiently), and send those groups to each node in one (or both) of the following ways:
Assuming I have 1 node for each group, I can just do something like `shards = ds.streaming_split(key="KeyColumn", n=5)`, and each node would receive an iterable with a single group.
If you don't have enough nodes (say n=3) for each group, maybe send (g1, g2) to shard 1, (g3) to shard 2, and (g4, g5) to shard 3, split roughly according to the number of rows.
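The "fewer shards than groups" case above can be sketched as a greedy balanced assignment: place the largest remaining group onto the currently lightest shard. The function and variable names below are hypothetical illustrations, not part of Ray Data.

```python
import heapq

def assign_groups_to_shards(group_sizes, n):
    """Greedily pack groups onto n shards, balancing total row count.

    group_sizes: dict mapping group name -> row count.
    Returns a list of n lists of group names. Sketch only; not the Ray API.
    """
    # Min-heap of (rows_assigned_so_far, shard_index).
    heap = [(0, i) for i in range(n)]
    heapq.heapify(heap)
    shards = [[] for _ in range(n)]
    # Largest groups first, each onto the currently lightest shard.
    for group, size in sorted(group_sizes.items(), key=lambda kv: -kv[1]):
        rows, idx = heapq.heappop(heap)
        shards[idx].append(group)
        heapq.heappush(heap, (rows + size, idx))
    return shards

sizes = {"g1": 100, "g2": 80, "g3": 120, "g4": 60, "g5": 40}
shards = assign_groups_to_shards(sizes, 3)
print(shards)
```

This is the classic greedy heuristic for makespan balancing; a real implementation would work on block metadata rather than exact row counts.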
Description
Allow `ray.data.Dataset` to be grouped and then split into separate Datasets by a column value. In particular, `ray.data.Dataset` should have a `split_by_key` function that splits the Dataset into a dict or list of separate Datasets based on a particular column value. This is basically the groupby of pandas.

Use case
Currently, the use case is taking a Ray Dataset and splitting it into shards by some column value, in order to write them to separate files using a Ray Datasink. This is not possible today, because the groupby operation only returns a `GroupedData`, from which you have to use `map_groups`. The current workaround is to write custom file-writing logic inside `map_groups` and call `materialize()` on the resulting Dataset, which is not a good API and prevents other use cases, such as passing different Datasets to different workers.