rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.45k stars 903 forks source link

[BUG] split/slice APIs do not align with partitioning APIs #4607

Open jrhemstad opened 4 years ago

jrhemstad commented 4 years ago

Describe the bug

Partitioning APIs that partition a table into n partitions, like hash_partition or round_robin_partition, return a single table and a vector of n+1 offsets that points to the beginning of each partition and where the size of any partition i can be determined by offsets[i+1] - offsets[i].

For example:

partitioned_table = {7}, {}, {3, 8, 9}, {42};
offsets = [0, 1, 1, 4, 5]

I would expect to be able to trivially pass the output of a partitioning API into an API like split or slice in order to get a vector of zero-copy table_views for each partition.

However, this is not possible because the expected inputs for split or slice are incompatible with the offsets vector returned from a partitioning API.

slice expects a vector of index pairs:

 input:   [{10, 12, 14, 16, 18, 20, 22, 24, 26, 28},
           {50, 52, 54, 56, 58, 60, 62, 64, 66, 68}]
 indices: {1, 3, 5, 9, 2, 4, 8, 8}
 output:  [{{12, 14}, {20, 22, 24, 26}, {14, 16}, {}},
           {{52, 54}, {60, 22, 24, 26}, {14, 16}, {}}]

split expects a vector of the split points:

 input:   {10, 12, 14, 16, 18, 20, 22, 24, 26, 28}
 splits:  {2, 5, 9}
 output:  {{10, 12}, {14, 16, 18}, {20, 22, 24, 26}, {28}}

Neither of these are trivially compatible with the output of a partitioning API.

split is the closest. You can obtain the splits vector from the offsets vector by dropping the first and last element from offsets. However, that is inconvenient.

Expected behavior

There should be an API that allows naively passing in the vector of offsets from a partitioning API and it returns a vector of zero-copy views for each partition.

harrism commented 4 years ago

Agree.

jrhemstad commented 4 years ago

Agree.

New API or change split?

harrism commented 4 years ago

There should be an API that allows naively passing in the vector of offsets from a partitioning API and it returns a vector of zero-copy views for each partition.

Was agreeing with your final statement, which didn't specify a choice. :) I would change split. Doing so would also make split slightly more versatile -- e.g. it could be used to skip the beginning and/or end of a table when splitting.

but of course we need to check for any existing users of split before we change it...

harrism commented 4 years ago

Should at least fix the docs for now.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

wence- commented 2 years ago

FWIW, offsets of length n+1 is most convenient if interfacing with MPI-like libraries, and is also consistent with most ragged-array CSR-style data-structures, so I would argue for rationalizing on that.

(If you need to return sparse partitions then a pair of counts and offsets is probably necessary.)

davidwendt commented 2 years ago

Just want to link https://github.com/rapidsai/cudf/issues/11223 to this issue as well.