Open jrhemstad opened 4 years ago
Agree.
Agree.
New API or change `split`?
There should be an API that allows naively passing in the vector of offsets from a partitioning API and it returns a vector of zero-copy views for each partition.
Was agreeing with your final statement, which didn't specify a choice. :) I would change split. Doing so would also make split slightly more versatile -- e.g. it could be used to skip the beginning and/or end of a table when splitting.
but of course we need to check for any existing users of split before we change it...
Should at least fix the docs for now.
This issue has been labeled `inactive-90d` due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
FWIW, `offsets` of length `n+1` is most convenient if interfacing with MPI-like libraries, and is also consistent with most ragged-array CSR-style data structures, so I would argue for rationalizing on that. (If you need to return sparse partitions then a pair of `counts` and `offsets` is probably necessary.)
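To illustrate the CSR-style convention mentioned above, here is a minimal sketch of how `n` per-partition counts map to `n+1` offsets. The helper name `counts_to_offsets` and the use of plain `std::vector<int>` are assumptions for illustration only, not libcudf code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a libcudf API): build n+1 CSR-style offsets
// from n per-partition counts with a running sum.
std::vector<int> counts_to_offsets(std::vector<int> const& counts)
{
  std::vector<int> offsets(counts.size() + 1, 0);
  for (std::size_t i = 0; i < counts.size(); ++i) {
    assert(counts[i] >= 0);  // a partition cannot have a negative row count
    // offsets[0] is 0; offsets[i + 1] is the sum of counts[0..i]
    offsets[i + 1] = offsets[i] + counts[i];
  }
  return offsets;
}
```

With this layout an empty partition simply shows up as two equal neighbouring offsets, and the last offset equals the total row count.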
Just want to link https://github.com/rapidsai/cudf/issues/11223 to this issue as well.
Describe the bug
Partitioning APIs that partition a table into `n` partitions, like `hash_partition` or `round_robin_partition`, return a single table and a vector of `n+1` offsets that point to the beginning of each partition, where the size of any partition `i` can be determined by `offsets[i+1] - offsets[i]`.

For example:
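As a hypothetical illustration with made-up numbers (this is a standalone sketch, not libcudf code), the size computation described above can be written as follows. The helper name `partition_size` is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of libcudf): number of rows in partition i,
// given the n+1 offsets into the single partitioned table.
int partition_size(std::vector<int> const& offsets, std::size_t i)
{
  assert(i + 1 < offsets.size());  // i must name one of the n partitions
  return offsets[i + 1] - offsets[i];
}
```

For a 10-row table split into three partitions with offsets `{0, 4, 4, 10}`, the partition sizes are 4, 0 and 6.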
I would expect to be able to trivially pass the output of a partitioning API into an API like `split` or `slice` in order to get a vector of zero-copy `table_view`s for each partition.

However, this is not possible because the expected inputs for `split` or `slice` are incompatible with the `offsets` vector returned from a partitioning API.

`slice` expects a vector of index pairs:

`split` expects a vector of the split points:

Neither of these is trivially compatible with the output of a partitioning API.
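To make the mismatch concrete, here is a minimal sketch of the glue code a caller currently has to write. It assumes the `n+1` offsets layout described in this issue and that `slice` takes its index pairs flattened as `{begin0, end0, begin1, end1, ...}`. Both helper names are hypothetical and the code uses only the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers, not libcudf functions.

// Interior split points: the offsets vector without its first and last element.
std::vector<int> offsets_to_splits(std::vector<int> const& offsets)
{
  assert(offsets.size() >= 2);  // n+1 offsets with n >= 1
  return std::vector<int>(offsets.begin() + 1, offsets.end() - 1);
}

// Flattened [begin, end) pairs, one pair per partition.
std::vector<int> offsets_to_slice_indices(std::vector<int> const& offsets)
{
  assert(!offsets.empty());
  std::vector<int> indices;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    indices.push_back(offsets[i]);      // begin of partition i
    indices.push_back(offsets[i + 1]);  // end of partition i
  }
  return indices;
}
```

Neither conversion is hard, but every caller has to repeat it, which is the inconvenience this issue is about.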
`split` is the closest. You can obtain the `splits` vector from the `offsets` vector by dropping the first and last element from `offsets`. However, that is inconvenient.

Expected behavior
There should be an API that allows naively passing in the vector of offsets from a partitioning API and it returns a vector of zero-copy views for each partition.
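A rough sketch of what such an API could look like is shown below. This is a toy over a plain `std::vector<int>`, not a proposal for the actual libcudf signature. The `row_view` struct stands in for `table_view`, and the name `split_by_offsets` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a zero-copy view (plays the role of table_view here):
// it only points into the parent buffer and owns nothing.
struct row_view {
  int const* data;
  std::size_t size;
};

// Hypothetical API sketch (not an existing libcudf function): accept the n+1
// offsets exactly as a partitioning API returns them and produce n views.
std::vector<row_view> split_by_offsets(std::vector<int> const& rows,
                                       std::vector<std::size_t> const& offsets)
{
  assert(!offsets.empty() && offsets.back() <= rows.size());
  std::vector<row_view> views;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    assert(offsets[i] <= offsets[i + 1]);  // offsets must be non-decreasing
    views.push_back({rows.data() + offsets[i], offsets[i + 1] - offsets[i]});
  }
  return views;
}
```

Because the first and last offsets are taken from the input rather than implied, the same call can also skip rows at the beginning or end of the parent, as suggested in the comments.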