Closed charlesbluca closed 2 years ago
Thanks for the request @charlesbluca! At first glance, this API seems somewhat confusing because it sounds like a combination of 2 distinct ops -> computing the quantile on the first arg of the list, then an indexing operation with other included columns. For the average user, an alternative API might be one which allows returning the indices of the requested quantiles (analogous to the relationship between max
and idxmax
). This might also be implemented with something like a return_indices
argument?
One other question here would be how to handle cases where quantiles don't evenly line up with values (so there is no corresponding index). Might have to restrict interpolation
argument to lower
, higher
, nearest
.
Another consideration would be how much faster a specific implementation compared to the alternative of just finding the indices with an equality check.
Sorry for taking so long to chime in here @charlesbluca. Thanks for raising this!
It would be nice if Pandas also supported "multi-column" quantiles; that is, given df[['c', 'a']].quantiles(...), Pandas would:
compute the quantiles for column c using c's quantiles as an index, select the corresponding rows of column a
I may be misunderstanding, but I am fairly certain that this is not what cudf.DataFrame.quatiles
does. Rather, it computes to coupled quantiles of all columns in the DataFrame (not the quantiles of the first columns). There is no "indexing" workaround in pandas. The only workaround is to convert all columns to a single Series of tuples, which is very slow.
[EDIT] Ah - I guess this indexing trick sort-of works if you peform a lexicographical sort by all columns first?
Ah yes @rjzamora raises a good point - my original example doesn't highlight this, but cuDF's multi-column quantiles
does compute the quantiles for the dataframe after it is lexicographically sorted by all columns; this example makes that more obvious:
In [11]: gdf = cudf.DataFrame({"a": [1, 1, 1, 1, 1], "b": [5, 4, 3, 2, 1]})
In [12]: gdf[["a", "b"]].quantiles([0, 0.5, 1])
Out[12]:
a b
0.0 1 1
0.5 1 3
1.0 1 5
Apologies for the mislead @mzeitlin11 - it looks like in this case, we would need more than just the indices of a single column quantile
operation to compute this.
Might have to restrict interpolation argument to lower, higher, nearest.
This is the exact restriction placed on cuDF's quantiles
:
Is your feature request related to a problem?
For dataframes, Pandas currently only supports per-column quantiles; that is, given
df[['c', 'a']].quantile(...)
, Pandas will compute the individual quantiles for columnsc
anda
:It would be nice if Pandas also supported multi-column quantiles; that is, given
df[['c', 'a']].quantiles(...)
, Pandas would compute the quantiles for the dataframe sorted by all columns. This is currently implemented by cuDF's dataframe:Describe the solution you'd like
I imagine the addition of multi-column quantiles support could happen in two ways:
DataFrame.quantile
to specify whether or not we want multi-column quantilesquantile
In either case, my preference here would be to have this functionality accessible via
DataFrame.quantiles
, to maintain consistency with cuDF.API breaking implications
I can't think of any breakages this would cause, as long as any direct changes to
quantile
ensure that the original behavior is maintained by default.Describe alternatives you've considered
This could be accomplished by sorting the dataframe by all columns and then indexing based on manually computed quantiles, but I imagine there's a more performant way to do this.
Additional context
If this functionality were added, along with a multi-columnar
searchsorted
, it would enable Dask dataframes to computesort_values
with multiple sort-by columns, using an algorithm roughly similar to that of dask-cudf.