ENH: Add support for multi-column quantiles of DataFrame

charlesbluca commented 3 years ago

Is your feature request related to a problem?

For dataframes, Pandas currently only supports per-column quantiles; that is, given df[['c', 'a']].quantile(...), Pandas will compute the individual quantiles for columns c and a:

>>> df = pd.DataFrame({'a': [1, 0, 11, 12, 2], 'b': [1, 2, 3, 4, 5], 'c': [0, 1, 5, 2, 3]})
>>> df[['c', 'a']].quantile([0, 0.5, 1])
       c     a
0.0  0.0   0.0
0.5  2.0   2.0
1.0  5.0  12.0

It would be nice if Pandas also supported multi-column quantiles; that is, given df[['c', 'a']].quantiles(...), Pandas would compute the quantiles for the dataframe sorted by all columns. This is currently implemented by cuDF's dataframe:

In [11]: gdf = cudf.DataFrame({"a": [1, 1, 1, 1, 1], "b": [5, 4, 3, 2, 1]})
In [12]: gdf[["a", "b"]].quantiles([0, 0.5, 1])
Out[12]: 
     a  b
0.0  1  1
0.5  1  3
1.0  1  5

Describe the solution you'd like

I imagine the addition of multi-column quantiles support could happen in two ways:

the addition of a default kwarg to DataFrame.quantile to specify whether or not we want multi-column quantiles
the addition of a new method to compute multi-column quantiles independent from the logic of quantile

In either case, my preference here would be to have this functionality accessible via DataFrame.quantiles, to maintain consistency with cuDF.

API breaking implications

I can't think of any breakages this would cause, as long as any direct changes to quantile ensure that the original behavior is maintained by default.

Describe alternatives you've considered

This could be accomplished by sorting the dataframe by all columns and then indexing based on manually computed quantiles, but I imagine there's a more performant way to do this.

Additional context

If this functionality were added, along with a multi-columnar searchsorted, it would enable Dask dataframes to compute sort_values with multiple sort-by columns, using an algorithm roughly similar to that of dask-cudf.

mzeitlin11 commented 3 years ago

Thanks for the request @charlesbluca! At first glance, this API seems somewhat confusing because it sounds like a combination of 2 distinct ops -> computing the quantile on the first arg of the list, then an indexing operation with other included columns. For the average user, an alternative API might be one which allows returning the indices of the requested quantiles (analogous to the relationship between max and idxmax). This might also be implemented with something like a return_indices argument?

One other question here would be how to handle cases where quantiles don't evenly line up with values (so there is no corresponding index). Might have to restrict interpolation argument to lower, higher, nearest.

mzeitlin11 commented 3 years ago

Another consideration would be how much faster a specific implementation compared to the alternative of just finding the indices with an equality check.

rjzamora commented 3 years ago

Sorry for taking so long to chime in here @charlesbluca. Thanks for raising this!

It would be nice if Pandas also supported "multi-column" quantiles; that is, given df[['c', 'a']].quantiles(...), Pandas would:

compute the quantiles for column c using c's quantiles as an index, select the corresponding rows of column a

I may be misunderstanding, but I am fairly certain that this is not what cudf.DataFrame.quatiles does. Rather, it computes to coupled quantiles of all columns in the DataFrame (not the quantiles of the first columns). There is no "indexing" workaround in pandas. The only workaround is to convert all columns to a single Series of tuples, which is very slow.

[EDIT] Ah - I guess this indexing trick sort-of works if you peform a lexicographical sort by all columns first?

charlesbluca commented 3 years ago

Ah yes @rjzamora raises a good point - my original example doesn't highlight this, but cuDF's multi-column quantiles does compute the quantiles for the dataframe after it is lexicographically sorted by all columns; this example makes that more obvious:

In [11]: gdf = cudf.DataFrame({"a": [1, 1, 1, 1, 1], "b": [5, 4, 3, 2, 1]})

In [12]: gdf[["a", "b"]].quantiles([0, 0.5, 1])
Out[12]: 
     a  b
0.0  1  1
0.5  1  3
1.0  1  5

Apologies for the mislead @mzeitlin11 - it looks like in this case, we would need more than just the indices of a single column quantile operation to compute this.

Might have to restrict interpolation argument to lower, higher, nearest.

This is the exact restriction placed on cuDF's quantiles:

https://github.com/rapidsai/cudf/blob/68c56b7013e0a4e9cf4b420a11e476112a6655c0/python/cudf/cudf/core/dataframe.py#L5993-L6005

pandas-dev / pandas