ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

feat(bigquery): general `approx_quantile` similar to `approx_median` but for arbitrary quantile #9541

Closed tswast closed 1 week ago

tswast commented 1 month ago

Is your feature request related to a problem?

There isn't a great way to use BigQuery's APPROX_QUANTILES feature to get values besides the approximate median in Ibis.

What is the motivation behind your request?

To mimic pandas describe, BigQuery DataFrames uses APPROX_QUANTILES to get an approximate 25th percentile, median, and 75th percentile.

Note: the percentiles are configurable in pandas, but unfortunately BigQuery SQL's number-of-bins approach makes it difficult to support arbitrary percentiles.
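For illustration, here is a minimal sketch of the kind of query this implies. It relies only on the documented behavior that `APPROX_QUANTILES(expr, n)` returns an array of `n + 1` boundaries (approximate minimum first, approximate maximum last). The helper name `describe_quartiles_sql` and the column aliases are hypothetical, not BigQuery DataFrames code, and identifiers are interpolated unquoted purely to keep the example short.

```python
def describe_quartiles_sql(column: str, table: str) -> str:
    """Build a BigQuery query returning approximate quartiles of `column`.

    Hypothetical helper for illustration only; it does no identifier quoting.
    """
    # With 4 bins the array holds 5 boundaries:
    # index 0 = min, 1 = 25%, 2 = 50% (median), 3 = 75%, 4 = max.
    return (
        "SELECT\n"
        "  q[OFFSET(1)] AS p25,\n"
        "  q[OFFSET(2)] AS p50,\n"
        "  q[OFFSET(3)] AS p75\n"
        f"FROM (SELECT APPROX_QUANTILES({column}, 4) AS q FROM {table})"
    )


if __name__ == "__main__":
    print(describe_quartiles_sql("x", "my_dataset.my_table"))
```

Because the bin count is fixed per call, percentiles that do not fall on a multiple of 1/4 would need a different bin count, which is the difficulty noted above.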

Describe the solution you'd like

From the BigQuery DataFrames perspective, it'd be great if there were an API to get evenly spaced approximate quantiles, but perhaps an approx_quantile function that takes an integer from 0 to 100 would be more flexible? Or a value from 0 to 1 to mimic pandas, with a note that some backends like BigQuery only support precision up to a certain point (maybe the nearest 0.05 or nearest 0.01)?
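As a rough sketch of the 0-to-1 option, a quantile could be snapped to the nearest fraction whose denominator does not exceed some maximum bin count, then translated into a bin count and array offset. Everything here is an assumption for illustration: the names `quantile_to_bins` and `approx_quantile_sql`, and the default of 100 bins (nearest 0.01), are hypothetical and not part of Ibis or BigQuery DataFrames.

```python
from fractions import Fraction
from typing import Tuple


def quantile_to_bins(q: float, max_bins: int = 100) -> Tuple[int, int]:
    """Map a quantile in [0, 1] to (number_of_bins, offset).

    The quantile is rounded to the closest fraction with a denominator of at
    most `max_bins`, so precision is limited by the chosen bin count.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    frac = Fraction(q).limit_denominator(max_bins)
    # denominator = number of bins, numerator = index into the boundary array
    return frac.denominator, frac.numerator


def approx_quantile_sql(column: str, q: float, max_bins: int = 100) -> str:
    """Render a BigQuery expression for one approximate quantile (unquoted)."""
    bins, offset = quantile_to_bins(q, max_bins)
    return f"APPROX_QUANTILES({column}, {bins})[OFFSET({offset})]"


if __name__ == "__main__":
    for q in (0.25, 0.5, 0.9):
        print(q, approx_quantile_sql("x", q))
```

Reducing the fraction keeps the bin count small (0.25 needs only 4 bins rather than 100), though whether fewer bins change the approximation quality is not something this sketch addresses.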

What version of ibis are you running?

8.x, but working on 9.x upgrade

What backend(s) are you using, if any?

BigQuery


deepyaman commented 1 month ago

I think approx_quantile makes sense to expose (other backends like DuckDB also offer this).

@chloeh13q pointed out a bit of potential inconsistency, where we implement median for the BigQuery backend using the approx_quantile method, but don't call it approx_median; in that case, would it be more consistent to expose approx_ versions of both quantile and median?

I think it makes sense to revisit this next week once most of the maintainers are back from SciPy.

ncclementi commented 1 month ago

Tangential but ...

where we implement median for the BigQuery backend using the approx_quantile method, but don't call it approx_median ...

@deepyaman Can you point to where this happens?

We do have an approximate median, implemented using approximate quantiles: see https://github.com/ibis-project/ibis/blob/b44dac2a5d0346ed0f3dbdc05597104f64e40779/ibis/backends/bigquery/compiler.py#L148-L149

and the exact median is not supported: https://github.com/ibis-project/ibis/blob/b44dac2a5d0346ed0f3dbdc05597104f64e40779/ibis/backends/bigquery/compiler.py#L38C1-L43C20

cpcloud commented 1 month ago

@tswast Thanks for the issue!

I think it makes sense to add a Column.approx_quantile method that mirrors the quantile method that we have now.

Any takers from the BigFrames team?