ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

feat(bigquery): support for `WITH AGGREGATION_THRESHOLD` in aggregations #8903

Open tswast opened 4 months ago

tswast commented 4 months ago

Is your feature request related to a problem?

BigQuery customers can set aggregation threshold analysis rules to protect privacy-sensitive data. If they have set up such rules, they need to use a WITH AGGREGATION_THRESHOLD clause when querying the table.

SELECT WITH AGGREGATION_THRESHOLD
  test_id, COUNT(DISTINCT last_name) AS student_count
FROM mydataset.ExamView
GROUP BY test_id;

from https://cloud.google.com/bigquery/docs/analysis-rules#view_in_privacy_query

Describe the solution you'd like

A new parameter to Table.aggregate and/or Table.groupby would seem to be the right place to add this.

Alternatively, this could be a new table expression type, applied before the group-by, that represents a thresholded table.
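Until something like this exists in ibis, the effect can be approximated by post-processing the compiled SQL string. The snippet below is only a rough sketch of that idea, not an ibis API: `add_aggregation_threshold` is a hypothetical helper, and it assumes the compiled query is a single statement that starts with SELECT (no leading CTE).

```python
import re

# Matches only a SELECT keyword at the very start of the query text.
_LEADING_SELECT = re.compile(r"^\s*SELECT\b", re.IGNORECASE)


def add_aggregation_threshold(sql: str) -> str:
    """Hypothetical helper: turn the outermost SELECT into
    SELECT WITH AGGREGATION_THRESHOLD.

    Assumes the query starts with SELECT; queries that begin with a CTE
    (WITH ... AS) are rejected rather than rewritten incorrectly.
    """
    if not _LEADING_SELECT.match(sql):
        raise ValueError("expected a query that starts with SELECT")
    # count=1 is redundant with the ^ anchor but makes the intent explicit.
    return _LEADING_SELECT.sub("SELECT WITH AGGREGATION_THRESHOLD", sql, count=1)


query = (
    "SELECT test_id, COUNT(DISTINCT last_name) AS student_count "
    "FROM mydataset.ExamView GROUP BY test_id"
)
print(add_aggregation_threshold(query))
```

String rewriting like this is brittle, which is why a parameter or expression type that the compiler understands would be preferable.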

What version of ibis are you running?

N/A

What backend(s) are you using, if any?

BigQuery


kszucs commented 4 months ago

Can we think of it as an arbitrary query setting, similar to what ClickHouse has, for example?

tswast commented 4 months ago

> Can we think of it as an arbitrary query setting, similar to what ClickHouse has, for example?

I haven't used ClickHouse, but it looks pretty similar. ClickHouse appears to support general key/value settings, whereas BigQuery has an extra layer of syntax: each feature is enabled by its own clause, and each clause has its own sub-options.

There is a related (sub)query-scoped privacy option, SELECT [ WITH differential_privacy_clause ], which is documented as part of the general SELECT syntax.

I don't actually see AGGREGATION_THRESHOLD listed there, but from the examples, the AGGREGATION_THRESHOLD clause looks like it'd be parsed and scoped to the (sub)query in the same way.
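To make the "extra layer" concrete, here is a minimal sketch of how such a clause could be modeled: a feature name plus its own sub-options, rendered into the text that follows SELECT. `SelectWithClause` is a hypothetical name, not an ibis or BigQuery client type, and the `OPTIONS(...)` form with `threshold` and `privacy_unit_column` is my reading of the BigQuery docs, so treat those details as assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SelectWithClause:
    """Hypothetical model of a (sub)query-scoped SELECT WITH <feature> clause."""

    feature: str  # e.g. "AGGREGATION_THRESHOLD" or "DIFFERENTIAL_PRIVACY"
    options: dict = field(default_factory=dict)  # feature-specific sub-options

    def render(self) -> str:
        clause = f"WITH {self.feature.upper()}"
        if self.options:
            # Values are rendered with str(); quoting and escaping are
            # deliberately left out of this sketch.
            rendered = ", ".join(f"{k}={v}" for k, v in self.options.items())
            clause += f" OPTIONS({rendered})"
        return clause


# Option names below are assumptions, and student_id is a made-up column.
clause = SelectWithClause(
    "aggregation_threshold",
    {"threshold": 50, "privacy_unit_column": "student_id"},
)
print(f"SELECT {clause.render()} test_id, COUNT(DISTINCT last_name) AS student_count")
```

Compared with a flat key/value settings map, the feature name selects which sub-options are valid, so a generic "query settings" dict would probably need to be nested per feature.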