bug 1904841: Fix glam_percentile udf

mozilla / bigquery-etl

Bigquery ETL

Mozilla Public License 2.0


Covers a case in which all bucket values are zero, and adds a test for it.

Here's a test comparing it with the old JS version of this udf: https://sql.telemetry.mozilla.org/queries/100986/source

Checklist for reviewer:

[ ] Commits should reference a bug or github issue, if relevant (if a bug is referenced, the pull request should include the bug number in the title).
[ ] If the PR comes from a fork, trigger integration CI tests by running the Push to upstream workflow and provide the <username>:<branch> of the fork as parameter. The parameter will also show up in the logs of the manual-trigger-required-for-fork CI task together with more detailed instructions.
[ ] If adding a new field to a query, ensure that the schema and dependent downstream schemas have been updated.
[ ] When adding a new derived dataset, ensure that data is not available already (fully or partially) and recommend extending an existing dataset in favor of creating new ones. Data can be available in the bigquery-etl repository, looker-hub or in looker-spoke-default.

For modifications to schemas in restricted namespaces (see CODEOWNERS):

[ ] Follow the change control procedure

Issue is synchronized with this Jira Task

Integration report for "Fix glam_percentile udf"

`sql.diff`


```diff
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/mozfun/glam/percentile/udf.sql /tmp/workspace/generated-sql/sql/mozfun/glam/percentile/udf.sql
--- /tmp/workspace/main-generated-sql/sql/mozfun/glam/percentile/udf.sql  2024-06-27 15:14:38.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/mozfun/glam/percentile/udf.sql  2024-06-27 15:14:33.000000000 +0000
@@ -13,23 +13,38 @@
         AND pct <= 100,
         TRUE,
         ERROR('percentile must be a value between 0 and 100')
-      ) pct_ok
+      ) pct_ok,
+      SUM(value) AS total_value
+    FROM
+      UNNEST(histogram)
   ),
   keyed_cum_sum AS (
     SELECT
       key,
-      SUM(value) OVER (ORDER BY CAST(key AS FLOAT64)) / SUM(value) OVER () AS cum_sum
+      IF(
+        total_value = 0,
+        0,
+        SUM(value) OVER (ORDER BY CAST(key AS FLOAT64)) / SUM(value) OVER ()
+      ) cum_sum
+    FROM
+      UNNEST(histogram),
+      check
+  ),
+  max_bucket AS (
+    SELECT
+      MAX(CAST(key AS FLOAT64)) AS bucket
     FROM
       UNNEST(histogram)
   )
   SELECT
-    CAST(key AS FLOAT64)
+    IF(total_value = 0, max_bucket.bucket, CAST(key AS FLOAT64))
   FROM
     keyed_cum_sum,
-    check
+    check,
+    max_bucket
   WHERE
     check.pct_ok
-    AND cum_sum >= pct / 100
+    AND (total_value = 0 OR cum_sum >= pct / 100)
   ORDER BY
     cum_sum
   LIMIT
```
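To make the intent of the change easier to review, here is a minimal plain-Python sketch of what the patched query appears to compute. It is a model for illustration, not the UDF itself. The function name `glam_percentile_model` and the list-of-pairs histogram shape are hypothetical. The sketch also assumes that the query keeps a single row (the `LIMIT` value falls outside the diff hunk), that ties in `cum_sum` resolve to the lowest bucket, and that an empty histogram yields no result.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical stand-in for the UDF's histogram argument: (key, value) pairs
# with string keys, mirroring the CAST(key AS FLOAT64) in the query.
Histogram = Sequence[Tuple[str, float]]


def glam_percentile_model(pct: float, histogram: Histogram) -> Optional[float]:
    """Model of the patched query: return the first bucket whose cumulative share reaches pct."""
    # Mirrors the `check` CTE, which raises an error for out-of-range percentiles.
    if not 0 <= pct <= 100:
        raise ValueError("percentile must be a value between 0 and 100")

    # Assumption: an empty histogram produces no row, modeled here as None.
    if not histogram:
        return None

    # Order buckets numerically, as the window ORDER BY CAST(key AS FLOAT64) does.
    buckets = sorted((float(key), value) for key, value in histogram)
    total_value = sum(value for _, value in buckets)

    # New branch from this PR: when every bucket is zero, skip the division
    # and fall back to the largest bucket (the `max_bucket` CTE).
    if total_value == 0:
        return buckets[-1][0]

    # Otherwise return the first bucket whose cumulative share is >= pct / 100.
    # Assumption: ties in cum_sum resolve to the lowest bucket.
    running = 0.0
    for bucket, value in buckets:
        running += value
        if running / total_value >= pct / 100:
            return bucket
    return None
```

Under these assumptions, `glam_percentile_model(50, [("0", 0), ("1", 0), ("2", 0)])` returns `2.0`, the largest bucket, while `glam_percentile_model(50, [("0", 1), ("1", 2), ("2", 1)])` returns `1.0`.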



bug 1904841: Fix glam_percentile udf #5854
