Adds non-normalized aggregations to Glean metrics.
By default, GLAM aggregates histograms by normalizing on client_id, which means clients have the same weight in their contribution to the histogram's values, regardless of how many times they submitted pings containing histograms.
A non-normalized aggregation is a way to aggregate histograms in which submissions (instead of clients) have the same weight. This means that clients that send more submissions will have more weight on the final aggregation.
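To make the difference concrete, here is a minimal Python sketch of the two weighting schemes. It is an illustration only, not the SQL in this PR: the function names, the `(client_id, histogram)` input shape, and the toy data are all assumptions, and the non-normalized variant is modeled as a plain sum of raw bucket counts.

```python
from collections import defaultdict


def normalized_aggregate(submissions):
    """Hypothetical helper: every client contributes a total weight of 1."""
    # First merge all submissions belonging to the same client.
    per_client = defaultdict(lambda: defaultdict(int))
    for client_id, histogram in submissions:
        for bucket, count in histogram.items():
            per_client[client_id][bucket] += count

    # Then scale each client's histogram to sum to 1 before adding it in.
    result = defaultdict(float)
    for histogram in per_client.values():
        total = sum(histogram.values())
        if total == 0:
            continue  # a client with no samples contributes nothing
        for bucket, count in histogram.items():
            result[bucket] += count / total
    return dict(result)


def non_normalized_aggregate(submissions):
    """Hypothetical helper: raw counts are summed, so busy clients weigh more."""
    result = defaultdict(int)
    for _client_id, histogram in submissions:
        for bucket, count in histogram.items():
            result[bucket] += count
    return dict(result)


# Toy data (assumed): client "a" sends three pings, client "b" sends one.
submissions = [
    ("a", {"0": 1}),
    ("a", {"0": 1}),
    ("a", {"0": 1}),
    ("b", {"1": 1}),
]

normalized = normalized_aggregate(submissions)
non_normalized = non_normalized_aggregate(submissions)
```

With this toy input, the normalized result gives buckets "0" and "1" equal weight (one client each), while the non-normalized result gives bucket "0" three times the weight of bucket "1" because client "a" submitted three times.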
Non-normalized aggregations were added to legacy telemetry probes in https://github.com/mozilla/bigquery-etl/pull/3873 and now I'm adding them to Glean metrics.
While that first implementation for legacy telemetry duplicates the bucket_counts task, this one adds the non-normalized data to the existing tasks, which is likely to be more efficient because it will scan less data. If that proves true, I'll back-port this approach to legacy telemetry.
Non-normalized data for Glean metrics will show up on GLAM starting from the day this gets merged. In other words, there will be no data backfill.
Checklist for reviewer:
[ ] Commits should reference a bug or github issue, if relevant (if a bug is referenced, the pull request should include the bug number in the title).
[ ] If the PR comes from a fork, trigger integration CI tests by running the Push to upstream workflow and provide the <username>:<branch> of the fork as parameter. The parameter will also show up in the logs of the manual-trigger-required-for-fork CI task together with more detailed instructions.
[ ] If adding a new field to a query, ensure that the schema and dependent downstream schemas have been updated.
[ ] When adding a new derived dataset, ensure that data is not available already (fully or partially) and recommend extending an existing dataset in favor of creating new ones. Data can be available in the bigquery-etl repository, looker-hub or in looker-spoke-default.
For modifications to schemas in restricted namespaces (see CODEOWNERS):