Closed dzeber closed 1 month ago
@dzeber Is there a way to add a sample ID to this table using any kind of sampling technique? As the table grows over time, a sample ID would allow quick data sampling. Alternatively, would it be more practical to randomly sample the data each time we query this table?
As discussed, we can't sample on client ID the way we do in other datasets since there is no client identifier. The main unit of analysis for this table is at the level of serp impressions/events, and the table has 1 row per event. The easiest way to sample is probably to do it in SQL at query time.
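A minimal sketch of what query-time sampling could look like in BigQuery SQL. The table name is a hypothetical placeholder and the 1% rate is arbitrary; neither comes from this PR.

```sql
-- Option 1: row-level random sample of roughly 1% of events.
-- RAND() is re-evaluated on every run, so the sample is not reproducible.
SELECT *
FROM `my_project.my_dataset.serp_events`  -- hypothetical table name
WHERE RAND() < 0.01;

-- Option 2: block-level sample of roughly 1% of the table.
-- TABLESAMPLE picks storage blocks rather than individual rows, so it
-- reads less data but gives a coarser sample.
SELECT *
FROM `my_project.my_dataset.serp_events` TABLESAMPLE SYSTEM (1 PERCENT);
```

Option 1 still scans the full table, while option 2 reduces the bytes read at the cost of a coarser sample. Which trade-off suits this table is a judgment call.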
@curtismorales looks like the CI failed with
400 Table *****************************.serp_category_name does not have a schema
I'm not sure what to make of this. This is the table from https://github.com/mozilla/private-bigquery-etl/pull/411
Moved to https://github.com/mozilla/private-bigquery-etl/pull/426 to address CI failure
Flattens serp_categorization data into 1 row per event and expands the extras fields into columns. Also maps the category IDs to names.
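For readers unfamiliar with this kind of transformation, here is a rough sketch of the general pattern (unnest events, pivot extras into columns, join a lookup table). It is not the query from this PR: the source table, every column name, and the `organic_category` extras key are assumptions made for illustration only.

```sql
-- All table, column, and key names below are hypothetical placeholders.
WITH flattened AS (
  SELECT
    e.timestamp AS event_timestamp,
    -- Pull one key out of the key/value extras array into its own column.
    -- Assumes each key appears at most once per event.
    SAFE_CAST(
      (SELECT x.value FROM UNNEST(e.extra) AS x WHERE x.key = 'organic_category')
      AS INT64
    ) AS organic_category_id
  FROM `my_project.my_dataset.serp_categorization_pings` AS p
  CROSS JOIN UNNEST(p.events) AS e  -- one output row per event
)
SELECT
  f.event_timestamp,
  f.organic_category_id,
  c.category_name AS organic_category_name  -- ID mapped to a readable name
FROM flattened AS f
LEFT JOIN `my_project.my_dataset.serp_category_name` AS c
  ON c.category_id = f.organic_category_id;
```

The `LEFT JOIN` keeps events whose category ID has no match in the lookup table, leaving the name as `NULL` rather than dropping the row.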
Checklist for reviewer:
<username>:<branch> of the fork as parameter. The parameter will also show up in the logs of the manual-trigger-required-for-fork CI task together with more detailed instructions.
For modifications to schemas in restricted namespaces (see CODEOWNERS):
┆Issue is synchronized with this Jira Task