mozilla / bigquery-etl

Bigquery ETL
https://mozilla.github.io/bigquery-etl
Mozilla Public License 2.0
253 stars 100 forks source link

Adding table for merino export to gcs #6054

Closed chelseybeck closed 1 month ago

chelseybeck commented 1 month ago

Creating the table for Merino extracts to GCS

Checklist for reviewer:

For modifications to schemas in restricted namespaces (see CODEOWNERS):

┆Issue is synchronized with this Jira Task

dataops-ci-bot commented 1 month ago

Integration report for "adding table for merino export to gcs"

sql.diff

Click to expand! ```diff Only in /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived: merino_newtab_aggregates_v1 diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/checks.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/checks.sql --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/checks.sql 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/checks.sql 2024-08-12 18:14:24.000000000 +0000 @@ -0,0 +1,16 @@ +-- macro checks + +#fail +{{ not_null(["scheduled_corpus_item_id"]) }} + +#fail +{{ is_unique(["scheduled_corpus_item_id"]) }} + +#fail +{{ not_null(["impression_count"]) }} + +#fail +{{ not_null(["click_count"]) }} + +#fail +{{ min_row_count(1) }} diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/metadata.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/metadata.yaml --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/metadata.yaml 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/metadata.yaml 2024-08-12 18:12:42.000000000 +0000 @@ -0,0 +1,15 @@ +friendly_name: Merino Newtab Aggregates V1 +description: |- + Please provide a description for the query +owners: [] +labels: {} +bigquery: null +workgroup_access: +- role: roles/bigquery.dataViewer + members: + - workgroup:mozilla-confidential +references: + checks.sql: + - .. + query.sql: + - moz-fx-data-shared-prod.firefox_desktop_live.newtab_v1 diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/query.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/query.sql --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/query.sql 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/query.sql 2024-08-12 18:14:24.000000000 +0000 @@ -0,0 +1,47 @@ +WITH deduplicated_pings AS ( + SELECT + submission_timestamp, + document_id, + events, + FROM + `moz-fx-data-shared-prod.firefox_desktop_live.newtab_v1` + WHERE + submission_timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) + QUALIFY + ROW_NUMBER() OVER ( + PARTITION BY + DATE(submission_timestamp), + document_id + ORDER BY + submission_timestamp DESC + ) = 1 +), +flattened_newtab_events AS ( + SELECT + document_id, + submission_timestamp, + unnested_events.name AS event_name, + mozfun.map.get_key( + unnested_events.extra, + 'scheduled_corpus_item_id' + ) AS scheduled_corpus_item_id, + mozfun.map.get_key(unnested_events.extra, 'position') AS position, + COUNT(1) OVER (PARTITION BY document_id, unnested_events.name) AS user_event_count + FROM + deduplicated_pings, + UNNEST(events) AS unnested_events + --filter to Pocket events + WHERE + unnested_events.category = 'pocket' + AND unnested_events.name IN ('impression', 'click', 'save', 'dismiss') + --keep only data with a non-null scheduled corpus item ID + AND (mozfun.map.get_key(unnested_events.extra, 'scheduled_corpus_item_id') IS NOT NULL) +) +SELECT + scheduled_corpus_item_id, + SUM(CASE WHEN event_name = 'impression' THEN 1 ELSE 0 END) AS impression_count, + SUM(CASE WHEN event_name = 'click' THEN 1 ELSE 0 END) AS click_count +FROM + flattened_newtab_events +GROUP BY + 1; diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/schema.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/schema.yaml --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/schema.yaml 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/merino_newtab_aggregates_v1/schema.yaml 2024-08-12 18:10:25.000000000 +0000 @@ -0,0 +1,10 @@ +fields: +- mode: NULLABLE + name: scheduled_corpus_item_id + type: STRING +- mode: NULLABLE + name: impression_count + type: INTEGER +- mode: NULLABLE + name: click_count + type: INTEGER \ No newline at end of file ```

Link to full diff

dataops-ci-bot commented 1 month ago

Integration report for "updating dag name"

sql.diff

Click to expand! ```diff Only in /tmp/workspace/generated-sql/dags/: bqetl_merino_newtab_extract_to_gcs.py diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py /tmp/workspace/generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py --- /tmp/workspace/main-generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py 2024-08-12 19:05:56.000000000 +0000 @@ -0,0 +1,81 @@ +# Generated via https://github.com/mozilla/bigquery-etl/blob/main/bigquery_etl/query_scheduling/generate_airflow_dags.py + +from airflow import DAG +from airflow.sensors.external_task import ExternalTaskMarker +from airflow.sensors.external_task import ExternalTaskSensor +from airflow.utils.task_group import TaskGroup +import datetime +from operators.gcp_container_operator import GKEPodOperator +from utils.constants import ALLOWED_STATES, FAILED_STATES +from utils.gcp import bigquery_etl_query, bigquery_dq_check + +docs = """ +### bqetl_merino_newtab_extract_to_gcs + +Built from bigquery-etl repo, [`dags/bqetl_merino_newtab_extract_to_gcs.py`](https://github.com/mozilla/bigquery-etl/blob/generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py) + +#### Description + +Aggregates Newtab engagement data that lands in a GCS bucket for Merino recommendations. + +#### Owner + +cbeck@mozilla.com + +#### Tags + +* impact/tier_1 +* repo/bigquery-etl +""" + + +default_args = { + "owner": "cbeck@mozilla.com", + "start_date": datetime.datetime(2024, 8, 12, 0, 0), + "end_date": None, + "email": ["cbeck@mozilla.com", "gkatre@mozilla.com"], + "depends_on_past": False, + "retry_delay": datetime.timedelta(seconds=300), + "email_on_failure": True, + "email_on_retry": False, + "retries": 2, +} + +tags = ["impact/tier_1", "repo/bigquery-etl"] + +with DAG( + "bqetl_merino_newtab_extract_to_gcs", + default_args=default_args, + schedule_interval="*/20 * * * *", + doc_md=docs, + tags=tags, +) as dag: + + checks__fail_telemetry_derived__newtab_merino_extract__v1 = bigquery_dq_check( + task_id="checks__fail_telemetry_derived__newtab_merino_extract__v1", + source_table="newtab_merino_extract_v1", + dataset_id="telemetry_derived", + project_id="moz-fx-data-shared-prod", + is_dq_check_fail=True, + owner="cbeck@mozilla.com", + email=["cbeck@mozilla.com", "gkatre@mozilla.com"], + depends_on_past=False, + task_concurrency=1, + retries=0, + ) + + telemetry_derived__newtab_merino_extract__v1 = bigquery_etl_query( + task_id="telemetry_derived__newtab_merino_extract__v1", + destination_table="newtab_merino_extract_v1", + dataset_id="telemetry_derived", + project_id="moz-fx-data-shared-prod", + owner="cbeck@mozilla.com", + email=["cbeck@mozilla.com", "gkatre@mozilla.com"], + date_partition_parameter=None, + depends_on_past=False, + task_concurrency=1, + ) + + checks__fail_telemetry_derived__newtab_merino_extract__v1.set_upstream( + telemetry_derived__newtab_merino_extract__v1 + ) Only in /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived: newtab_merino_extract_v1 diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql 2024-08-12 19:03:15.000000000 +0000 @@ -0,0 +1,16 @@ +-- macro checks + +#fail +{{ not_null(["scheduled_corpus_item_id"]) }} + +#fail +{{ is_unique(["scheduled_corpus_item_id"]) }} + +#fail +{{ not_null(["impression_count"]) }} + +#fail +{{ not_null(["click_count"]) }} + +#fail +{{ min_row_count(1) }} diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/metadata.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/metadata.yaml --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/metadata.yaml 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/metadata.yaml 2024-08-12 19:01:27.000000000 +0000 @@ -0,0 +1,28 @@ +friendly_name: Newtab Merino Extract +description: |- + Aggregated Newtab events for Merino recommendations. + See https://mozilla-hub.atlassian.net/browse/MC-1256 +owners: +- cbeck@mozilla.com +- gkatre@mozilla.com +labels: + owner: cbeck + dag: bqetl_merino_newtab_extract_to_gcs + owner1: cbeck + owner2: gkatre +scheduling: + dag_name: bqetl_merino_newtab_extract_to_gcs + date_partition_parameter: null +bigquery: + time_partitioning: null + range_partitioning: null + clustering: null +workgroup_access: +- role: roles/bigquery.dataViewer + members: + - workgroup:mozilla-confidential +references: + checks.sql: + - .. + query.sql: + - moz-fx-data-shared-prod.firefox_desktop_live.newtab_v1 diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql 2024-08-12 18:58:47.000000000 +0000 @@ -0,0 +1,47 @@ +WITH deduplicated_pings AS ( + SELECT + submission_timestamp, + document_id, + events, + FROM + `moz-fx-data-shared-prod.firefox_desktop_live.newtab_v1` + WHERE + submission_timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) + QUALIFY + ROW_NUMBER() OVER ( + PARTITION BY + DATE(submission_timestamp), + document_id + ORDER BY + submission_timestamp DESC + ) = 1 +), +flattened_newtab_events AS ( + SELECT + document_id, + submission_timestamp, + unnested_events.name AS event_name, + mozfun.map.get_key( + unnested_events.extra, + 'scheduled_corpus_item_id' + ) AS scheduled_corpus_item_id, + mozfun.map.get_key(unnested_events.extra, 'position') AS position, + COUNT(1) OVER (PARTITION BY document_id, unnested_events.name) AS user_event_count + FROM + deduplicated_pings, + UNNEST(events) AS unnested_events + --filter to Pocket events + WHERE + unnested_events.category = 'pocket' + AND unnested_events.name IN ('impression', 'click', 'save', 'dismiss') + --keep only data with a non-null scheduled corpus item ID + AND (mozfun.map.get_key(unnested_events.extra, 'scheduled_corpus_item_id') IS NOT NULL) +) +SELECT + scheduled_corpus_item_id, + SUM(CASE WHEN event_name = 'impression' THEN 1 ELSE 0 END) AS impression_count, + SUM(CASE WHEN event_name = 'click' THEN 1 ELSE 0 END) AS click_count +FROM + flattened_newtab_events +GROUP BY + 1; diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml 2024-08-12 18:58:47.000000000 +0000 @@ -0,0 +1,10 @@ +fields: +- mode: NULLABLE + name: scheduled_corpus_item_id + type: STRING +- mode: NULLABLE + name: impression_count + type: INTEGER +- mode: NULLABLE + name: click_count + type: INTEGER ```

Link to full diff

dataops-ci-bot commented 1 month ago

Integration report for "formatting sql"

sql.diff

Click to expand! ```diff Only in /tmp/workspace/generated-sql/dags/: bqetl_merino_newtab_extract_to_gcs.py diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py /tmp/workspace/generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py --- /tmp/workspace/main-generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py 2024-08-12 19:14:24.000000000 +0000 @@ -0,0 +1,81 @@ +# Generated via https://github.com/mozilla/bigquery-etl/blob/main/bigquery_etl/query_scheduling/generate_airflow_dags.py + +from airflow import DAG +from airflow.sensors.external_task import ExternalTaskMarker +from airflow.sensors.external_task import ExternalTaskSensor +from airflow.utils.task_group import TaskGroup +import datetime +from operators.gcp_container_operator import GKEPodOperator +from utils.constants import ALLOWED_STATES, FAILED_STATES +from utils.gcp import bigquery_etl_query, bigquery_dq_check + +docs = """ +### bqetl_merino_newtab_extract_to_gcs + +Built from bigquery-etl repo, [`dags/bqetl_merino_newtab_extract_to_gcs.py`](https://github.com/mozilla/bigquery-etl/blob/generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py) + +#### Description + +Aggregates Newtab engagement data that lands in a GCS bucket for Merino recommendations. + +#### Owner + +cbeck@mozilla.com + +#### Tags + +* impact/tier_1 +* repo/bigquery-etl +""" + + +default_args = { + "owner": "cbeck@mozilla.com", + "start_date": datetime.datetime(2024, 8, 12, 0, 0), + "end_date": None, + "email": ["cbeck@mozilla.com", "gkatre@mozilla.com"], + "depends_on_past": False, + "retry_delay": datetime.timedelta(seconds=300), + "email_on_failure": True, + "email_on_retry": False, + "retries": 2, +} + +tags = ["impact/tier_1", "repo/bigquery-etl"] + +with DAG( + "bqetl_merino_newtab_extract_to_gcs", + default_args=default_args, + schedule_interval="*/20 * * * *", + doc_md=docs, + tags=tags, +) as dag: + + checks__fail_telemetry_derived__newtab_merino_extract__v1 = bigquery_dq_check( + task_id="checks__fail_telemetry_derived__newtab_merino_extract__v1", + source_table="newtab_merino_extract_v1", + dataset_id="telemetry_derived", + project_id="moz-fx-data-shared-prod", + is_dq_check_fail=True, + owner="cbeck@mozilla.com", + email=["cbeck@mozilla.com", "gkatre@mozilla.com"], + depends_on_past=False, + task_concurrency=1, + retries=0, + ) + + telemetry_derived__newtab_merino_extract__v1 = bigquery_etl_query( + task_id="telemetry_derived__newtab_merino_extract__v1", + destination_table="newtab_merino_extract_v1", + dataset_id="telemetry_derived", + project_id="moz-fx-data-shared-prod", + owner="cbeck@mozilla.com", + email=["cbeck@mozilla.com", "gkatre@mozilla.com"], + date_partition_parameter=None, + depends_on_past=False, + task_concurrency=1, + ) + + checks__fail_telemetry_derived__newtab_merino_extract__v1.set_upstream( + telemetry_derived__newtab_merino_extract__v1 + ) Only in /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived: newtab_merino_extract_v1 diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql 2024-08-12 19:08:23.000000000 +0000 @@ -0,0 +1,16 @@ +-- macro checks + +#fail +{{ not_null(["scheduled_corpus_item_id"]) }} + +#fail +{{ is_unique(["scheduled_corpus_item_id"]) }} + +#fail +{{ not_null(["impression_count"]) }} + +#fail +{{ not_null(["click_count"]) }} + +#fail +{{ min_row_count(1) }} diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/metadata.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/metadata.yaml --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/metadata.yaml 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/metadata.yaml 2024-08-12 19:10:20.000000000 +0000 @@ -0,0 +1,28 @@ +friendly_name: Newtab Merino Extract +description: |- + Aggregated Newtab events for Merino recommendations. + See https://mozilla-hub.atlassian.net/browse/MC-1256 +owners: +- cbeck@mozilla.com +- gkatre@mozilla.com +labels: + owner: cbeck + dag: bqetl_merino_newtab_extract_to_gcs + owner1: cbeck + owner2: gkatre +scheduling: + dag_name: bqetl_merino_newtab_extract_to_gcs + date_partition_parameter: null +bigquery: + time_partitioning: null + range_partitioning: null + clustering: null +workgroup_access: +- role: roles/bigquery.dataViewer + members: + - workgroup:mozilla-confidential +references: + checks.sql: + - .. + query.sql: + - moz-fx-data-shared-prod.firefox_desktop_live.newtab_v1 diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql 2024-08-12 19:08:23.000000000 +0000 @@ -0,0 +1,47 @@ +WITH deduplicated_pings AS ( + SELECT + submission_timestamp, + document_id, + events, + FROM + `moz-fx-data-shared-prod.firefox_desktop_live.newtab_v1` + WHERE + submission_timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) + QUALIFY + ROW_NUMBER() OVER ( + PARTITION BY + DATE(submission_timestamp), + document_id + ORDER BY + submission_timestamp DESC + ) = 1 +), +flattened_newtab_events AS ( + SELECT + document_id, + submission_timestamp, + unnested_events.name AS event_name, + mozfun.map.get_key( + unnested_events.extra, + 'scheduled_corpus_item_id' + ) AS scheduled_corpus_item_id, + mozfun.map.get_key(unnested_events.extra, 'position') AS position, + COUNT(1) OVER (PARTITION BY document_id, unnested_events.name) AS user_event_count + FROM + deduplicated_pings, + UNNEST(events) AS unnested_events + --filter to Pocket events + WHERE + unnested_events.category = 'pocket' + AND unnested_events.name IN ('impression', 'click', 'save', 'dismiss') + --keep only data with a non-null scheduled corpus item ID + AND (mozfun.map.get_key(unnested_events.extra, 'scheduled_corpus_item_id') IS NOT NULL) +) +SELECT + scheduled_corpus_item_id, + SUM(CASE WHEN event_name = 'impression' THEN 1 ELSE 0 END) AS impression_count, + SUM(CASE WHEN event_name = 'click' THEN 1 ELSE 0 END) AS click_count +FROM + flattened_newtab_events +GROUP BY + 1; diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml 2024-08-12 19:08:23.000000000 +0000 @@ -0,0 +1,10 @@ +fields: +- mode: NULLABLE + name: scheduled_corpus_item_id + type: STRING +- mode: NULLABLE + name: impression_count + type: INTEGER +- mode: NULLABLE + name: click_count + type: INTEGER ```

Link to full diff

chelseybeck commented 1 month ago

r+wc, is Merino going to dedupe the scheduled_corpus_item_id downstream from here?

no, the last select groups by this column, so duplicates aren't expected :)

dataops-ci-bot commented 1 month ago

Integration report for "removing save and dismiss events from filter"

sql.diff

Click to expand! ```diff Only in /tmp/workspace/generated-sql/dags/: bqetl_merino_newtab_extract_to_gcs.py diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py /tmp/workspace/generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py --- /tmp/workspace/main-generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py 2024-08-12 20:00:59.000000000 +0000 @@ -0,0 +1,81 @@ +# Generated via https://github.com/mozilla/bigquery-etl/blob/main/bigquery_etl/query_scheduling/generate_airflow_dags.py + +from airflow import DAG +from airflow.sensors.external_task import ExternalTaskMarker +from airflow.sensors.external_task import ExternalTaskSensor +from airflow.utils.task_group import TaskGroup +import datetime +from operators.gcp_container_operator import GKEPodOperator +from utils.constants import ALLOWED_STATES, FAILED_STATES +from utils.gcp import bigquery_etl_query, bigquery_dq_check + +docs = """ +### bqetl_merino_newtab_extract_to_gcs + +Built from bigquery-etl repo, [`dags/bqetl_merino_newtab_extract_to_gcs.py`](https://github.com/mozilla/bigquery-etl/blob/generated-sql/dags/bqetl_merino_newtab_extract_to_gcs.py) + +#### Description + +Aggregates Newtab engagement data that lands in a GCS bucket for Merino recommendations. + +#### Owner + +cbeck@mozilla.com + +#### Tags + +* impact/tier_1 +* repo/bigquery-etl +""" + + +default_args = { + "owner": "cbeck@mozilla.com", + "start_date": datetime.datetime(2024, 8, 12, 0, 0), + "end_date": None, + "email": ["cbeck@mozilla.com", "gkatre@mozilla.com"], + "depends_on_past": False, + "retry_delay": datetime.timedelta(seconds=300), + "email_on_failure": True, + "email_on_retry": False, + "retries": 2, +} + +tags = ["impact/tier_1", "repo/bigquery-etl"] + +with DAG( + "bqetl_merino_newtab_extract_to_gcs", + default_args=default_args, + schedule_interval="*/20 * * * *", + doc_md=docs, + tags=tags, +) as dag: + + checks__fail_telemetry_derived__newtab_merino_extract__v1 = bigquery_dq_check( + task_id="checks__fail_telemetry_derived__newtab_merino_extract__v1", + source_table="newtab_merino_extract_v1", + dataset_id="telemetry_derived", + project_id="moz-fx-data-shared-prod", + is_dq_check_fail=True, + owner="cbeck@mozilla.com", + email=["cbeck@mozilla.com", "gkatre@mozilla.com"], + depends_on_past=False, + task_concurrency=1, + retries=0, + ) + + telemetry_derived__newtab_merino_extract__v1 = bigquery_etl_query( + task_id="telemetry_derived__newtab_merino_extract__v1", + destination_table="newtab_merino_extract_v1", + dataset_id="telemetry_derived", + project_id="moz-fx-data-shared-prod", + owner="cbeck@mozilla.com", + email=["cbeck@mozilla.com", "gkatre@mozilla.com"], + date_partition_parameter=None, + depends_on_past=False, + task_concurrency=1, + ) + + checks__fail_telemetry_derived__newtab_merino_extract__v1.set_upstream( + telemetry_derived__newtab_merino_extract__v1 + ) Only in /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived: newtab_merino_extract_v1 diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql 2024-08-12 19:54:21.000000000 +0000 @@ -0,0 +1,16 @@ +-- macro checks + +#fail +{{ not_null(["scheduled_corpus_item_id"]) }} + +#fail +{{ is_unique(["scheduled_corpus_item_id"]) }} + +#fail +{{ not_null(["impression_count"]) }} + +#fail +{{ not_null(["click_count"]) }} + +#fail +{{ min_row_count(1) }} diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/metadata.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/metadata.yaml --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/metadata.yaml 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/metadata.yaml 2024-08-12 19:56:27.000000000 +0000 @@ -0,0 +1,28 @@ +friendly_name: Newtab Merino Extract +description: |- + Aggregated Newtab events for Merino recommendations. + See https://mozilla-hub.atlassian.net/browse/MC-1256 +owners: +- cbeck@mozilla.com +- gkatre@mozilla.com +labels: + owner: cbeck + dag: bqetl_merino_newtab_extract_to_gcs + owner1: cbeck + owner2: gkatre +scheduling: + dag_name: bqetl_merino_newtab_extract_to_gcs + date_partition_parameter: null +bigquery: + time_partitioning: null + range_partitioning: null + clustering: null +workgroup_access: +- role: roles/bigquery.dataViewer + members: + - workgroup:mozilla-confidential +references: + checks.sql: + - .. + query.sql: + - moz-fx-data-shared-prod.firefox_desktop_live.newtab_v1 diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql 2024-08-12 19:54:21.000000000 +0000 @@ -0,0 +1,45 @@ +WITH deduplicated_pings AS ( + SELECT + submission_timestamp, + document_id, + events, + FROM + `moz-fx-data-shared-prod.firefox_desktop_live.newtab_v1` + WHERE + submission_timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) + QUALIFY + ROW_NUMBER() OVER ( + PARTITION BY + DATE(submission_timestamp), + document_id + ORDER BY + submission_timestamp DESC + ) = 1 +), +flattened_newtab_events AS ( + SELECT + document_id, + submission_timestamp, + unnested_events.name AS event_name, + mozfun.map.get_key( + unnested_events.extra, + 'scheduled_corpus_item_id' + ) AS scheduled_corpus_item_id, + FROM + deduplicated_pings, + UNNEST(events) AS unnested_events + --filter to Pocket events + WHERE + unnested_events.category = 'pocket' + AND unnested_events.name IN ('impression', 'click') + --keep only data with a non-null scheduled corpus item ID + AND (mozfun.map.get_key(unnested_events.extra, 'scheduled_corpus_item_id') IS NOT NULL) +) +SELECT + scheduled_corpus_item_id, + SUM(CASE WHEN event_name = 'impression' THEN 1 ELSE 0 END) AS impression_count, + SUM(CASE WHEN event_name = 'click' THEN 1 ELSE 0 END) AS click_count +FROM + flattened_newtab_events +GROUP BY + scheduled_corpus_item_id diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml --- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml 1970-01-01 00:00:00.000000000 +0000 +++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml 2024-08-12 19:54:21.000000000 +0000 @@ -0,0 +1,10 @@ +fields: +- mode: NULLABLE + name: scheduled_corpus_item_id + type: STRING +- mode: NULLABLE + name: impression_count + type: INTEGER +- mode: NULLABLE + name: click_count + type: INTEGER ```

Link to full diff

fbertsch commented 1 month ago

r+wc, is Merino going to dedupe the scheduled_corpus_item_id downstream from here?

no, the last select groups by this column, so duplicates aren't expected :)

Yeah but if you upload this file every 20 minutes, then whatever reads from that file will see dupes (each 20 minute period will be uploaded 72 times).

chelseybeck commented 1 month ago

r+wc, is Merino going to dedupe the scheduled_corpus_item_id downstream from here?

no, the last select groups by this column, so duplicates aren't expected :)

Yeah but if you upload this file every 20 minutes, then whatever reads from that file will see dupes (each 20 minute period will be uploaded 72 times).

ah, i see...merino must handle that...we'll check once ml is able to test on their end, but i assume it does b/c it re-ranks the recommendations based on engagement