mozilla / bigquery-etl

Bigquery ETL
https://mozilla.github.io/bigquery-etl
Mozilla Public License 2.0
259 stars 101 forks source link

Schema update with `--update_downstream` not working #2278

Closed jklukas closed 3 years ago

jklukas commented 3 years ago

I am attempting to add a new field to clients_daily_v6 following the common workflows documentation.

I've added the new field to the query, but running the schema update only updates clients_daily:

$ ./bqetl query schema update telemetry_derived.clients_daily_v6 --update_downstream
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.protobuf.UnsafeUtil (file:/Users/jeffklukas/Code/zbqetl/target/dependency/protobuf-java-3.6.1.jar) to field java.nio.Buffer.address
WARNING: Please consider reporting this to the maintainers of com.google.protobuf.UnsafeUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
telemetry_derived/foo/query.sql                             OK
Cannot get Bigquery schema for mozilla-pipeline-schemas-3360ed7b0cfec9e1fde6e94bc884b1c6a34234ca/schemas/glean/glean/glean.1.schema.json: "filename 'mozilla-pipeline-schemas-3360ed7b0cfec9e1fde6e94bc884b1c6a34234ca/schemas/glean/glean/glean.1.bq' not found"
Cannot get Bigquery schema for mozilla-pipeline-schemas-3360ed7b0cfec9e1fde6e94bc884b1c6a34234ca/schemas/metadata/credentials/credentials.1.schema.json: "filename 'mozilla-pipeline-schemas-3360ed7b0cfec9e1fde6e94bc884b1c6a34234ca/schemas/metadata/credentials/credentials.1.bq' not found"
Cannot get Bigquery schema for mozilla-pipeline-schemas-3360ed7b0cfec9e1fde6e94bc884b1c6a34234ca/schemas/metadata/metaschema/metaschema.1.schema.json: "filename 'mozilla-pipeline-schemas-3360ed7b0cfec9e1fde6e94bc884b1c6a34234ca/schemas/metadata/metaschema/metaschema.1.bq' not found"
Cannot get Bigquery schema for mozilla-pipeline-schemas-3360ed7b0cfec9e1fde6e94bc884b1c6a34234ca/schemas/metadata/pioneer-ingestion/pioneer-ingestion.1.schema.json: "filename 'mozilla-pipeline-schemas-3360ed7b0cfec9e1fde6e94bc884b1c6a34234ca/schemas/metadata/pioneer-ingestion/pioneer-ingestion.1.bq' not found"
Cannot get Bigquery schema for mozilla-pipeline-schemas-3360ed7b0cfec9e1fde6e94bc884b1c6a34234ca/schemas/metadata/sources/sources.1.schema.json: "filename 'mozilla-pipeline-schemas-3360ed7b0cfec9e1fde6e94bc884b1c6a34234ca/schemas/metadata/sources/sources.1.bq' not found"
Cannot get Bigquery schema for mozilla-pipeline-schemas-3360ed7b0cfec9e1fde6e94bc884b1c6a34234ca/schemas/metadata/structured-ingestion/structured-ingestion.1.schema.json: "filename 'mozilla-pipeline-schemas-3360ed7b0cfec9e1fde6e94bc884b1c6a34234ca/schemas/metadata/structured-ingestion/structured-ingestion.1.bq' not found"
Cannot get Bigquery schema for mozilla-pipeline-schemas-3360ed7b0cfec9e1fde6e94bc884b1c6a34234ca/schemas/metadata/telemetry-ingestion/telemetry-ingestion.1.schema.json: "filename 'mozilla-pipeline-schemas-3360ed7b0cfec9e1fde6e94bc884b1c6a34234ca/schemas/metadata/telemetry-ingestion/telemetry-ingestion.1.bq' not found"
sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/query.sql OK
moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/query.sql OK
Schema sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/schema.yaml updated.

I was expecting several downstream tables to get updated. I'm wondering if this is a regression related to changes in https://github.com/mozilla/bigquery-etl/pull/2244

cc @scholtzan

jklukas commented 3 years ago

See the change I'm attempting in https://github.com/mozilla/bigquery-etl/compare/background_update?expand=1

scholtzan commented 3 years ago

Did you try undoing all changes in clients_daily_v6/schema.yaml and then run the command again? What happens is that if clients_daily_v6/schema.yaml got updated earlier, and then the update command gets executed again it appears as if no new changes have been made to the schema and downstream dependencies won't get analyzed.

jklukas commented 3 years ago

Did you try undoing all changes in clients_daily_v6/schema.yaml and then run the command again? What happens is that if clients_daily_v6/schema.yaml got updated earlier, and then the update command gets executed again it appears as if no new changes have been made to the schema and downstream dependencies won't get analyzed.

You were correct on this. I undid the schema changes, and it determined dependencies correctly, but looks to be hitting other errors now:

Update downstream dependency schema for sql/moz-fx-data-shared-prod/internet_outages/global_outages_v1/query.sql
Update downstream dependency schema for sql/moz-fx-data-shared-prod/messaging_system/onboarding_retention_events_amplitude/view.sql
Update downstream dependency schema for sql/moz-fx-data-shared-prod/search_derived/search_clients_daily_v8/query.sql
Update downstream dependency schema for sql/moz-fx-data-shared-prod/telemetry/clients_daily/view.sql
Update downstream dependency schema for sql/moz-fx-data-shared-prod/telemetry/clients_daily_v6/view.sql
Update downstream dependency schema for sql/moz-fx-data-shared-prod/telemetry/windows_10_aggregate/view.sql
Update downstream dependency schema for sql/moz-fx-data-shared-prod/telemetry/windows_10_build_distribution/view.sql
Update downstream dependency schema for sql/moz-fx-data-shared-prod/telemetry/windows_10_patch_adoption/view.sql
Update downstream dependency schema for sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_joined_v1/query.sql
Update downstream dependency schema for sql/moz-fx-data-shared-prod/telemetry_derived/clients_first_seen_v1/query.sql
Update downstream dependency schema for sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_v1/query.sql
Update downstream dependency schema for sql/moz-fx-data-shared-prod/telemetry_derived/devtools_panel_usage_v1/query.sql
Update downstream dependency schema for sql/moz-fx-data-shared-prod/telemetry_derived/experiments_daily_active_clients_v1/query.sql
Update downstream dependency schema for sql/moz-fx-data-shared-prod/telemetry_derived/italy_covid19_outage_v1/query.sql
sql/moz-fx-data-shared-prod/internet_outages/global_outages_v1/query.sql OK
moz-fx-data-shared-prod/internet_outages/global_outages_v1/query.sql OK
Schema sql/moz-fx-data-shared-prod/internet_outages/global_outages_v1/schema.yaml updated.
No metadata defined for sql/moz-fx-data-shared-prod/messaging_system/onboarding_retention_events_amplitude/view.sql
sql/moz-fx-data-shared-prod/messaging_system/onboarding_retention_events_amplitude/view.sql OK
moz-fx-data-shared-prod/messaging_system/onboarding_retention_events_amplitude/query.sql ERROR
 [{'code': 400, 'errors': [{'message': "Cannot query over table 'moz-fx-data-shared-prod.messaging_system_derived.onboarding_users_last_seen_v1' without a filter over column(s) 'submission_date' that can be used for partition elimination", 'domain': 'global', 'reason': 'invalidQuery', 'location': 'q', 'locationType': 'parameter'}], 'response': {'headers': {'alt-svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"', 'cache-control': 'private', 'content-encoding': 'gzip', 'content-type': 'application/json; charset=UTF-8', 'date': 'Fri, 20 Aug 2021 19:13:34 GMT', 'server': 'ESF', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'x-content-type-options': 'nosniff', 'x-frame-options': 'SAMEORIGIN', 'x-xss-protection': '0'}}, 'message': "Cannot query over table 'moz-fx-data-shared-prod.messaging_system_derived.onboarding_users_last_seen_v1' without a filter over column(s) 'submission_date' that can be used for partition elimination"}]
Cannot get schema for moz-fx-data-shared-prod.messaging_system.onboarding_retention_events_amplitude: Error when dry running SQL file moz-fx-data-shared-prod/messaging_system/onboarding_retention_events_amplitude/query.sql
Traceback (most recent call last):
  File "/Users/jeffklukas/Code/zbqetl/venv/bin/bqetl", line 33, in <module>
    sys.exit(load_entry_point('mozilla-bigquery-etl', 'console_scripts', 'bqetl')())
  File "/Users/jeffklukas/Code/zbqetl/bigquery_etl/cli/__init__.py", line 56, in cli
    group(prog_name=prog_name)
  File "/Users/jeffklukas/Code/zbqetl/venv/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/Users/jeffklukas/Code/zbqetl/venv/lib/python3.9/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/Users/jeffklukas/Code/zbqetl/venv/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jeffklukas/Code/zbqetl/venv/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jeffklukas/Code/zbqetl/venv/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jeffklukas/Code/zbqetl/venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/jeffklukas/Code/zbqetl/venv/lib/python3.9/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/Users/jeffklukas/Code/zbqetl/bigquery_etl/cli/query.py", line 791, in update
    changed = _update_query_schema(
  File "/Users/jeffklukas/Code/zbqetl/bigquery_etl/cli/query.py", line 1008, in _update_query_schema
    query_schema.merge(table_schema)
  File "/Users/jeffklukas/Code/zbqetl/bigquery_etl/schema/__init__.py", line 97, in merge
    self.schema["fields"],
KeyError: 'fields'
jklukas commented 3 years ago

I believe the just-merged #2275 may fix the "fields" error.

And I just filed https://github.com/mozilla/bigquery-etl/pull/2279 to remove unneeded Amplitude views, which should fix the onboarding_retention_events_amplitude/query.sql ERROR.

jklukas commented 3 years ago

I believe the just-merged #2275 may fix the "fields" error.

Looking at this more closely, it's not the glam queries causing this error. It's coming from query_schema.merge, where the query_schema I believe is empty because a dry run failed.

jklukas commented 3 years ago

I think the update_downstream logic is not supposed to consider view files, but it's processing views and getting confused with the missing schemas.

scholtzan commented 3 years ago

You are right. https://github.com/mozilla/bigquery-etl/pull/2280 will ensure that only query.sql files are considered