mozilla / bigquery-etl

Bigquery ETL
https://mozilla.github.io/bigquery-etl
Mozilla Public License 2.0
259 stars 101 forks source link

Reduce complexity of telemetry.main view #1971

Open jklukas opened 3 years ago

jklukas commented 3 years ago

We keep run into situations where user queries or ETL queries on top of telemetry.main start to fail with "query too complex" errors as schemas grow or the end-state queries need to add more logic.

A big part of the issue is that the main view does several nested SELECT AS STRUCT queries to "clean" deeply nested fields. It also makes several UDF references, which add to query complexity. We likely could reduce complexity of the view significantly by listing out struct fields explicitly and injecting the field cleaning logic directly into the view, avoiding UDF calls.

Here is a query constructed by hand that unrolls most of the complex parts of the view:

SELECT
  * REPLACE ( STRUCT(metadata.geo,
      STRUCT(metadata.header.date,
        metadata.header.dnt,
        metadata.header.x_debug_id,
        metadata.header.x_pingsender_version,
        metadata.header.x_source_tags,
        SAFE.PARSE_TIMESTAMP( '%a, %d %b %Y %T %Z',
          -- Even though it's not part of the spec, many clients append
          -- '+00:00' to their Date headers, so we strip that suffix.
          REPLACE(metadata.header.`date`, 'GMT+00:00', 'GMT') ) AS parsed_date,
        ARRAY(
        SELECT
          TRIM(t)
        FROM
          UNNEST(SPLIT(metadata.header.x_source_tags, ',')) t ) AS parsed_x_source_tags ) AS header,
      --metadata.header,
      metadata.isp,
      metadata.uri,
      metadata.user_agent) AS metadata,
    --mozfun.norm.metadata(metadata) AS metadata,
    STRUCT(payload.histograms,
      (
      SELECT
        AS STRUCT payload.info.* REPLACE (
        IF
          (
            -- Bug 1592012 - some clients report bogus negative session lengths;
            -- we also filter out values longer than 1 year, which is highly unlikely.
            payload.info.session_length BETWEEN 0
            AND 31536000,
            payload.info.session_length,
            NULL ) AS session_length ) ) AS info,
      --payload.info,
      payload.keyed_histograms,
      payload.late_writes,
      payload.processes,
      payload.simple_measurements,
      payload.slow_sql,
      payload.ui_measurements,
      payload.ver) AS payload
    --mozdata.udf.normalize_main_payload(payload) AS payload
    )
FROM
  `moz-fx-data-shared-prod.telemetry_stable.main_v4`

These pieces of the main_v4 schema do not change often, but we would need to update the view whenever we do add fields to metadata or payload or payload.info. A great future state would be that the generate_stable_view module actually walks the BQ schema and generates the list of fields, but that would require significantly more code investment.

jklukas commented 3 years ago

@scholtzan - I think you've hit this a few times. Do you have an opinion about whether it's worth investing in this?

scholtzan commented 3 years ago

I think it might be something worth considering. I have a couple of not too complex queries that had to switch to telemetry_stable.main_v4 either because the query was too deemed to complex by BigQuery or not enough computing resources were available when using telemetry.main. So, for me there have been quite a few times where telemetry.main has become unusable.