mozilla / lookml-generator

LookML Generator for Glean and Mozilla Data
https://mozilla.github.io/lookml-generator/generator.html
Mozilla Public License 2.0
17 stars 17 forks source link

Event funnel queries are slow #282

Open wlach opened 3 years ago

wlach commented 3 years ago

The following query (generated by the event funnel explore) is slower than expected:

WITH
  step_1 AS (
  SELECT
    mozfun.event_analysis.aggregate_match_strings( ARRAY_AGG( CONCAT(mozfun.event_analysis.escape_metachars(property_value.value), mozfun.event_analysis.event_index_to_match_string(et.INDEX)))) AS match_string
  FROM
    `mozdata.fenix.event_types` AS et
  CROSS JOIN
    UNNEST(event_properties) AS properties
  CROSS JOIN
    UNNEST(properties.value) AS property_value
  WHERE
    (category = 'events')
    AND (event = 'app_opened')
    ),
  step_2 AS (
  SELECT
    mozfun.event_analysis.aggregate_match_strings( ARRAY_AGG( CONCAT(mozfun.event_analysis.escape_metachars(property_value.value),mozfun.event_analysis.event_index_to_match_string(et.INDEX)))) AS match_string
  FROM
    `mozdata.fenix.event_types` AS et
  CROSS JOIN
    UNNEST(event_properties) AS properties
  CROSS JOIN
    UNNEST(properties.value) AS property_value
  WHERE
    (category = 'context_menu')
    AND (event = 'item_tapped')
    AND (properties.key = 'named')
    AND (property_value.key = 'share_link') )
SELECT
  COUNT(CASE
      WHEN REGEXP_CONTAINS(funnel_analysis.events, mozfun.event_analysis.create_funnel_regex([step_1.match_string], TRUE)) THEN 1
    ELSE
    NULL
  END
    ) AS funnel_analysis_count_completed_step_1,
  COUNT(CASE
      WHEN (REGEXP_CONTAINS(funnel_analysis.events, mozfun.event_analysis.create_funnel_regex([step_1.match_string], TRUE))) AND (REGEXP_CONTAINS(funnel_analysis.events, mozfun.event_analysis.create_funnel_regex([step_1.match_string, step_2.match_string], TRUE))) THEN 1
    ELSE
    NULL
  END
    ) AS funnel_analysis_count_completed_step_2
FROM
  `mozdata.fenix.events_daily` AS funnel_analysis
CROSS JOIN
  step_1
CROSS JOIN
  step_2
WHERE
  funnel_analysis.submission_date = DATE('2021-10-18')
LIMIT
  500

According to BigQuery, it's only processing 2.4gb worth of data but when I run it in the console (in the mozdata project), it takes a full 21.4 seconds to run. Looking at the query profile, this segment sticks out:

image

It appears this corresponds to the REGEXP_CONTAINS part in the latter part of the query. It seems most likely that it's the repeated calls to REGEXP_CONTAINS that are slow.

wlach commented 3 years ago

The performance is even worse when no event matches the criteria specified in step 1 and step 2. No idea why.

wlach commented 3 years ago

I had a try at speeding this up with a JavaScript UDF, I figured I might be able to reduce the number of time spent evaluating different regexes that way. Slot time seems to go down from ~28 minutes to 7 minutes which is a nice improvement, though the numbers seem to be a bit off. We'd need to rewrite our regex generation if we took this approach, I'm doing some hacky stuff below but it's not generating quite the right results:

CREATE TEMP FUNCTION
  udf_js_regex_match (regex STRING, input STRING)
  RETURNS int64
  LANGUAGE js AS """

    if (input == null || !input.length) {
      return 0;
    }
    re = new RegExp(regex.replace(/\\Q/g, '').replace(/\\E/g, '').replace('(?:.*?)', '.*').replace(/\\?:/g, '').slice(1,-2) + "?)");
    let match = input.match(re)
    return match ? match.filter(m=>m !== undefined).length : 0;
""";

WITH
  step_1 AS (
  SELECT
    mozfun.event_analysis.aggregate_match_strings( ARRAY_AGG( CONCAT(property_value.value, mozfun.event_analysis.event_index_to_match_string(et.INDEX)))) AS match_string
  FROM
    `mozdata.fenix.event_types` AS et
  CROSS JOIN
    UNNEST(event_properties) AS properties
  CROSS JOIN
    UNNEST(properties.value) AS property_value
  WHERE
    (category = 'events')
    AND (event = 'app_opened')
    AND 1=1 -- no filter on 'step_1.property_value'
    ),
  step_2 AS (
  SELECT
    mozfun.event_analysis.aggregate_match_strings( ARRAY_AGG( CONCAT(property_value.value,mozfun.event_analysis.event_index_to_match_string(et.INDEX)))) AS match_string
  FROM
    `mozdata.fenix.event_types` AS et
  CROSS JOIN
    UNNEST(event_properties) AS properties
  CROSS JOIN
    UNNEST(properties.value) AS property_value
  WHERE
    (category = 'context_menu')
    AND (event = 'item_tapped')
    AND (properties.key = 'named')
    AND (property_value.key = 'share_link')),
    results as (
SELECT
  udf_js_regex_match(mozfun.event_analysis.create_funnel_regex([step_1.match_string, step_2.match_string], TRUE), funnel_analysis.events) AS funnel_analysis_count_completed_step_2
FROM
  `mozdata.fenix.events_daily` AS funnel_analysis
CROSS JOIN
  step_1
CROSS JOIN
  step_2
WHERE
  funnel_analysis.submission_date = DATE('2021-10-18')
)
select countif(funnel_analysis_count_completed_step_2 > 1) as step1, countif(funnel_analysis_count_completed_step_2 > 2) as step2 from results
wlach commented 3 years ago

Update: we have opened a support request with GCP to diagnose and hopefully provide some recommendations

wlach commented 3 years ago

Update: we have opened a support request with GCP to diagnose and hopefully provide some recommendations

Google looked into it, but basically reproduced the same set of findings that are already in this issue: repeated use of regular expressions in BigQuery is (relatively) slow.

wlach commented 3 years ago

Feedback from Google suggests that the query as I'm testing it is only using a small number of slots (87) which may be partly responsible for the slowness. I think we're using on-demand pricing with GCP + Looker, which might limit the number of slots allocated to this query. Something to bear in mind if this comes up as a big issue for people.