Open wlach opened 3 years ago
The performance is even worse when no event matches the criteria specified in step 1 and step 2. No idea why.
I had a try at speeding this up with a JavaScript UDF, I figured I might be able to reduce the number of time spent evaluating different regexes that way. Slot time seems to go down from ~28 minutes to 7 minutes which is a nice improvement, though the numbers seem to be a bit off. We'd need to rewrite our regex generation if we took this approach, I'm doing some hacky stuff below but it's not generating quite the right results:
CREATE TEMP FUNCTION
udf_js_regex_match (regex STRING, input STRING)
RETURNS int64
LANGUAGE js AS """
if (input == null || !input.length) {
return 0;
}
re = new RegExp(regex.replace(/\\Q/g, '').replace(/\\E/g, '').replace('(?:.*?)', '.*').replace(/\\?:/g, '').slice(1,-2) + "?)");
let match = input.match(re)
return match ? match.filter(m=>m !== undefined).length : 0;
""";
WITH
step_1 AS (
SELECT
mozfun.event_analysis.aggregate_match_strings( ARRAY_AGG( CONCAT(property_value.value, mozfun.event_analysis.event_index_to_match_string(et.INDEX)))) AS match_string
FROM
`mozdata.fenix.event_types` AS et
CROSS JOIN
UNNEST(event_properties) AS properties
CROSS JOIN
UNNEST(properties.value) AS property_value
WHERE
(category = 'events')
AND (event = 'app_opened')
AND 1=1 -- no filter on 'step_1.property_value'
),
step_2 AS (
SELECT
mozfun.event_analysis.aggregate_match_strings( ARRAY_AGG( CONCAT(property_value.value,mozfun.event_analysis.event_index_to_match_string(et.INDEX)))) AS match_string
FROM
`mozdata.fenix.event_types` AS et
CROSS JOIN
UNNEST(event_properties) AS properties
CROSS JOIN
UNNEST(properties.value) AS property_value
WHERE
(category = 'context_menu')
AND (event = 'item_tapped')
AND (properties.key = 'named')
AND (property_value.key = 'share_link')),
results as (
SELECT
udf_js_regex_match(mozfun.event_analysis.create_funnel_regex([step_1.match_string, step_2.match_string], TRUE), funnel_analysis.events) AS funnel_analysis_count_completed_step_2
FROM
`mozdata.fenix.events_daily` AS funnel_analysis
CROSS JOIN
step_1
CROSS JOIN
step_2
WHERE
funnel_analysis.submission_date = DATE('2021-10-18')
)
select countif(funnel_analysis_count_completed_step_2 > 1) as step1, countif(funnel_analysis_count_completed_step_2 > 2) as step2 from results
Update: we have opened a support request with GCP to diagnose and hopefully provide some recommendations
Update: we have opened a support request with GCP to diagnose and hopefully provide some recommendations
Google looked into it, but basically reproduced the same set of findings that are already in this issue: repeated use of regular expressions in BigQuery is (relatively) slow.
Feedback from Google suggests that the query as I'm testing it is only using a small number of slots (87) which may be partly responsible for the slowness. I think we're using on-demand pricing with GCP + Looker, which might limit the number of slots allocated to this query. Something to bear in mind if this comes up as a big issue for people.
The following query (generated by the event funnel explore) is slower than expected:
According to BigQuery, it's only processing 2.4gb worth of data but when I run it in the console (in the mozdata project), it takes a full 21.4 seconds to run. Looking at the query profile, this segment sticks out:
It appears this corresponds to the REGEXP_CONTAINS part in the latter part of the query. It seems most likely that it's the repeated calls to REGEXP_CONTAINS that are slow.