For context: we are using Malloy to see whether it can help us define metrics in a composable way, rather than defining them directly in SQL.
We have event data from our tracking solution, with events split into a hierarchy of:
sessions - A consumer visits our website and starts a session
views - They view pages on the website
UI interactions - They click buttons on the website
We are using Malloy to calculate session-level metrics such as the number of distinct pages viewed, the number of times button X was clicked, and so on.
There are one-to-many relationships between these 3 models:
Sessions can have many views, and views can have many UI interactions.
Malloy does a great job of producing correct SQL, but if we calculate the metrics in a single run query it is very slow: on our very large dataset it takes ~10 minutes.
If we instead split the query up, calculate the aggregates for each source separately, and join them afterwards (by taking the compiled SQL of each run query and putting it into 3 CTEs), the query executes in ~20 seconds, which is a significant speed-up; a SQL sketch of that rewrite is shown after the split queries below.
To Reproduce
It's hard to reproduce without a very large dataset, but here is a hard-coded version of the data model we are working with, to give you the idea:
source: consumer_views_sql is bigquery.sql("""
  SELECT "u1" as unique_row_id, "view1" as view_id, "session1" as session_id, "2024-01-01" as view_start_date, "advert_view" as page_name
  UNION ALL
  SELECT "u2" as unique_row_id, "view2" as view_id, "session1" as session_id, "2024-01-01" as view_start_date, "search_view" as page_name
  UNION ALL
  SELECT "u3" as unique_row_id, "view3" as view_id, "session1" as session_id, "2024-01-01" as view_start_date, "search_view" as page_name
""")
source: consumer_interactions_sql is bigquery.sql("""
  SELECT "a" as unique_row_id, "session1" as session_id, "2024-01-01" as interaction_date, "button1" as ui_label, "view1" as view_id
  UNION ALL
  SELECT "b" as unique_row_id, "session1" as session_id, "2024-01-01" as interaction_date, "call" as ui_label, "view1" as view_id
  UNION ALL
  SELECT "c" as unique_row_id, "session1" as session_id, "2024-01-01" as interaction_date, "live-chat-provider1" as ui_label, "view1" as view_id
""")
source: consumer_sessions_sql is bigquery.sql("""
  SELECT "a" as unique_row_id, "session1" as session_id, "2024-01-01 15:00:00" as session_end_time, "2024-01-01" as session_start_date, "user1" as signed_in_id
""")
source: consumer_views is consumer_views_sql extend {
  primary_key: unique_row_id
  dimension:
    metric_date is view_start_date
    is_advert_view is ends_with(page_name, "advert_view")
    is_search_view is ends_with(page_name, "search_view")
  measure:
    view_count is count()
    search_views is view_count { where: is_search_view }
    advert_views is view_count { where: is_advert_view }
}
source: consumer_ui_interactions is consumer_interactions_sql extend {
  primary_key: unique_row_id
  dimension:
    metric_date is interaction_date
    is_chat_interaction is (ui_label ~ 'live-chat%')
    is_click_to_call_interaction is ui_label ? 'call button' | 'call'
  measure:
    interaction_count is count()
    view_count is count(view_id)
    chat_interactions is interaction_count { where: is_chat_interaction }
    click_to_call_interactions is interaction_count { where: is_click_to_call_interaction }
  join_one: consumer_views on view_id = consumer_views.view_id and interaction_date = consumer_views.view_start_date
}
source: consumer_sessions is consumer_sessions_sql extend {
  primary_key: unique_row_id
  dimension:
    logged_in is signed_in_id != null
    metric_date is session_start_date
  measure:
    session_count is count(session_id)
    max_session_end_time is max(session_end_time)
  join_many: consumer_views on session_id = consumer_views.session_id and session_start_date = consumer_views.view_start_date
  join_many: consumer_ui_interactions on session_id = consumer_ui_interactions.session_id and session_start_date = consumer_ui_interactions.interaction_date
}
-- This is very slow!
run: consumer_sessions -> {
  group_by: metric_date, session_id
  aggregate:
    max_session_end_time
    consumer_views.advert_views
    consumer_views.search_views
    consumer_ui_interactions.interaction_count
    consumer_ui_interactions.chat_interactions
    consumer_ui_interactions.click_to_call_interactions
}
-- Splitting out into 3 CTEs and joining on session_id/date is much faster
run: consumer_sessions -> {
  group_by: metric_date, session_id
  aggregate: max_session_end_time
}
run: consumer_views -> {
  group_by: metric_date, session_id
  aggregate:
    advert_views
    search_views
}
run: consumer_ui_interactions -> {
  group_by: metric_date, session_id
  aggregate:
    interaction_count
    chat_interactions
    click_to_call_interactions
}
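For reference, the hand-assembled fast version looks roughly like this. This is a sketch rather than the exact compiled SQL: the CTE names are ours, and each CTE body stands in for the compiled SQL of the corresponding run query above (with the source names standing in for the inline sample data):
-- Sketch: aggregate each source down to the session_id/metric_date grain
-- first, then join the already-small per-grain results together.
WITH session_aggs AS (
  SELECT session_start_date AS metric_date, session_id,
    MAX(session_end_time) AS max_session_end_time
  FROM consumer_sessions_sql
  GROUP BY 1, 2
),
view_aggs AS (
  SELECT view_start_date AS metric_date, session_id,
    COUNT(DISTINCT IF(ENDS_WITH(page_name, 'advert_view'), unique_row_id, NULL)) AS advert_views,
    COUNT(DISTINCT IF(ENDS_WITH(page_name, 'search_view'), unique_row_id, NULL)) AS search_views
  FROM consumer_views_sql
  GROUP BY 1, 2
),
interaction_aggs AS (
  SELECT interaction_date AS metric_date, session_id,
    COUNT(DISTINCT unique_row_id) AS interaction_count,
    COUNT(DISTINCT IF(ui_label LIKE 'live-chat%', unique_row_id, NULL)) AS chat_interactions,
    COUNT(DISTINCT IF(ui_label IN ('call button', 'call'), unique_row_id, NULL)) AS click_to_call_interactions
  FROM consumer_interactions_sql
  GROUP BY 1, 2
)
SELECT
  s.metric_date, s.session_id, s.max_session_end_time,
  v.advert_views, v.search_views,
  i.interaction_count, i.chat_interactions, i.click_to_call_interactions
FROM session_aggs AS s
LEFT JOIN view_aggs AS v USING (session_id, metric_date)
LEFT JOIN interaction_aggs AS i USING (session_id, metric_date)
Each COUNT(DISTINCT ...) now runs over its own source's rows at their natural grain, so nothing has to be deduplicated across a fanned-out join.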
The slow query produces this SQL:
SELECT
  base.session_start_date as `metric_date`,
  base.session_id as `session_id`,
  max(base.session_end_time) as `max_session_end_time`,
  (COUNT(DISTINCT CASE WHEN (COALESCE(ENDS_WITH(consumer_views_0.page_name, 'advert_view'), false)) THEN consumer_views_0.unique_row_id END)) as `advert_views`,
  (COUNT(DISTINCT CASE WHEN (COALESCE(ENDS_WITH(consumer_views_0.page_name, 'search_view'), false)) THEN consumer_views_0.unique_row_id END)) as `search_views`,
  COUNT(DISTINCT consumer_ui_interactions_0.unique_row_id) as `interaction_count`,
  (COUNT(DISTINCT CASE WHEN (((consumer_ui_interactions_0.ui_label LIKE 'live-chat%'))) THEN consumer_ui_interactions_0.unique_row_id END)) as `chat_interactions`,
  (COUNT(DISTINCT CASE WHEN (consumer_ui_interactions_0.ui_label IN ('call button','call')) THEN consumer_ui_interactions_0.unique_row_id END)) as `click_to_call_interactions`
FROM (
  SELECT "a" as unique_row_id, "session1" as session_id, "2024-01-01 15:00:00" as session_end_time, "2024-01-01" as session_start_date, "user1" as signed_in_id
) as base
LEFT JOIN (
  SELECT "u1" as unique_row_id, "view1" as view_id, "session1" as session_id, "2024-01-01" as view_start_date, "advert_view" as page_name
  UNION ALL
  SELECT "u2" as unique_row_id, "view2" as view_id, "session1" as session_id, "2024-01-01" as view_start_date, "search_view" as page_name
  UNION ALL
  SELECT "u3" as unique_row_id, "view3" as view_id, "session1" as session_id, "2024-01-01" as view_start_date, "search_view" as page_name
) AS consumer_views_0
  ON (base.session_id=consumer_views_0.session_id) and (base.session_start_date=consumer_views_0.view_start_date)
LEFT JOIN (
  SELECT "a" as unique_row_id, "session1" as session_id, "2024-01-01" as interaction_date, "button1" as ui_label, "view1" as view_id
  UNION ALL
  SELECT "b" as unique_row_id, "session1" as session_id, "2024-01-01" as interaction_date, "call" as ui_label, "view1" as view_id
  UNION ALL
  SELECT "c" as unique_row_id, "session1" as session_id, "2024-01-01" as interaction_date, "live-chat-provider1" as ui_label, "view1" as view_id
) AS consumer_ui_interactions_0
  ON (base.session_id=consumer_ui_interactions_0.session_id) and (base.session_start_date=consumer_ui_interactions_0.interaction_date)
GROUP BY 1,2
ORDER BY 3 desc
I think this is happening because the generated SQL fans the session and view rows out to the UI-interaction grain: views and UI interactions are both joined directly to sessions, so each session's views end up paired with each of its interactions before the GROUP BY. There are many more UI interactions than views, and many more views than sessions, so BigQuery wastes a lot of time computing COUNT(DISTINCT ...) over view- and session-grain values that have been fanned out to the UI-interaction grain.
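As a minimal, hypothetical illustration of that cross-join fan-out (the query and numbers below are ours, not from our data):
-- Hypothetical illustration: one session with 3 views and 3 interactions
-- produces 3 x 3 = 9 joined rows, every one of which each
-- COUNT(DISTINCT ...) has to scan and deduplicate.
SELECT COUNT(*) AS fanned_out_rows  -- returns 9
FROM (SELECT "session1" AS session_id) AS base
LEFT JOIN (
  SELECT "session1" AS session_id, v AS view_id
  FROM UNNEST(["view1", "view2", "view3"]) AS v
) AS views ON base.session_id = views.session_id
LEFT JOIN (
  SELECT "session1" AS session_id, i AS interaction_id
  FROM UNNEST(["a", "b", "c"]) AS i
) AS interactions ON base.session_id = interactions.session_id
In general, a session with V views and I interactions on a given date produces V × I joined rows, so the pre-aggregation row count grows multiplicatively.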
OS: MacOS
Malloy Client: VS Code
Malloy Client Version: v0.2.1731440058
Database Connection: BigQuery