malloydata / malloy

Malloy is an experimental language for describing data relationships and transformations.
http://www.malloydata.dev
MIT License
2k stars 76 forks source link

Malloy generates slow BigQuery queries when calculating metrics across joins #2019

Open ccharlesgb opened 4 days ago

ccharlesgb commented 4 days ago

What happens?

For context. We are using Malloy to see if it can help us define metrics in a composable way rather than using SQL.

We have some event data from our tracking solution that has events split up into a hierarchy of:

  1. sessions - A consumer visits our website and starts a session
  2. views - They view pages on the website
  3. UI Interactions - They click buttons on the website

We are using Malloy to calculate session metrics such as number of different pages viewed and number of button X clicked etc.

There is one to many relationship on these 3 models:

  1. Sessions can have many views, Views can have many UI interactions.

Malloy does a great job of producing the correct SQL but if we calculate the metrics using one run query it is very slow (This is a very large dataset) and takes ~10 minutes to calculate the metrics.

If we split the query up to calculate the aggregates for each source separately and then join them up after (By taking the compiled SQL of each run query and putting them in 3 CTEs) then the query executes in ~20 seconds which is a significant speed up.

To Reproduce

It's hard to reproduce without having a very large dataset but here is a hard-coded version of the data model we are working with to give you the idea:

source: consumer_views_sql is bigquery.sql("""
  SELECT "u1" as unique_row_id, "view1" as view_id, "session1" as session_id, "2024-01-01" as view_start_date, "advert_view" as page_name
  UNION ALL
  SELECT "u2" as unique_row_id, "view2" as view_id, "session1" as session_id, "2024-01-01" as view_start_date, "search_view" as page_name
  UNION ALL
  SELECT "u3" as unique_row_id, "view3" as view_id, "session1" as session_id, "2024-01-01" as view_start_date, "search_view" as page_name
"""
)

source: consumer_interactions_sql is bigquery.sql("""
  SELECT "a" as unique_row_id, "session1" as session_id, "2024-01-01" as interaction_date, "button1" as ui_label, "view1" as view_id
  UNION ALL
  SELECT "b" as unique_row_id, "session1" as session_id, "2024-01-01" as interaction_date, "call" as ui_label, "view1" as view_id
  UNION ALL
  SELECT "c" as unique_row_id, "session1" as session_id, "2024-01-01" as interaction_date, "live-chat-provider1" as ui_label, "view1" as view_id
"""
)

source: consumer_sessions_sql is bigquery.sql("""
  SELECT "a" as unique_row_id, "session1" as session_id, "2024-01-01 15:00:00" as session_end_time, "2024-01-01" as session_start_date, "user1" as signed_in_id
"""
)

source: consumer_views is consumer_views_sql extend  {
  primary_key:
    unique_row_id

  dimension:
    metric_date is view_start_date
    is_advert_view is ends_with(page_name, "advert_view")
    is_search_view is ends_with(page_name, "search_view")

  measure:
    view_count is count()
    search_views is view_count { where: is_search_view }
    advert_views is view_count { where: is_advert_view }
}

source: consumer_ui_interactions is consumer_interactions_sql extend {
  primary_key:
    unique_row_id

  dimension:
    metric_date is interaction_date
    is_chat_interaction is (ui_label ~ 'live-chat%')
    is_click_to_call_interaction is ui_label ? 'call button' |  'call'
  measure:
    interaction_count is count()
    view_count is count(view_id)

    chat_interactions is interaction_count { where: is_chat_interaction }
    click_to_call_interactions is interaction_count { where: is_click_to_call_interaction }

  join_one: consumer_views on view_id = consumer_views.view_id and interaction_date = consumer_views.view_start_date
}

source: consumer_sessions is consumer_sessions_sql extend {
  primary_key:
    unique_row_id
  dimension:
    logged_in is signed_in_id != null
    metric_date is session_start_date

  measure:
    session_count is count(session_id)
    max_session_end_time is max(session_end_time)

  join_many: consumer_views on session_id = consumer_views.session_id and session_start_date = consumer_views.view_start_date
  join_many: consumer_ui_interactions on session_id = consumer_ui_interactions.session_id and session_start_date = consumer_ui_interactions.interaction_date
}

-- This is very slow!

run: consumer_sessions -> {
  group_by: metric_date, session_id
  aggregate:
    max_session_end_time
    consumer_views.advert_views
    consumer_views.search_views
    consumer_ui_interactions.interaction_count
    consumer_ui_interactions.chat_interactions
    consumer_ui_interactions.click_to_call_interactions
}

-- Splitting out into 3 CTEs and joining on session_id/date is much faster
run: consumer_sessions -> {
  group_by: metric_date, session_id
  aggregate:
    max_session_end_time
}

run: consumer_views -> {
  group_by: metric_date, session_id
  aggregate:
    advert_views
    search_views
}

run: consumer_ui_interactions -> {
  group_by: metric_date, session_id
  aggregate:
    interaction_count
    chat_interactions
    click_to_call_interactions
}

The slow query produces this SQL:

SELECT 
   base.session_start_date as `metric_date`,
   base.session_id as `session_id`,
   max(base.session_end_time) as `max_session_end_time`,
   (COUNT(DISTINCT CASE WHEN (COALESCE(ENDS_WITH(consumer_views_0.page_name, 'advert_view'), false)) THEN consumer_views_0.unique_row_id END)) as `advert_views`,
   (COUNT(DISTINCT CASE WHEN (COALESCE(ENDS_WITH(consumer_views_0.page_name, 'search_view'), false)) THEN consumer_views_0.unique_row_id END)) as `search_views`,
   COUNT(DISTINCT consumer_ui_interactions_0.unique_row_id) as `interaction_count`,
   (COUNT(DISTINCT CASE WHEN (((consumer_ui_interactions_0.ui_label LIKE 'live-chat%'))) THEN consumer_ui_interactions_0.unique_row_id END)) as `chat_interactions`,
   (COUNT(DISTINCT CASE WHEN (consumer_ui_interactions_0.ui_label IN ('call button','call')) THEN consumer_ui_interactions_0.unique_row_id END)) as `click_to_call_interactions`
FROM (
  SELECT "a" as unique_row_id, "session1" as session_id, "2024-01-01 15:00:00" as session_end_time, "2024-01-01" as session_start_date, "user1" as signed_in_id
) as base
 LEFT JOIN (
  SELECT "u1" as unique_row_id, "view1" as view_id, "session1" as session_id, "2024-01-01" as view_start_date, "advert_view" as page_name
  UNION ALL
  SELECT "u2" as unique_row_id, "view2" as view_id, "session1" as session_id, "2024-01-01" as view_start_date, "search_view" as page_name
  UNION ALL
  SELECT "u3" as unique_row_id, "view3" as view_id, "session1" as session_id, "2024-01-01" as view_start_date, "search_view" as page_name
) AS consumer_views_0
  ON (base.session_id=consumer_views_0.session_id) and (base.session_start_date=consumer_views_0.view_start_date)
 LEFT JOIN (
  SELECT "a" as unique_row_id, "session1" as session_id, "2024-01-01" as interaction_date, "button1" as ui_label, "view1" as view_id
  UNION ALL
  SELECT "b" as unique_row_id, "session1" as session_id, "2024-01-01" as interaction_date, "call" as ui_label, "view1" as view_id
  UNION ALL
  SELECT "c" as unique_row_id, "session1" as session_id, "2024-01-01" as interaction_date, "live-chat-provider1" as ui_label, "view1" as view_id
) AS consumer_ui_interactions_0
  ON (base.session_id=consumer_ui_interactions_0.session_id) and (base.session_start_date=consumer_ui_interactions_0.interaction_date)
GROUP BY 1,2
ORDER BY 3 desc

I think this is happening because the generated SQL has to fan out the view metrics and session metrics into the UI interaction grain. There are many more UI interactions than views and many more views than sessions. So BQ is wasting a lot of time doing the COUNT(DISTINCT ...) on the view and session grain as it is fanned out to UI interactions.

OS:

MacOS

Malloy Client:

VS Code

Malloy Client Version:

v0.2.1731440058

Database Connection:

BigQuery