matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0
19.9k stars 2.65k forks source link

Use rollup queries in custom dimension reports #22647

Open mneudert opened 1 month ago

mneudert commented 1 month ago

Description

Alternative implementation of #22571 using WITH ROLLUP

To avoid running two queries for custom dimension reports, this PR explores the option of using rollup queries.

Fixture changes and updated result expectations have been copied from #22571. The failing entry in the test_reportLimitingdimension_2_rankingQuery__CustomDimensions.getCustomDimension_day.xml fixture stems from a new Others row being added by the implementation of this PR. Needs to be verified if the changes are actually correct or a bad side effect.

As WITH ROLLUP uses NULL for the summary rows, we need to prevent those values changing the results. If we cannot COALESCE those values we need find a way to update the counter expressions to handle them.

Followup tasks

  1. Verify the query results are actually correct.
  2. Add more tests and further improve test data.
  3. Replace the subquery sorting with a sorted rollup if the database engine supports it. The current way of SELECT FROM ( SELECT WITH ROLLUP ) ORDER BY should be compatible with all supported database engines. Using SELECT WITH ROLLUP ORDER BY without a subquery could remove the added filesort from the rollup queries, improving the overall performance.
  4. Performance test against real data to ensure the performance does not degrade too much. This should not only look at the SQL query performance, but also the results being created. In the example queries each distinct custom_dimension_1 value will create a separate rollup row, in addition to the ones already present. This could lead to an increased data table size we should be aware of before committing to a full rollup implementation.

Local testing

Compared the current queries with the new implementation, both WITH and WITHOUT the COALESCE addition, as my local dataset contained several entries with idaction_url = NULL that changed the results.

Current state

SQL Query EXPLAIN Query Result Data

ROLLUP (with duplicate NULLs)

SQL Query EXPLAIN Query Result Data

The duplicate url = NULL rows are creating some problems with the data table. They are messing with the ranking counter and are not merged into the __mtm_ranking_query_others__ row.

The query is only listed to highlight the problems of potential NULL values in a grouping column.

ROLLUP with COALESCE

SQL Query EXPLAIN Query Result Data

Review

github-actions[bot] commented 3 weeks ago

If you don't want this PR to be closed automatically in 28 days then you need to assign the label 'Do not close'.