simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0
9.47k stars 677 forks source link

Research using CTEs for faster facet counts #1259

Open simonw opened 3 years ago

simonw commented 3 years ago

https://www.sqlite.org/changes.html#version_3_35_0

Add support for the MATERIALIZED and NOT MATERIALIZED hints when specifying common table expressions. The default behavior was formerly NOT MATERIALIZED, but is now changed to MATERIALIZED for CTEs that are used more than once.

If a CTE creates a table that is used multiple time in that query, SQLite will now default to creating a materialized table for the duration of that query.

This could be a big performance boost when applying faceting multiple times against the same query. Consider this example query:

WITH data as (
  select
    *
  from
    [global-power-plants]
),
country_long as (select 
  'country_long' as col, country_long as value, count(*) as c from data group by country_long
  order by c desc limit 10
),
primary_fuel as (
select
  'primary_fuel' as col, primary_fuel as value, count(*) as c from data group by primary_fuel
  order by c desc limit 10
)
select * from primary_fuel union select * from country_long order by col, c desc

https://global-power-plants.datasettes.com/global-power-plants?sql=WITH+data+as+%28%0D%0A++select%0D%0A++++*%0D%0A++from%0D%0A++++%5Bglobal-power-plants%5D%0D%0A%29%2C%0D%0Acountry_long+as+%28select+%0D%0A++%27country_long%27+as+col%2C+country_long+as+value%2C+count%28*%29+as+c+from+data+group+by+country_long%0D%0A++order+by+c+desc+limit+10%0D%0A%29%2C%0D%0Aprimary_fuel+as+%28%0D%0Aselect%0D%0A++%27primary_fuel%27+as+col%2C+primary_fuel+as+value%2C+count%28*%29+as+c+from+data+group+by+primary_fuel%0D%0A++order+by+c+desc+limit+10%0D%0A%29%0D%0Aselect+*+from+primary_fuel+union+select+*+from+country_long+order+by+col%2C+c+desc

Outputs:

col value c
country_long United States of America 8688
country_long China 4235
country_long United Kingdom 2603
country_long Brazil 2360
country_long France 2155
country_long India 1590
country_long Germany 1309
country_long Canada 1159
country_long Spain 829
country_long Russia 545
primary_fuel Solar 9662
primary_fuel Hydro 7155
primary_fuel Wind 5188
primary_fuel Gas 3922
primary_fuel Coal 2390
primary_fuel Oil 2290
primary_fuel Biomass 1396
primary_fuel Waste 1087
primary_fuel Nuclear 198
primary_fuel Geothermal 189
simonw commented 3 years ago

https://sqlite.org/lang_with.html

Prior to SQLite 3.35.0, all CTEs where treated as if the NOT MATERIALIZED phrase was present

It looks like this optimization is completely unavailable on SQLite prior to 3.35.0 (released 12th March 2021). But I could still rewrite the faceting to work in this way, using the exact same SQL - it would just be significantly faster on 3.35.0+ (assuming it's actually faster in practice - would need to benchmark).

simonw commented 3 years ago

OK, a better comparison:

https://global-power-plants.datasettes.com/global-power-plants?sql=WITH+data+as+%28%0D%0A++select%0D%0A++++*%0D%0A++from%0D%0A++++%5Bglobal-power-plants%5D%0D%0A%29%2C%0D%0Acountry_long+as+%28select+%0D%0A++%27country_long%27+as+col%2C+country_long+as+value%2C+count%28*%29+as+c+from+data+group+by+country_long%0D%0A++order+by+c+desc+limit+31%0D%0A%29%2C%0D%0Aprimary_fuel+as+%28%0D%0Aselect%0D%0A++%27primary_fuel%27+as+col%2C+primary_fuel+as+value%2C+count%28*%29+as+c+from+data+group+by+primary_fuel%0D%0A++order+by+c+desc+limit+31%0D%0A%29%2C%0D%0Aowner+as+%28%0D%0Aselect%0D%0A++%27owner%27+as+col%2C+owner+as+value%2C+count%28*%29+as+c+from+data+group+by+owner%0D%0A++order+by+c+desc+limit+31%0D%0A%29%0D%0Aselect+*+from+primary_fuel+union+select+*+from+country_long%0D%0Aunion+select+*+from+owner+order+by+col%2C+c+desc calculates facets against three columns. It takes 78.5ms (and 34.5ms when I refreshed it, presumably after warming some SQLite caches of some sort).

https://global-power-plants.datasettes.com/global-power-plants/global-power-plants?_facet=country_long&_facet=primary_fuel&_trace=1&_size=0 shows those facets with size=0 on the SQL query - and shows a SQL trace at the bottom of the page.

The country_long facet query takes 45.36ms, owner takes 38.45ms, primary_fuel takes 49.04ms - so a total of 132.85ms

That's against https://global-power-plants.datasettes.com/-/versions says SQLite 3.27.3 - so even on a SQLite version that doesn't materialize the CTEs there's a significant performance boost to doing all three facets in a single CTE query.

simonw commented 3 years ago

I wonder if I could optimize facet suggestion in the same way?

One challenge: the query time limit will apply to the full CTE query, not to the individual columns.

simonw commented 3 years ago

If all of the facets were being calculated in a single query, I'd be willing to bump the facet time limit up to something a lot higher, maybe even a full second. There's a chance that could work amazingly well with a materialized CTE.

simonw commented 3 years ago

CTEs were added in 2014-02-03 SQLite 3.8.3 - so I think it's OK to depend on them for Datasette.