powa-team / pg_qualstats

A PostgreSQL extension for collecting statistics about predicates, helping find what indices are missing

Add Extended statistics advisor (PoC) #33

Open yamatattsu opened 3 years ago

yamatattsu commented 3 years ago

Hi Julien!

I created a PoC patch for the Extended statistics advisor, as I mentioned in my PGCon 2020 work. :-D The test case and results are below:

CREATE TABLE t (a INT, b INT);
INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
ANALYZE t;

EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
 Planning Time: 0.122 ms
 Execution Time: 5.226 ms
(5 rows)

SELECT v
FROM json_array_elements(
     pg_qualstats_index_advisor(min_filter => 50)->'extstats') v
ORDER BY v::text COLLATE "C";

                          v
-----------------------------------------------------
 "CREATE STATISTICS t_b_a_ext ON b, a FROM public.t"
(1 row)

CREATE STATISTICS t_b_a_ext ON b, a FROM public.t;
CREATE STATISTICS

ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;

                                  QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
 Planning Time: 0.182 ms
 Execution Time: 4.430 ms
(5 rows)
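
For reference, the contents of the new statistics object can be inspected through the pg_stats_ext view (available on PostgreSQL 12 or later), for example:

-- Optional check: show what ANALYZE collected for t_b_a_ext
SELECT statistics_name, attnames, kinds, dependencies
FROM pg_stats_ext
WHERE statistics_name = 't_b_a_ext';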

Thanks, Tatsuro Yamada

rjuju commented 3 years ago

Thanks a lot for working on that @yamatattsu !

First, it should be done as part of a new version, probably 2.1.0, as it's adding a new feature rather than a bugfix. I'll go ahead and commit the preliminary work for that, so you don't have to deal with it in this patchset.

Then, regarding the feature itself: I think it'd be better to create a new function, as the two advisors will probably end up with a lot of differences. The most obvious one is the thresholds and filters: if you reuse the same function, you'll discard quals that are used as part of an index scan, which is probably a bad idea here. Also, we should have some threshold based on the selectivity estimation error.
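
Just to make that concrete, I'm thinking of something along these lines (the function name and parameters below are purely hypothetical, nothing like that exists yet):

-- Hypothetical new function, separate from pg_qualstats_index_advisor(),
-- with thresholds based on the estimation error rather than on filtered rows.
SELECT pg_qualstats_extstats_advisor(
    min_err_estimate_ratio => 10,    -- estimate off by a 10x factor or more
    min_err_estimate_num   => 1000   -- or off by at least 1000 rows
);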

If needed, we could of course pull out parts of the existing index advisor into helper functions / views so we don't have to copy/paste too much.

The way I imagined that feature is to work on all compound predicates that have at least 2 simple predicates and a large enough selectivity estimation error (ratio and/or raw number). Then, maybe optionally, try to validate that those errors are really due to correlated columns. This could be done by checking that those simple predicates, when used alone, have good selectivity estimates. Of course, there will always be cases where compound predicates use simple predicates that are never used alone, so we can't make a hard requirement on those being available for that validation.
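
As a very rough sketch of that heuristic (placeholder thresholds, and assuming we rely on the error-estimate columns already exposed by the pg_qualstats view), something like:

-- Find ANDed predicates (same qualid) touching at least two distinct columns
-- of the same relation, whose selectivity estimates are badly off.
-- Unlike the index advisor, quals already used in index scans are kept.
-- The 10x ratio / 1000 rows thresholds are placeholders.
SELECT qualid,
       lrelid::regclass AS relation,
       array_agg(DISTINCT lattnum) AS columns,
       max(mean_err_estimate_ratio) AS worst_err_ratio,
       max(mean_err_estimate_num)   AS worst_err_num
FROM pg_qualstats
WHERE qualid IS NOT NULL
GROUP BY qualid, lrelid
HAVING count(DISTINCT lattnum) >= 2
   AND (max(mean_err_estimate_ratio) >= 10
        OR max(mean_err_estimate_num) >= 1000);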