I recently ran into an issue with some tables I had partitioned by string_lastchars with a constraint of 3. I attempted to join several tables together like:
SELECT *
FROM tbl_1
LEFT OUTER JOIN tbl_2 ON tbl_2.id = tbl_1.id AND tbl_2.partitioned_column = tbl_1.partitioned_column
WHERE tbl_1.partitioned_column = 'long_string_xyz'
I was expecting a query plan that did a sequential scan on only tbl_1_xyz and tbl_2_xyz and used a hash join to combine them.
However, what I found was that it performed a sequential scan on every partition (tbl_1_abc, tbl_1_def, ... tbl_2_xyz) and used a nested loop join, which meant each join went through millions of loops instead of a hash lookup per row.
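To make the pruning problem concrete, here is a rough sketch of how a last-3-chars partition key maps a value to its child table. The helper names are hypothetical; the tbl_<base>_<suffix> naming convention is the one seen above:

```python
def partition_suffix(value: str, size: int = 3) -> str:
    """Lowercased last `size` characters of the key, i.e. the partition suffix."""
    return value[-size:].lower()

def partition_table(base: str, value: str, size: int = 3) -> str:
    """Name of the child partition a row with this key value lands in."""
    return f"{base}_{partition_suffix(value, size)}"

print(partition_table("tbl_1", "long_string_xyz"))  # tbl_1_xyz
```

The planner only knows `partitioned_column = 'long_string_xyz'`; it does not know that the CHECK constraints on the children are expressed in terms of this suffix, so it cannot exclude the other children on its own.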
I fixed the partition-pruning issue by changing the query to the following (the expressions come from the SQL in partition.py):
SELECT *
FROM tbl_1
LEFT OUTER JOIN tbl_2 ON tbl_2.id = tbl_1.id AND tbl_2.partitioned_column = tbl_1.partitioned_column
WHERE tbl_1.partitioned_column = 'long_string_xyz'
AND lower(substring(tbl_1.partitioned_column, '.{3}$')) = 'xyz'
AND lower(substring(tbl_2.partitioned_column, '.{3}$')) = 'xyz'
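Rather than typing these predicates by hand, they can be generated from the key value. A minimal sketch, assuming the suffix is extracted with a regex substring matching the child tables' CHECK constraints (the function name is illustrative):

```python
def pruning_predicate(column: str, value: str, size: int = 3) -> str:
    """Build a predicate matching the CHECK constraint on the child partition,
    so the planner can exclude all the other children."""
    suffix = value[-size:].lower()
    # '.{3}$' is a POSIX regex for the last three characters of the string.
    return f"lower(substring({column}, '.{{{size}}}$')) = '{suffix}'"

print(pruning_predicate("tbl_1.partitioned_column", "long_string_xyz"))
```

The generated string can then be appended to the WHERE clause for each table that participates in the join.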
And I fixed the nested-loop issue by adding an index on lower(substring(partitioned_column, '.{3}$')) to each table and partition and running ANALYZE. My theory is that Postgres was using statistics for the partitioned column from the parent table, so if 10% of all records were in tbl_1_xyz, Postgres would estimate that the following query returns only 10% of the rows it actually returns:
SELECT COUNT(*)
FROM tbl_1_xyz
WHERE lower(substring(partitioned_column, '.{3}$')) = 'xyz'
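The expression indexes described above can also be scripted across the parent and its partitions. A hedged sketch, with an illustrative index-naming scheme (not from partition.py):

```python
def index_statement(table: str, column: str = "partitioned_column",
                    size: int = 3) -> str:
    """CREATE INDEX on the same expression used in the pruning predicate.
    The index name here is illustrative, not a fixed convention."""
    return (
        f"CREATE INDEX {table}_{column}_suffix_idx "
        f"ON {table} (lower(substring({column}, '.{{{size}}}$')))"
    )

# Emit one statement per parent table and per partition.
for tbl in ["tbl_1", "tbl_1_xyz", "tbl_2", "tbl_2_xyz"]:
    print(index_statement(tbl) + ";")
```

Running ANALYZE afterwards gives the planner per-partition statistics for the indexed expression, which is what corrects the row-count estimate.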
I don't have an automated solution for this at the moment, other than "be careful and check the query plan for any large queries you may have", so for now I decided to just add a note to the documentation about what I thought was non-obvious behavior.