maxtepkeev / architect

A set of tools which enhances ORMs written in Python with more features
Other
391 stars 57 forks source link

Running a filter hits all tables, instead of just one #56

Open teddyward opened 5 years ago

teddyward commented 5 years ago

Hi! Thank you for making this library. I am really happy that someone has done this. I am having trouble getting it to work as-expected, though (maybe related to #34 ?)

I have annotated a model as follows. This works rather nicely to shuttle incoming data into different tables:

@architect.install('partition', type='range', subtype='string_firstchars',
                   constraint='2', column='state')
class Parcel(models.Model):

I then query it as follows:

qs = Parcel.objects.filter(state='15', boundary__intersects=polygon)

I would expect this query to only try to hit the table distribution_parcel_15, but when I didn't see any performance improvement post-adding-partitions, I ran explain() on the above, and saw that every child table was being hit.

        ->  Index Scan using distribution_parcel_34_boundary_idx on distribution_parcel_34  (cost=0.41..8.68 rows=1 width=1089)
              Index Cond: (boundary && '0106000020E61000000100000001030000000100000005000000969EB6EA0DB763C060D33019BE503540F9408D560EB763C021FADE3CC45035402FD0AC070EB763C0E82FABB7FA503540406BAFC50DB763C0E625D2CAF8503540969EB6EA0DB763C060D33019BE503540'::geometry)
              Filter: (((state)::text = '15'::text) AND _st_intersects(boundary, '0106000020E61000000100000001030000000100000005000000969EB6EA0DB763C060D33019BE503540F9408D560EB763C021FADE3CC45035402FD0AC070EB763C0E82FABB7FA503540406BAFC50DB763C0E625D2CAF8503540969EB6EA0DB763C060D33019BE503540'::geometry))
        ->  Index Scan using distribution_parcel_12_boundary_idx on distribution_parcel_12  (cost=0.41..8.68 rows=1 width=1805)
              Index Cond: (boundary && '0106000020E61000000100000001030000000100000005000000969EB6EA0DB763C060D33019BE503540F9408D560EB763C021FADE3CC45035402FD0AC070EB763C0E82FABB7FA503540406BAFC50DB763C0E625D2CAF8503540969EB6EA0DB763C060D33019BE503540'::geometry)
              Filter: (((state)::text = '15'::text) AND _st_intersects(boundary, '0106000020E61000000100000001030000000100000005000000969EB6EA0DB763C060D33019BE503540F9408D560EB763C021FADE3CC45035402FD0AC070EB763C0E82FABB7FA503540406BAFC50DB763C0E625D2CAF8503540969EB6EA0DB763C060D33019BE503540'::geometry))
...

and EXPLAIN SELECT COUNT(*) FROM distribution_parcel WHERE state='15'; gives:

                     ->  Parallel Index Only Scan using state_34 on distribution_parcel_34  (cost=0.43..4.44 rows=1 width=0)
                           Index Cond: (state = '15'::text)
                     ->  Parallel Index Only Scan using state_12 on distribution_parcel_12  (cost=0.43..4.44 rows=1 width=0)
                           Index Cond: (state = '15'::text)
...

Am I fundamentally misunderstanding partitions, or what?

emord commented 5 years ago

@teddyward I ran into a similar issue recently and updated the docs around this here: https://architect.readthedocs.io/features/partition/postgresql.html#performance

https://github.com/maxtepkeev/architect/pull/55 should provide some more information, but basically a simple WHERE state = 'ab' won't be able to use the check constraint because the check constraints are constructed using a postgres function to get a substring. The short term fix for this is to replicate that function in your query.f

(note I'm not the project's maintainer) For the long term, I'd like to add a feature to architect that would allow for unique partitioning that would create simple check constraints such as state = 'ab', but have not had a chance and don't foresee doing that in the near future.

teddyward commented 5 years ago

Huh, thank you!!! This tip probably just reduced the runtime of my full script by about 29 days (99% or so). I was able to achieve that through Django by adding the following to my filter clause:

.extra(where=["lower(substr(state, 1, 2)) = '15'"])

~I couldn't find a cleaner way to do it, as the __icontains and __istartswith clauses produce slightly different queries.~ Edit: the substr class will probably do this more cleanly. But the above clause gives the exact explain that I wanted and a MUCH faster query!

        ->  Index Scan using distribution_parcel_15_boundary_idx on distribution_parcel_15  (cost=0.28..8.56 rows=1 width=1567)
              Index Cond: (boundary && '0106000020E61000000100000001030000000100000005000000969EB6EA0DB763C060D33019BE503540F9408D560EB763C021FADE3CC45035402FD0AC070EB763C0E82FABB7FA503540406BAFC50DB763C0E625D2CAF8503540969EB6EA0DB763C060D33019BE503540'::geometry)
              Filter: ((lower(substr((state)::text, 1, 2)) = '15'::text) AND _st_intersects(boundary, '0106000020E61000000100000001030000000100000005000000969EB6EA0DB763C060D33019BE503540F9408D560EB763C021FADE3CC45035402FD0AC070EB763C0E82FABB7FA503540406BAFC50DB763C0E625D2CAF8503540969EB6EA0DB763C060D33019BE503540'::geometry))