hackforla / 311-data

Empowering Neighborhood Associations to improve the analysis of their initiatives using 311 data
https://hackforla.github.io/311-data/
GNU General Public License v3.0
62 stars 63 forks source link

Improve IN Performance #499

Closed sellnat77 closed 4 years ago

sellnat77 commented 4 years ago

Description

Using in in where clauses will cause latency issues with large tables, review this stackoverflow for legitimacy and implement or investigate other means of improvement

Action Items

Resources

https://stackoverflow.com/a/17824797

sellnat77 commented 4 years ago

This ticket is heavily ivestigative, i put in my hunch for the ticket description but it very well could be something else, feel free to explore all avenues

jmensch1 commented 4 years ago

So I did some investigation of alternatives to the IN query. Bullet points:

IN query (a baseline)

So starting with the basics, I did an IN query for a one-year period using the fully populated database (2015 through 2020). Starting with 1 NC, and then working my way up to 100. I ran each query seven times (in pure SQL, no ORM), and took the average of the 7 runs. These are the results:

IN_query

As you can see, the query was reasonably fast up to 20 NCs, but became super-slow after that.

Using JOIN instead of IN

IN queries can be rewritten to use join-on-values syntax. For example, a basic IN:

SELECT requesttype, srnumber, latitude, longitude
FROM ingest_staging_table
WHERE nc IN (4, 5, 6)

can be rewritten like this:

SELECT requesttype, srnumber, latitude, longitude
FROM ingest_staging_table
INNER JOIN (
VALUES (4),(5),(6)) vals(v)
ON (nc = v)

I ran the same test as above except using JOIN queries. Unfortunately, this was also pretty slow when run against the full database.

JOIN_query

So the JOIN alternative started getting slow with only 10 ncs. Comparing the two, you can see that the IN query is faster until you get to 30NCs. After that, JOIN is faster, but not by a lot.

IN_vs_JOIN

Anti-Join

The stackoverflow post above mentioned an anti-join query. Those queries are basically NOT IN queries expressed as JOINs. So an anti-join looks like this:

SELECT requesttype, srnumber, latitude, longitude 
FROM ingest_staging_table
LEFT OUTER JOIN 
(VALUES (4),(5),(6)) vals(v)
ON (nc = v)
WHERE v IS NULL

This query would return the rows where the nc column is NOT in [4, 5, 6].

This type of query could be useful in theory. For example if the user submitted 95 NCs, we could run an anti-join using the remaining 5 NCs. Unfortunately, the anti-join ran slower than the IN and the JOIN:

Anti-Join query

All of these tests were run on my MacBook using pure SQL queries. The code is here.