Closed Joxit closed 4 years ago
Wow, good times, thanks for pointing this out. @missinglink and I confirmed this morning that it is indeed quite slow.
Further investigation clearly required :)
For debugging, it's possible to edit lib/Queries.js
to set DEBUG = true
.
You can then run node cmd/cli.js "Largo Santa Maria Stella dell'Evangelizzazione, 6, 00144 Roma RM, Italy"
to get a list of all the SQL queries and their timing.
There are loads of queries being made, most of them are super fast but this one stood out to me:
The purpose of this query is to match twice on the
tokens
table and then ensure thatt1
is a child oft2
in the hierarchy usinglineage
as a pivot table.
SELECT t1.id AS subjectId, t2.id as objectId
FROM lineage AS l1
JOIN tokens AS t1 ON t1.id = l1.id
JOIN tokens AS t2 ON t2.id = l1.pid
WHERE t1.token = 'santa maria'
AND t2.token = 'roma'
AND (
t1.lang = t2.lang OR
t1.lang IN ( 'eng', 'und' ) OR
t2.lang IN ( 'eng', 'und' )
)
-- AND t1.tag NOT IN ( 'colloquial' )
-- AND t2.tag NOT IN ( 'colloquial' )
GROUP BY t1.id, t2.id
ORDER BY t1.id ASC, t2.id ASC
LIMIT '100'
took 1295ms
Using EXPLAIN QUERY PLAN
I it's possible to get information about how the table was searched:
sqlite> EXPLAIN QUERY PLAN
...> SELECT t1.id AS subjectId, t2.id as objectId
...> FROM lineage AS l1
...> JOIN tokens AS t1 ON t1.id = l1.id
...> JOIN tokens AS t2 ON t2.id = l1.pid
...> WHERE t1.token = 'santa maria'
...> AND t2.token = 'roma'
...> AND (
...> t1.lang = t2.lang OR
...> t1.lang IN ( 'eng', 'und' ) OR
...> t2.lang IN ( 'eng', 'und' )
...> )
...> -- AND t1.tag NOT IN ( 'colloquial' )
...> -- AND t2.tag NOT IN ( 'colloquial' )
...> GROUP BY t1.id, t2.id
...> ORDER BY t1.id ASC, t2.id ASC
...> LIMIT '100';
QUERY PLAN
|--SEARCH TABLE tokens AS t1 USING INDEX tokens_token_idx (token=?)
|--SEARCH TABLE tokens AS t2 USING INDEX tokens_token_idx (token=?)
|--SEARCH TABLE lineage AS l1 USING COVERING INDEX lineage_cover_idx (id=? AND pid=?)
`--USE TEMP B-TREE FOR GROUP BY
This shows that its doing what I hoped, which is using indices for all parts of the query.
So.. I don't know, needs more investigation, both those terms are fairly common, so I think that's compounding the issue:
sqlite> SELECT COUNT(*) FROM tokens WHERE token = 'santa maria';
1417
sqlite> SELECT COUNT(*) FROM tokens WHERE token = 'roma';
1586
This is much faster:
sqlite> SELECT l.id AS subjectId, l.pid as objectId
...> FROM lineage AS l
...> WHERE l.id IN (
...> SELECT id
...> FROM tokens
...> WHERE token = 'santa maria'
...> )
...> AND l.pid IN (
...> SELECT id
...> FROM tokens
...> WHERE token = 'roma'
...> );
1276758697|85685461
Run Time: real 0.087 user 0.080826 sys 0.006169
This is equivalent to the original query, it's twice as fast, but still slow IMO:
sqlite> WITH l AS (
...> SELECT *
...> FROM lineage AS l
...> WHERE l.id IN (
...> SELECT id
...> FROM tokens
...> WHERE token = 'santa maria'
...> )
...> AND l.pid IN (
...> SELECT id
...> FROM tokens
...> WHERE token = 'roma'
...> )
...> )
...> SELECT t1.id AS subjectId, t2.id as objectId
...> FROM l
...> JOIN tokens AS t1 ON t1.id = l.id
...> JOIN tokens AS t2 ON t2.id = l.pid
...> WHERE (
...> t1.lang = t2.lang OR
...> t1.lang IN ( 'eng', 'und' ) OR
...> t2.lang IN ( 'eng', 'und' )
...> )
...> -- AND t1.tag NOT IN ( 'colloquial' )
...> -- AND t2.tag NOT IN ( 'colloquial' )
...> GROUP BY t1.id, t2.id
...> ORDER BY t1.id ASC, t2.id ASC
...> LIMIT 100;
1276758697|85685461
Run Time: real 0.724 user 0.713129 sys 0.008378
diff --git a/query/match_subject_object.sql b/query/match_subject_object.sql
index d60f8b9..6612b23 100644
--- a/query/match_subject_object.sql
+++ b/query/match_subject_object.sql
@@ -1,10 +1,22 @@
+WITH l AS (
+ SELECT *
+ FROM lineage AS l
+ WHERE l.id IN (
+ SELECT id
+ FROM tokens
+ WHERE token = $subject
+ )
+ AND l.pid IN (
+ SELECT id
+ FROM tokens
+ WHERE token = $object
+ )
+)
SELECT t1.id AS subjectId, t2.id as objectId
-FROM lineage AS l1
- JOIN tokens AS t1 ON t1.id = l1.id
- JOIN tokens AS t2 ON t2.id = l1.pid
-WHERE t1.token = $subject
-AND t2.token = $object
-AND (
+FROM l
+JOIN tokens AS t1 ON t1.id = l.id
+JOIN tokens AS t2 ON t2.id = l.pid
+WHERE (
t1.lang = t2.lang OR
t1.lang IN ( 'eng', 'und' ) OR
t2.lang IN ( 'eng', 'und' )
I'm going to have to have a think about this some more, I don't fully understand why it's slow, the above method makes the query return in 1.89s
on my laptop instead of 3.03s
, but I don't see why it should be that slow 🤷
haha! got it:
sqlite> WITH l AS (
...> SELECT *
...> FROM lineage AS l
...> WHERE l.id IN (
...> SELECT id
...> FROM tokens
...> WHERE token = 'santa maria'
...> )
...> AND l.pid IN (
...> SELECT id
...> FROM tokens
...> WHERE token = 'roma'
...> )
...> )
...> SELECT l.id AS subjectId, l.pid as objectId
...> FROM l
...> JOIN tokens AS t1 ON t1.id = l.id
...> JOIN tokens AS t2 ON t2.id = l.pid
...> AND (
...> t1.lang = t2.lang OR
...> t1.lang IN ( 'eng', 'und' ) OR
...> t2.lang IN ( 'eng', 'und' )
...> )
...> -- AND t1.tag NOT IN ( 'colloquial' )
...> -- AND t2.tag NOT IN ( 'colloquial' )
...> GROUP BY l.id, l.pid
...> ORDER BY l.id ASC, l.pid ASC
...> LIMIT 100;
1276758697|85685461
Run Time: real 0.085 user 0.079535 sys 0.005683
opening PR...
Describe the bug I found a terrible long query (thanks to one of my customers...). The query took +5 seconds to response on both my own instance and the Geocode.earth demo instance. This issue is a memo for further investigation (maybe next week).
Steps to Reproduce The entry is
Largo Santa Maria Stella dell'Evangelizzazione, 6, 00144 Roma RM, Italy
Expected behavior
I don't know, response <1s ? At least check why this happens.
Additional context This was tested on world placeholder instance only.