High response times on some queries

Joxit commented 4 years ago

Describe the bug I found a terrible long query (thanks to one of my customers...). The query took +5 seconds to response on both my own instance and the Geocode.earth demo instance. This issue is a memo for further investigation (maybe next week).

Steps to Reproduce The entry is Largo Santa Maria Stella dell'Evangelizzazione, 6, 00144 Roma RM, Italy

time curl 'http://127.0.0.1:3000/parser/search?text=Largo%20Santa%20Maria%20Stella%20dell%27Evangelizzazione%2C%206%2C%2000144%20Roma%20RM%2C%20Italy&lang=eng'

Expected behavior

I don't know, response <1s ? At least check why this happens.

Additional context This was tested on world placeholder instance only.

orangejulius commented 4 years ago

Wow, good times, thanks for pointing this out. @missinglink and I confirmed this morning that it is indeed quite slow.

Further investigation clearly required :)

missinglink commented 4 years ago

For debugging, it's possible to edit lib/Queries.js to set DEBUG = true.

You can then run node cmd/cli.js "Largo Santa Maria Stella dell'Evangelizzazione, 6, 00144 Roma RM, Italy" to get a list of all the SQL queries and their timing.

There are loads of queries being made, most of them are super fast but this one stood out to me:

The purpose of this query is to match twice on the tokens table and then ensure that t1 is a child of t2 in the hierarchy using lineage as a pivot table.

SELECT t1.id AS subjectId, t2.id as objectId
FROM lineage AS l1
  JOIN tokens AS t1 ON t1.id = l1.id
  JOIN tokens AS t2 ON t2.id = l1.pid
WHERE t1.token = 'santa maria'
AND t2.token = 'roma'
AND (
  t1.lang = t2.lang OR
  t1.lang IN ( 'eng', 'und' ) OR
  t2.lang IN ( 'eng', 'und' )
)
-- AND t1.tag NOT IN ( 'colloquial' )
-- AND t2.tag NOT IN ( 'colloquial' )
GROUP BY t1.id, t2.id
ORDER BY t1.id ASC, t2.id ASC
LIMIT '100'

took 1295ms

Using EXPLAIN QUERY PLAN I it's possible to get information about how the table was searched:

sqlite> EXPLAIN QUERY PLAN
   ...> SELECT t1.id AS subjectId, t2.id as objectId
   ...> FROM lineage AS l1
   ...>   JOIN tokens AS t1 ON t1.id = l1.id
   ...>   JOIN tokens AS t2 ON t2.id = l1.pid
   ...> WHERE t1.token = 'santa maria'
   ...> AND t2.token = 'roma'
   ...> AND (
   ...>   t1.lang = t2.lang OR
   ...>   t1.lang IN ( 'eng', 'und' ) OR
   ...>   t2.lang IN ( 'eng', 'und' )
   ...> )
   ...> -- AND t1.tag NOT IN ( 'colloquial' )
   ...> -- AND t2.tag NOT IN ( 'colloquial' )
   ...> GROUP BY t1.id, t2.id
   ...> ORDER BY t1.id ASC, t2.id ASC
   ...> LIMIT '100';

QUERY PLAN
|--SEARCH TABLE tokens AS t1 USING INDEX tokens_token_idx (token=?)
|--SEARCH TABLE tokens AS t2 USING INDEX tokens_token_idx (token=?)
|--SEARCH TABLE lineage AS l1 USING COVERING INDEX lineage_cover_idx (id=? AND pid=?)
`--USE TEMP B-TREE FOR GROUP BY

This shows that its doing what I hoped, which is using indices for all parts of the query.

So.. I don't know, needs more investigation, both those terms are fairly common, so I think that's compounding the issue:

sqlite> SELECT COUNT(*) FROM tokens WHERE token = 'santa maria';
1417

sqlite> SELECT COUNT(*) FROM tokens WHERE token = 'roma';
1586

missinglink commented 4 years ago

This is much faster:

sqlite> SELECT l.id AS subjectId, l.pid as objectId
   ...> FROM lineage AS l
   ...> WHERE l.id IN (
   ...>   SELECT id
   ...>   FROM tokens
   ...>   WHERE token = 'santa maria'
   ...> )
   ...> AND l.pid IN (
   ...>   SELECT id
   ...>   FROM tokens
   ...>   WHERE token = 'roma'
   ...> );
1276758697|85685461

Run Time: real 0.087 user 0.080826 sys 0.006169

This is equivalent to the original query, it's twice as fast, but still slow IMO:

sqlite> WITH l AS (
   ...>   SELECT *
   ...>   FROM lineage AS l
   ...>   WHERE l.id IN (
   ...>     SELECT id
   ...>     FROM tokens
   ...>     WHERE token = 'santa maria'
   ...>   )
   ...>   AND l.pid IN (
   ...>     SELECT id
   ...>     FROM tokens
   ...>     WHERE token = 'roma'
   ...>   )
   ...> )
   ...> SELECT t1.id AS subjectId, t2.id as objectId
   ...> FROM l
   ...> JOIN tokens AS t1 ON t1.id = l.id
   ...> JOIN tokens AS t2 ON t2.id = l.pid
   ...> WHERE (
   ...>   t1.lang = t2.lang OR
   ...>   t1.lang IN ( 'eng', 'und' ) OR
   ...>   t2.lang IN ( 'eng', 'und' )
   ...> )
   ...> -- AND t1.tag NOT IN ( 'colloquial' )
   ...> -- AND t2.tag NOT IN ( 'colloquial' )
   ...> GROUP BY t1.id, t2.id
   ...> ORDER BY t1.id ASC, t2.id ASC
   ...> LIMIT 100;
1276758697|85685461

Run Time: real 0.724 user 0.713129 sys 0.008378

missinglink commented 4 years ago

diff --git a/query/match_subject_object.sql b/query/match_subject_object.sql
index d60f8b9..6612b23 100644
--- a/query/match_subject_object.sql
+++ b/query/match_subject_object.sql
@@ -1,10 +1,22 @@
+WITH l AS (
+  SELECT *
+  FROM lineage AS l
+  WHERE l.id IN (
+    SELECT id
+    FROM tokens
+    WHERE token = $subject
+  )
+  AND l.pid IN (
+    SELECT id
+    FROM tokens
+    WHERE token = $object
+  )
+)
 SELECT t1.id AS subjectId, t2.id as objectId
-FROM lineage AS l1
-  JOIN tokens AS t1 ON t1.id = l1.id
-  JOIN tokens AS t2 ON t2.id = l1.pid
-WHERE t1.token = $subject
-AND t2.token = $object
-AND (
+FROM l
+JOIN tokens AS t1 ON t1.id = l.id
+JOIN tokens AS t2 ON t2.id = l.pid
+WHERE (
   t1.lang = t2.lang OR
   t1.lang IN ( 'eng', 'und' ) OR
   t2.lang IN ( 'eng', 'und' )

missinglink commented 4 years ago

I'm going to have to have a think about this some more, I don't fully understand why it's slow, the above method makes the query return in 1.89s on my laptop instead of 3.03s, but I don't see why it should be that slow 🤷

missinglink commented 4 years ago

haha! got it:

sqlite> WITH l AS (
   ...>   SELECT *
   ...>   FROM lineage AS l
   ...>   WHERE l.id IN (
   ...>     SELECT id
   ...>     FROM tokens
   ...>     WHERE token = 'santa maria'
   ...>   )
   ...>   AND l.pid IN (
   ...>     SELECT id
   ...>     FROM tokens
   ...>     WHERE token = 'roma'
   ...>   )
   ...> )
   ...> SELECT l.id AS subjectId, l.pid as objectId
   ...> FROM l
   ...>   JOIN tokens AS t1 ON t1.id = l.id
   ...>   JOIN tokens AS t2 ON t2.id = l.pid
   ...> AND (
   ...>   t1.lang = t2.lang OR
   ...>   t1.lang IN ( 'eng', 'und' ) OR
   ...>   t2.lang IN ( 'eng', 'und' )
   ...> )
   ...> -- AND t1.tag NOT IN ( 'colloquial' )
   ...> -- AND t2.tag NOT IN ( 'colloquial' )
   ...> GROUP BY l.id, l.pid
   ...> ORDER BY l.id ASC, l.pid ASC
   ...> LIMIT 100;
1276758697|85685461

Run Time: real 0.085 user 0.079535 sys 0.005683

opening PR...

pelias / placeholder

High response times on some queries #193