rewrite the JOIN statements in such a way that the query planner is able to apply the filter predicates earlier when generating the cross-product, which greatly improves performance for queries where both the subject and object are common terms.
note: there were previously two distinct .sql files on disk, these have now been combined, so I've removed the redundant file and renamed the other.
the MATCH syntax of the FTS5 virtual table requires quoting, hence "${subject}"
The purpose of this query is to take two 'tokens' which are referred to as the 'subject' and the 'object', in LTR languages these represent the left & right tokens, eg: subject:'New York', object:'USA'.
ignoring the fulltext table for now, we take the tokens table, look up all occurrences of 'New York' and again all occurrences of 'USA', this generates a Cross Product (all occurrences of 'New York' matched with all occurrences of 'USA'), which can be a very large resultset.
there are three conditions applied to restrict which of these subject+object pairs are valid, namely an rtree check (to ensure the two have overlapping bounds (x,y dimensions)), a z-dimension check on the rtree which ensures that the subject has a lower rank than the object (rank is a numerical equivalent to WOF placetype) and a language filter which says that both tokens need to be either the same language or one or the other is either English or Undefined.
the fulltext table is then joined to tokens using the rowid column, this is always a 1:1 mapping, using the index generated by the FTS5 module is significantly faster, particularly for prefix matching.
I assumed that the query planner would optimise the previous version of this query and this version to the same query internally, however performance testing shows this doesn't seem to be the case.
this PR applies two query optimisations:
fulltext
virtual index instead of thetoken
index for all string matching (as per https://github.com/pelias/placeholder/pull/210)JOIN
statements in such a way that the query planner is able to apply the filter predicates earlier when generating the cross-product, which greatly improves performance for queries where both thesubject
andobject
are common terms.related https://github.com/pelias/placeholder/pull/210
implementation notes for future reference:
MATCH
syntax of theFTS5
virtual table requires quoting, hence"${subject}"
The purpose of this query is to take two 'tokens' which are referred to as the 'subject' and the 'object', in LTR languages these represent the left & right tokens, eg: subject:'New York', object:'USA'.
ignoring the
fulltext
table for now, we take thetokens
table, look up all occurrences of 'New York' and again all occurrences of 'USA', this generates a Cross Product (all occurrences of 'New York' matched with all occurrences of 'USA'), which can be a very large resultset.there are three conditions applied to restrict which of these subject+object pairs are valid, namely an
rtree
check (to ensure the two have overlapping bounds (x,y dimensions)), a z-dimension check on the rtree which ensures that the subject has a lower rank than the object (rank is a numerical equivalent to WOF placetype) and a language filter which says that both tokens need to be either the same language or one or the other is either English or Undefined.the
fulltext
table is then joined totokens
using therowid
column, this is always a1:1
mapping, using the index generated by the FTS5 module is significantly faster, particularly for prefix matching.I assumed that the query planner would optimise the previous version of this query and this version to the same query internally, however performance testing shows this doesn't seem to be the case.
hat tip to @naderman for his help at the pub 🥂