pelias / placeholder

stand-alone coarse geocoder
https://placeholder.demo.geocode.earth
MIT License
313 stars 35 forks source link

rewrite query for performance #213

Closed missinglink closed 2 years ago

missinglink commented 2 years ago

this PR applies two query optimisations:

note: there were previously two distinct .sql files on disk, these have now been combined, so I've removed the redundant file and renamed the other.

related https://github.com/pelias/placeholder/pull/210

implementation notes for future reference:

The purpose of this query is to take two 'tokens' which are referred to as the 'subject' and the 'object', in LTR languages these represent the left & right tokens, eg: subject:'New York', object:'USA'.

ignoring the fulltext table for now, we take the tokens table, look up all occurrences of 'New York' and again all occurrences of 'USA', this generates a Cross Product (all occurrences of 'New York' matched with all occurrences of 'USA'), which can be a very large resultset.

there are three conditions applied to restrict which of these subject+object pairs are valid, namely an rtree check (to ensure the two have overlapping bounds (x,y dimensions)), a z-dimension check on the rtree which ensures that the subject has a lower rank than the object (rank is a numerical equivalent to WOF placetype) and a language filter which says that both tokens need to be either the same language or one or the other is either English or Undefined.

the fulltext table is then joined to tokens using the rowid column, this is always a 1:1 mapping, using the index generated by the FTS5 module is significantly faster, particularly for prefix matching.

I assumed that the query planner would optimise the previous version of this query and this version to the same query internally, however performance testing shows this doesn't seem to be the case.

hat tip to @naderman for his help at the pub 🥂