gkunter / coquery

Coquery is a free corpus query tool for linguists, lexicographers, translators, and anybody who wishes to search and analyse a text corpus.
GNU General Public License v3.0
18 stars 4 forks source link

Queries that start with wildcards are slow #285

Open gkunter opened 7 years ago

gkunter commented 7 years ago

Due to the way that indexes work in MySQL (and probably also SQLite), querying for suffixes such as *ing can be unexpectedly slow, as this usually results in a full table scan of the lexicon table (for details, see the answers and comments to this StackExchange question).

There's no easy fix. One way of making suffix queries acceptable could be to add additional columns in the lexicon table that stores the reversed words and lemmas. Then, the query string generator would check whether the query item starts with a wildcard, and if so would match the reversed query string with the columns containing the reversed values.

Perhaps this could be done as an indexed generated column in MySQL:

ALTER TABLE Lexicon ADD COLUMN Word_rev VARCHAR(33) GENERATED ALWAYS AS (REVERSE(Word));
CREATE INDEX Word_rev ON Lexicon(Word_rev);

However, this increases the lexicon table, but more importantly, it doesn't fix issues with query strings such as *in*, i.e. infix queries. The common solution to this problem appears to be based on a trigram table. This approach appears to create a table of trigrams, and then store the keys to the trigrams in the table that contains the strings to be queried. A description of this approach can be found in this blog post; also see this SE question for a related approach.

The problem is that storing fragments will cause quite an increase in required disk space; especially with COCA where the Lexicon table is so huge already.