gkunter / coquery

Coquery is a free corpus query tool for linguists, lexicographers, translators, and anybody who wishes to search and analyse a text corpus.
GNU General Public License v3.0
18 stars 4 forks source link

Queries involving N-gram lookup tables should be optimized #286

Open gkunter opened 7 years ago

gkunter commented 7 years ago

Currently, a query such as * more happy is likely to trigger a full table scan of an N-gram lookup table. This is so because the corpus builder creates a multiple column index that includes all word columns in a left-to-right order. This means that if the left-most query item is not specified, the multiple column index can't be used.

There are several ways this behavior could be optimized. One way could be to redesign the query strings so that it's guaranteed that the first column of the N-gram lookup table is always specified, and can therefore always be queried by an index. In the example above, this would result in a query string that might look like this:

SELECT L1.Word, L2.Word, L3.Word, C1.ID
FROM CorpusNgram
INNER JOIN Corpus AS C1 ON C1.ID = CorpusNgram.ID1 - 1
INNER JOIN Lexicon AS L1 ON L1.WordId = C1.WordId
INNER JOIN Lexicon AS L2 ON L2.WordId = CorpusNgram.WordId1
INNER JOIN Lexicon AS L3 ON L3.WordId = CorpusNgram.WordId2
WHERE (L1.Word = "more") AND (L2.Word = "happy")

Another optimization strategy might involve the addition of more multiple column indexes, e.g. on (WordId2, WordId3, WordId4, WordId5), (WordId3, WordId4, WordId5), and (WordId4, WordId5). However, this approach would probably result in very large databases, especially if all combinations should be covered.