gkunter / coquery

Coquery is a free corpus query tool for linguists, lexicographers, translators, and anybody who wishes to search and analyse a text corpus.
GNU General Public License v3.0

Pre-calculated corpus size lists required #20

Open gkunter opened 9 years ago

gkunter commented 9 years ago

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE: Determining the size of a sub-corpus (e.g. a sub-corpus that contains only sources from one genre) can be very slow for large corpora, because each lookup runs a COUNT(*) query over the corpus. This is a problem if we want to express relative frequencies (e.g. words per million).

SOLUTION: During corpus creation, produce a data table that stores the corpus size for all combinations of source features. This table can be used as a lookup instead of a SQL query.
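A minimal sketch of what such a lookup table might look like, using SQLite for illustration. The table names `corpus` and `corpus_size`, the column `n_tokens`, the example source columns `genre` and `mode`, and both helper functions are assumptions made for this example, not part of Coquery's actual schema or API.

```python
import sqlite3

def build_size_table(conn, source_columns,
                     corpus_table="corpus", size_table="corpus_size"):
    """Run the expensive COUNT(*) once, at corpus creation time.

    The result has one row per combination of source feature values.
    Table and column names are trusted here (hypothetical builder input).
    """
    cols = ", ".join(source_columns)
    conn.execute(f"DROP TABLE IF EXISTS {size_table}")
    conn.execute(
        f"CREATE TABLE {size_table} AS "
        f"SELECT {cols}, COUNT(*) AS n_tokens "
        f"FROM {corpus_table} GROUP BY {cols}"
    )

def subcorpus_size(conn, filters, size_table="corpus_size"):
    """Sum the pre-calculated counts that match the given feature values."""
    sql = f"SELECT COALESCE(SUM(n_tokens), 0) FROM {size_table}"
    if filters:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    return conn.execute(sql, tuple(filters.values())).fetchone()[0]

# Toy corpus: one row per token, with two source features.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (word TEXT, genre TEXT, mode TEXT)")
conn.executemany("INSERT INTO corpus VALUES (?, ?, ?)", [
    ("the", "news", "written"),
    ("cat", "news", "written"),
    ("sat", "fiction", "written"),
    ("um", "fiction", "spoken"),
    ("well", "fiction", "spoken"),
])
build_size_table(conn, ["genre", "mode"])

fiction_size = subcorpus_size(conn, {"genre": "fiction"})
```

Queries at run time then touch only the small `corpus_size` table, which has one row per feature combination rather than one row per token.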


gkunter commented 9 years ago

Original comment by gkunter (Bitbucket: gkunter, GitHub: gkunter):


This issue could be solved by implementing Issue #34 for corpus source columns. The counts could be stored in a collections.Counter object, keyed by tuples containing the values of the source columns. This would avoid the slow calculation of cross-tables with COUNT(*) after the corpus has been compiled.
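A rough sketch of the Counter idea, assuming the corpus builder sees each token together with its source feature values while the corpus is compiled. The column names, the token dictionaries, and the helpers `count_sources`, `lookup_size` and `per_million` are hypothetical and only illustrate the data structure.

```python
from collections import Counter

SOURCE_COLUMNS = ("genre", "mode")  # hypothetical source features

def count_sources(tokens, source_columns=SOURCE_COLUMNS):
    """Count tokens per combination of source feature values."""
    sizes = Counter()
    for token in tokens:
        # The key is a tuple of the source column values for this token.
        sizes[tuple(token[col] for col in source_columns)] += 1
    return sizes

def lookup_size(sizes, source_columns=SOURCE_COLUMNS, **filters):
    """Sum the stored counts whose key matches all given feature values."""
    index = {col: i for i, col in enumerate(source_columns)}
    return sum(
        n for key, n in sizes.items()
        if all(key[index[col]] == value for col, value in filters.items())
    )

def per_million(hits, size):
    """Relative frequency, normalised to one million words."""
    return hits / size * 1_000_000

tokens = [
    {"word": "the", "genre": "news", "mode": "written"},
    {"word": "cat", "genre": "news", "mode": "written"},
    {"word": "sat", "genre": "fiction", "mode": "written"},
    {"word": "um", "genre": "fiction", "mode": "spoken"},
    {"word": "well", "genre": "fiction", "mode": "spoken"},
]
sizes = count_sources(tokens)
fiction_size = lookup_size(sizes, genre="fiction")
```

Because the counter is filled during compilation, no COUNT(*) query is needed afterwards; the counter could also be written out as the lookup table once the corpus is complete.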

What is needed, then, is simply that a list of all features that are source features (i.e. non-word features, possibly excluding time features) is available to the corpus builder.