Open gkunter opened 9 years ago
Original comment by gkunter (Bitbucket: gkunter, GitHub: gkunter):
This issue could be solved by implementing Issue #34 for corpus source columns. The counts could be stored in a collections.Counter object, with a tuple (containing the values of the source columns) as keys. This would avoid the slow calculation of cross-tables with COUNT(*) after the corpus has been compiled.
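A minimal sketch of that idea, assuming the corpus builder sees each token together with the values of its source columns. The feature names (genre, year), the token dictionaries and the function count_tokens_by_source are hypothetical and only serve as illustration; they are not part of the existing code base.

```python
from collections import Counter

# Hypothetical list of source features known to the corpus builder.
SOURCE_FEATURES = ("genre", "year")


def count_tokens_by_source(tokens, source_features=SOURCE_FEATURES):
    """Count tokens for every combination of source-feature values."""
    counts = Counter()
    for token in tokens:
        # The key is a tuple of the source-column values of this token.
        key = tuple(token[feature] for feature in source_features)
        counts[key] += 1
    return counts


# Toy token stream standing in for the data seen during corpus compilation.
tokens = [
    {"word": "the", "genre": "fiction", "year": 1990},
    {"word": "cat", "genre": "fiction", "year": 1990},
    {"word": "tax", "genre": "news", "year": 1990},
]
size_by_source = count_tokens_by_source(tokens)
```

Because the counter is updated while the tokens are being processed anyway, no additional pass over the compiled corpus is needed.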
All that is needed, then, is for the corpus builder to have access to a list of all source features (i.e. non-word features, possibly excluding time features).
Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)
ISSUE: Determining the size of sub-corpora (e.g. a sub-corpus that contains only sources from one genre) can be very slow for big corpora because it requires a COUNT(*) query. This is a problem if we want to express relative frequencies (words per million).
SOLUTION: During corpus creation, produce a data table that stores the corpus size for all combinations of source features. This table can be used as a lookup instead of a SQL query.
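The lookup could work roughly as follows. This is only a sketch: the table contents, the feature names and the helper functions subcorpus_size and per_million are assumptions made for illustration, not existing API.

```python
# Hypothetical size table produced during corpus creation:
# keys are tuples of source-feature values, values are token counts.
FEATURE_NAMES = ("genre", "year")
size_table = {
    ("fiction", 1990): 2_000_000,
    ("fiction", 2000): 3_000_000,
    ("news", 1990): 5_000_000,
}


def subcorpus_size(table, feature_names, **filters):
    """Sum the sizes of all feature combinations that match the filters."""
    positions = {name: i for i, name in enumerate(feature_names)}
    return sum(
        size
        for key, size in table.items()
        if all(key[positions[name]] == value for name, value in filters.items())
    )


def per_million(frequency, size):
    """Express an absolute frequency as words per million."""
    return frequency * 1_000_000 / size


# Size of the sub-corpus that contains only fiction sources.
fiction_size = subcorpus_size(size_table, FEATURE_NAMES, genre="fiction")
# Relative frequency of a word that occurs 250 times in that sub-corpus.
fiction_pmw = per_million(250, fiction_size)
```

Summing over the matching keys means that a filter on a single feature (here genre) still works even though the table is keyed by all source features.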