pharo-ai / NgramModel

Ngram language model implemented in Pharo
MIT License

Vocabulary needs to also be shortened in #removeNgramsWithCountsLessThan: #21

Open myroslavarm opened 4 years ago

myroslavarm commented 4 years ago

In this method we are deleting ngrams and reducing history counts; I think the vocabulary needs to be cleaned up too (for instance, when a word's history count becomes zero).

The main idea of this method is to get rid of tokens, and sequences of tokens, that we consider irrelevant, in order to speed up reading the model from file and lookup within it. Always keeping every vocabulary entry defeats that purpose.
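One way the combined pruning could look is sketched below. This is only a sketch, not the actual implementation: the instance variable names (`ngramCounts`, `historyCounts`, `vocabulary`), the `rejectWithOccurrences:` selector, and the assumption that an ngram responds to `includes:` are all guesses about the model's internals.

```smalltalk
removeNgramsWithCountsLessThan: aNumber
	"Sketch only. Prune rare ngrams, reduce history counts with the same
	threshold, then drop vocabulary words that no longer occur in any
	remaining ngram. All names below are assumed, not confirmed."
	ngramCounts := ngramCounts rejectWithOccurrences: [ :ngram :count |
		count < aNumber ].
	historyCounts := historyCounts rejectWithOccurrences: [ :history :count |
		count < aNumber ].
	vocabulary := vocabulary select: [ :word |
		ngramCounts anySatisfy: [ :ngram | ngram includes: word ] ]
```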

myroslavarm commented 4 years ago

I am still not sure about a good solution for reducing the vocabulary, but I think the history counts need to be reduced as well, perhaps using something like this:

historyCounts := historyCounts rejectWithOccurrences: [ :each :count |
        count < aNumber ]

Because technically, if we are reducing the ngrams using the same threshold, then the words we are "throwing out" of ngramCounts will have the same or even higher occurrence counts in historyCounts, and should therefore be safe to delete. Tell me what you think, @olekscode.
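For reference, `rejectWithOccurrences:` is not a stock Pharo selector, so it would have to be added (here assuming historyCounts is a `Bag`; if it is a plain `Dictionary`, a `reject:` over its associations would do instead). A minimal sketch of such an extension:

```smalltalk
Bag >> rejectWithOccurrences: aBlock
	"Answer a copy of the receiver without the elements for which aBlock,
	evaluated with the element and its occurrence count, answers true.
	Hypothetical helper, not part of stock Pharo."
	| result |
	result := self species new.
	self valuesAndCounts keysAndValuesDo: [ :each :count |
		(aBlock value: each value: count)
			ifFalse: [ result add: each withOccurrences: count ] ].
	^ result
```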