pharo-ai / NgramModel

Ngram language model implemented in Pharo
MIT License

Vocabulary needs to also be shortened in #removeNgramsWithCountsLessThan: #21

Open myroslavarm opened 4 years ago

myroslavarm commented 4 years ago

In this method we are deleting ngrams and reducing history counts; I think the vocabulary needs to be cleaned up too (for instance, when a word's history count becomes zero).

The main idea of this method is to get rid of tokens, and sequences of tokens, that we consider irrelevant, in order to speed up reading the model from file and lookup within it. Always keeping every vocabulary entry defeats that purpose.
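One way the combined pruning could look is sketched below. This is only a sketch, not the actual implementation: the instance variable names (`ngramCounts`, `historyCounts`, `vocabulary`), the `rejectWithOccurrences:` selector, and the assumption that an ngram responds to `includes:` are all guesses about the model's internals.

```smalltalk
removeNgramsWithCountsLessThan: aNumber
	"Sketch only. Prune rare ngrams, reduce history counts with the same
	threshold, then drop vocabulary words that no longer occur in any
	remaining ngram. All names below are assumed, not confirmed."
	ngramCounts := ngramCounts rejectWithOccurrences: [ :ngram :count |
		count < aNumber ].
	historyCounts := historyCounts rejectWithOccurrences: [ :history :count |
		count < aNumber ].
	vocabulary := vocabulary select: [ :word |
		ngramCounts anySatisfy: [ :ngram | ngram includes: word ] ]
```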

myroslavarm commented 4 years ago

I am still not sure about a good solution for reducing the vocabulary, but I think the history counts need to be reduced as well, perhaps using something like this:

historyCounts := historyCounts rejectWithOccurrences: [ :each :count |
        count < aNumber ]

Because technically, if we are reducing the ngrams using the same threshold, then the words we are "throwing out" of ngramCounts will have the same or even higher occurrence counts in historyCounts, and should therefore be safe to delete. Tell me what you think, @olekscode.
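For reference, `rejectWithOccurrences:` is not a stock Pharo selector, so it would have to be added (here assuming historyCounts is a `Bag`; if it is a plain `Dictionary`, a `reject:` over its associations would do instead). A minimal sketch of such an extension:

```smalltalk
Bag >> rejectWithOccurrences: aBlock
	"Answer a copy of the receiver without the elements for which aBlock,
	evaluated with the element and its occurrence count, answers true.
	Hypothetical helper, not part of stock Pharo."
	| result |
	result := self species new.
	self valuesAndCounts keysAndValuesDo: [ :each :count |
		(aBlock value: each value: count)
			ifFalse: [ result add: each withOccurrences: count ] ].
	^ result
```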