max-niederman / ttyper

Terminal-based typing test.
MIT License
1.05k stars, 76 forks

feat(languages): Add english-ngrams #109

Closed: heysokam closed this 7 months ago

heysokam commented 7 months ago

Based on the app and wordlist from: https://github.com/ranelpadon/ngram-type

heysokam commented 7 months ago

They are still n-grams, not n-chars, even if they are using characters as their symbols. The term N-gram for this language set is technically more correct than calling a language dataset a "unigram where each symbol is a word". N-grams whose symbols are words are the rare case, not the other way around.

I think the name is intuitive: it shows up on Google for the person who doesn't know what they are, and Wikipedia itself gives the right description for the concept (and even explains the context of unigrams where symbols are words). So I would say the more intuitive and pre-existing meaning should be kept.

max-niederman commented 7 months ago

> They are still n-grams, not n-chars, even if they are using characters as their symbols. The term N-gram for this language set is technically more correct than calling a language dataset a "unigram where each symbol is a word".

This is not true; neither is more technically correct, because "n-gram" is a very broad term and applies to both. That's why I'm hesitant to use "n-gram" as the distinguishing name for only one of them.
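To make the point about breadth concrete, here is a minimal Python sketch showing the same n-gram routine applied to characters and to words. The `ngrams` helper is hypothetical and written only for illustration; it is not taken from ttyper or ngram-type.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def ngrams(symbols: Sequence[T], n: int) -> list[tuple[T, ...]]:
    """Return every run of n consecutive symbols (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the sequence; too-short input yields [].
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


# Characters as symbols: bigrams inside a single word.
char_bigrams = ngrams("the", 2)

# Words as symbols: bigrams across a short phrase.
word_bigrams = ngrams("the quick fox".split(), 2)
```

The function never inspects what a symbol is, so whether the result counts as "character n-grams" or "word n-grams" depends only on how the input was split.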

> I think the name is intuitive: it shows up on Google for the person who doesn't know what they are, and Wikipedia itself gives the right description for the concept (and even explains the context of unigrams where symbols are words). So I would say the more intuitive and pre-existing meaning should be kept.

This is a valid point, though. "N-gram" is more searchable, at the very least because of ngram-type. I'm going to go ahead and merge this, although in v2 I think this'll need to be replaced by n-gram generation, which is already planned.
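The thread does not say how the planned n-gram generation would work. One possible reading, sketched below in Python purely as an assumption, is to derive the most frequent character n-grams from an existing word list instead of shipping a fixed list. The name `top_char_ngrams` and the sample words are hypothetical.

```python
from collections import Counter


def top_char_ngrams(words: list[str], n: int, k: int) -> list[str]:
    """Return the k most frequent character n-grams found inside the words.

    Hypothetical helper; an assumption about what "n-gram generation"
    could mean, not ttyper's actual design.
    """
    counts: Counter[str] = Counter()
    for word in words:
        # Count every window of n characters within this word.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(k)]


sample_words = ["the", "then", "there", "other"]
practice_bigrams = top_char_ngrams(sample_words, 2, 2)
```

Generating the list this way would let any word list double as an n-gram source, at the cost of the results depending on the corpus that was supplied.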