n1k0 / wordlem

A simplistic port of the popular Wordle game in Elm.

https://n1k0.github.io/wordlem

Do What The F*ck You Want To Public License

17 stars 4 forks

answers vs. playable words #8

Closed phrawzty closed 2 years ago

phrawzty commented 2 years ago

Right now there's one set of words that serves as both the answers and the playable words. What if we split those sets up? One list for answers (i.e. common), and another for playable (i.e. valid) words that can be guessed. That way the player can guess all sorts of wild words (and we can be as wanton as we like with that list), but the answers will be from a more curated set.

Thoughts, @n1k0?

n1k0 commented 2 years ago

Sure thing, I'm working on it btw. I'll rely on usage frequencies and found these two good resources so far:

Peter Norvig's compilation of the 330k most frequent English words https://norvig.com/ngrams/count_1w.txt
OpenLexicon http://www.lexique.org/shiny/openlexicon/

Both resources provide statistics I can leverage to determine if a word is common or not (provided we find a good threshold). I'll open a PR as soon as code is ready to be testable.

phrawzty commented 2 years ago

Extremely interested in how you're calculating commonness and determining potential thresholds for this purpose. Would be willing to collab on this.

n1k0 commented 2 years ago

The idea would be, for each word in the list, to compute a ratio against the min and max frequency scores of the whole list. Then, in the webapp, we could ask for e.g. "the first 1500 n-letter words ranked by frequency DESC" as guessable words at a given length. I'm using the 1500 figure because IIRC that's what Wordle uses in their code.
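The ranking idea described above could look roughly like the following sketch. It is not code from the wordlem repository: the function names `normalized_scores` and `top_words` are hypothetical, the "ratio" is assumed to be plain min-max scaling, and the frequency counts are made up for illustration.

```python
from typing import Dict, List


def normalized_scores(freqs: Dict[str, float]) -> Dict[str, float]:
    """Min-max scale raw frequencies to [0, 1] (assumed meaning of "ratio")."""
    lo, hi = min(freqs.values()), max(freqs.values())
    span = hi - lo
    if span == 0:
        # Every word has the same frequency, so treat them all as equally common.
        return {word: 1.0 for word in freqs}
    return {word: (f - lo) / span for word, f in freqs.items()}


def top_words(freqs: Dict[str, float], length: int, limit: int = 1500) -> List[str]:
    """Return up to `limit` words of the given length, most frequent first."""
    candidates = [w for w in freqs if len(w) == length]
    # Sort by descending frequency; break ties alphabetically for stable output.
    candidates.sort(key=lambda w: (-freqs[w], w))
    return candidates[:limit]


# Made-up counts, for illustration only.
sample = {"about": 900.0, "crane": 120.0, "zesty": 10.0, "the": 5000.0, "aalii": 1.0}
scores = normalized_scores(sample)
five_letter = top_words(sample, length=5, limit=3)
```

Since min-max scaling preserves order, ranking by raw frequency and ranking by the scaled ratio give the same list; the ratio mainly helps if a fixed threshold is preferred over a fixed count.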

Now, back to Norvig's list of most common English words used in Google searches, well… as you may suspect, it's crap, including common typos, city names, first names, pop star names, insults and so on. I won't be able to use that list for English, I'm afraid.

n1k0 commented 2 years ago

So back to square one. The goal now, at least for English (I think I've got French covered using OpenLexicon), is to find an equivalent of OpenLexicon for English, or at least a large resource of English words along with their usage frequency metrics.

phrawzty commented 2 years ago

The word lists that I generated are a good "common" set to use for answers, imho. What we need are the words that are valid guesses, which could be a lot larger.

Why not use the original word list (the one that we replaced) as the "valid words" and the word list I generated as the "common words"? That seems like a reasonable first step / proof of concept. Can refine once the framework is in place if necessary, right?
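As a rough illustration of the two-list proposal above, here is a minimal sketch. The names `is_valid_guess` and `pick_answer` and the tiny word lists are hypothetical and do not come from the wordlem code base; the only point is that answers are drawn from the small curated set while guesses are checked against the larger one.

```python
import random
from typing import FrozenSet, Sequence


def is_valid_guess(guess: str, playable: FrozenSet[str]) -> bool:
    """A guess is accepted if it appears in the large "valid words" set."""
    return guess.lower() in playable


def pick_answer(answers: Sequence[str], rng: random.Random) -> str:
    """The secret word is only ever drawn from the curated "common words" list."""
    return rng.choice(answers)


# Tiny made-up lists, for illustration only.
answers = ["about", "crane", "house"]
# Every answer must also be guessable, so the playable set includes the answers.
playable = frozenset(answers) | frozenset({"aalii", "zesty"})

secret = pick_answer(answers, random.Random(8))
```

Keeping the answers as a subset of the playable set avoids the situation where the secret word itself would be rejected as a guess.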

n1k0 commented 2 years ago

Yeah that's more or less the plan, but I'd rather deal with a single list instead of two, and also find a generic building process for both English & French. Let me hack a few more hours on this concept until we decide to go the "easy" way.

n1k0 commented 2 years ago

@phrawzty I came up with something in #11. From my early tests, it does exactly what we want. Tell me what you think if you have some time to clone and run the branch.

n1k0 commented 2 years ago

I've pushed #11 to production. Tell me what you think of the new frequency-sorted dictionaries. Feel free to reopen or file a new issue in case it can be improved :)