skedwards88 / word_lists

Lists of words divided by common vs uncommon words

Word invalid #6

Open daveberzack opened 1 year ago

daveberzack commented 1 year ago

Crossjig version 1.0.27

Ales is a valid word. It's the plural of ale.

skedwards88 commented 1 year ago

Thanks for this issue!

Right now, Crossjig generates puzzles from the common word lists here and only accepts solutions that consist of those "common" words. I playtested accepting all words as solutions, but the result was unsatisfying because there are a lot of letter combinations that most people do not recognize as words.

I'm working on a "somewhat common words" list for words like "ales" that are recognizable as words (so we should accept them in a user's solution) but that might not be recognizable to many users (so we shouldn't use them to generate the puzzle).

jorendorff commented 1 year ago

This issue belongs in the word_lists repo, I guess.

It's a really interesting question how to get ALES included. I agree it's 100% a word. But in the plural it's quite rare, which makes it hard to formulate an evidence-based rule that includes ALES but excludes non-words.

To illustrate the problem: in this data set based on OpenSubtitles.org, the words that are exactly as frequent as ALES include ZAMA, CAUSAL, JAMAIS, BLEEDERS, SUBPRIME, MARYANNE, and DINARD.

So I think this will require some data source that reports on something other than how frequently a word is actually used. Maybe: treat a known-common word like ALE as evidence in favor of its derived forms that are, say, listed in Wiktionary.

skedwards88 commented 1 year ago

Based on data pulled from https://github.com/IlyaSemenov/wikipedia-word-frequency, which lists word frequencies from ~2.6 million Wikipedia entries (that number comes from notes I took a year ago, and I'm not sure where I got it), "ales" appears 1702 times. Other words listed with that frequency are eo, pediments, dukla, crandall, frontrunner, trobe, persephone, pentax, rialto, kleiner, zlín. "ale" is listed 7537 times, similar to predation, ford's, incapable, hua, isotope, expansions, patty, nectar, condensed. Unfortunately, the list doesn't distinguish between "ales" as a noun and "ales" as a name.
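For anyone who wants to repeat this kind of lookup, here is a minimal sketch. It assumes the frequency data is plain text with one `word count` pair per line, which may not match the actual files in that repo. The function names `load_frequencies` and `words_with_same_count` are made up for illustration, and the sample only reuses counts quoted in this comment.

```python
def load_frequencies(lines):
    """Parse 'word count' lines into a dict (assumed format: one pair per line)."""
    freqs = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        freqs[parts[0]] = int(parts[1])
    return freqs


def words_with_same_count(freqs, word):
    """Return the other words whose count equals that of `word`, sorted."""
    target = freqs.get(word)
    if target is None:
        return []
    return sorted(w for w, c in freqs.items() if c == target and w != word)


# Tiny illustrative sample, not the real data file.
sample = ["ales 1702", "pediments 1702", "rialto 1702", "ale 7537", "nectar 7537"]
freqs = load_frequencies(sample)
print(words_with_same_count(freqs, "ales"))
```

With a real file, `load_frequencies(open(path, encoding="utf-8"))` should work the same way, as long as the one-pair-per-line assumption holds.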

"Maybe: treat a known-common word like ALE as evidence in favor of its derived forms that are, say, listed in Wiktionary."

Are you suggesting we add an "S" to all words in the "common" list and, if the resulting word is an actual word, we include it in the "common" list?
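To make the question concrete, a rough sketch of that approach might look like the following. It assumes two in-memory word collections: the "common" list and some broader reference list of real words (for example, one derived from Wiktionary). The name `plural_candidates` and the sample words are hypothetical, and only the bare "add S" rule is covered.

```python
def plural_candidates(common_words, reference_words):
    """Return base+'s' forms found in the reference list but not already common."""
    common = {w.lower() for w in common_words}
    reference = {w.lower() for w in reference_words}
    return sorted(
        w + "s"
        for w in common
        if w + "s" in reference and w + "s" not in common
    )


# Illustrative lists only; the real lists are much larger.
common = ["ale", "seldom", "gross", "wow"]
reference = ["ale", "ales", "wow", "wows", "gross", "grosses", "seldom"]
print(plural_candidates(common, reference))  # ['ales', 'wows']
```

Note that "grosses" is missed here because it needs "es" rather than "s", so a real version would need more suffix rules.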

jorendorff commented 1 year ago

That's not what I meant. I was hoping there'd be some fairly authoritative source to tell us what the plurals are.

But what you suggested isn't crazy. I've even considered writing code to detect whether a word is a plural (or a third-person singular verb), so that those forms, along with -ing forms, occur less often in puzzles.

I think the rules are:

I wonder if this would result in any false positives: rare words that we don't want to include, but that happen to be spelled exactly the same as what that algorithm produces when applied to a word in the common words list. One way this could happen is when a common adjective or adverb, say seldom, happens to match some incredibly rare noun, so that SELDOMS ends up being included as a word. I'm not saying there is such a word, but it's possible, and that would be a little unfortunate. I'm OK with including some marginal words like WOWS, HOORAYS, and GROSSES (gross is rare as a noun or verb but very common as an adjective).
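The detection idea mentioned above (spotting plurals, singular verbs, and -ing forms by stripping a suffix and checking for a common base) could be sketched roughly like this. It is a simplified assumption, not the rule list from this thread: it only tries the suffixes "ing", "es", and "s", and ignores spelling changes such as a dropped "e" or "y" becoming "ies". The name `base_of_inflection` is hypothetical.

```python
def base_of_inflection(word, common_words):
    """Return the common base if `word` looks like base+ing, base+es, or base+s; else None."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            if base in common_words:
                return base
    return None


# Illustrative set only; a real run would use the full common word list.
common = {"ale", "gross", "walk", "seldom"}
for w in ["ales", "grosses", "walking", "seldoms", "glass"]:
    print(w, "->", base_of_inflection(w, common))
```

Because a purely spelling-based check like this would also accept SELDOMS, its output is probably best treated as a list of candidates to review rather than as a final answer.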