retextjs / retext-spell

plugin to check spelling
https://unifiedjs.com
MIT License

Unexpected capital letters returned for certain capitalized misspellings #20

Closed. tvquizphd closed this issue 3 years ago.

tvquizphd commented 3 years ago

TLDR: see PR 38 and PR 39 that I've opened against nspell.

Subject of the issue

Note: I've changed the name of this issue from "Mysterious capital E returned for misspelled 5-letter nouns with a single capital T" to "Unexpected capital letters returned for certain capitalized misspellings," and I've edited this post slightly to reflect the broader scope.

Background

I had originally found this error for capitalized variants of 16 dictionary-en words: "tepee", "thane", "thole", "three", "throe", "tilde", "tinge", "tonne", "toque", "tribe", "trike", "trope", "trove", "truce", "tuque", and "twine". To give one notable example, any misspelling matching this RegEx /^Thre[f-ln-racuvxyz]$/ is corrected to "ThreE" instead of "Three."
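To make the scope of that pattern concrete, here is a small sketch that lists every string it matches. Only the regular expression comes from this issue; the helper name `matchingMisspellings` is made up for illustration.

```javascript
// Pattern quoted in this issue: misspellings of "Three" that trigger the bug.
const pattern = /^Thre[f-ln-racuvxyz]$/

// Hypothetical helper: try every lowercase letter as the final character
// and keep the candidates that the pattern accepts.
function matchingMisspellings(stem, regex) {
  const letters = 'abcdefghijklmnopqrstuvwxyz'.split('')
  return letters
    .map((letter) => stem + letter)
    .filter((word) => regex.test(word))
}

const affected = matchingMisspellings('Thre', pattern)
console.log(affected.length, affected.join(' '))
```

The character class leaves out b, d, e, m, s, t, and w, so 19 of the 26 possible final letters are covered.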

Edit: The algorithm below produces misspellings of the original 16 dictionary words, but I have since found 190 additional 5-letter words that occasionally occur in retext-spell vfile messages with extraneous capital letters. I have saved these new words, and the list of misspellings needed to generate them, in a JSON file bundled with the gist for this issue.

Generating examples

The gist to reproduce this issue tests misspellings generated as such:

If the misspellings do not match a different dictionary word more closely than the originally selected 5-letter word, then the first "expected" value in the vfile message emitted by retext-spell will be the originally selected 5-letter word with final "e" mistakenly capitalized as "E".

Edit: Without getting into the details of nspell's keyboard groups, there is no easy way to generate the 190 newly discovered 5-letter words that do not match the misspellings generated with the above method.

Your environment

Steps to reproduce

I've created a gist.

Execute the following commands to download the gist and install dependencies:

git clone https://gist.github.com/a4e2ff11cd868a5b40a65b3c53c8574a.git
cd a4e2ff11cd868a5b40a65b3c53c8574a
npm install

Run one of the following commands to test with various suffixes:

Side note: In contrast to the examples that reproduce the bug described in this issue, you can run npm run test "t", npm run test "ts", and npm run test "t's" to see the results of misspellings that do not trigger the bug because they contain a lowercase "t".

Expected behavior

All the logged vfile message reasons should show suggested values without unusual capitalization. The hundreds of misspellings tested with npm run test "*", npm run test "*s", and npm run test "*'s" should generate suggested values with lowercase "e" characters. For example, the first tested misspelling Tepea should generate a top suggested value of "Tepee". The plural Tepeas should generate a top suggested value of "Tepees". The possessive Tepea's should generate a top suggested value of "Tepee's".

Actual behavior

The hundreds of misspellings tested with npm run test "*", npm run test "*s", and npm run test "*'s" all generate suggested values with uppercase "E" characters. For example, the first tested misspelling Tepea generates a top suggested value of "TepeE". The plural Tepeas generates a top suggested value of "TepeEs". The possessive Tepea's generates a top suggested value of "TepeE's".
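One way to spot the symptom programmatically is to look for capital letters after the first character of a suggestion. This is only a rough sketch: `strayCapitals` is a hypothetical helper, not part of retext-spell or nspell, and it would also flag legitimate words with internal capitals.

```javascript
// Hypothetical check: collect capital letters that appear after the first
// character of a suggestion, e.g. the "E" in "TepeE".
function strayCapitals(suggestion) {
  return suggestion
    .slice(1)
    .split('')
    .filter((char) => char >= 'A' && char <= 'Z')
}

// Suggested values quoted in this issue, plus the expected "Tepee".
for (const suggestion of ['Tepee', 'TepeE', 'TepeEs', "TepeE's"]) {
  const stray = strayCapitals(suggestion)
  const verdict = stray.length === 0 ? 'ok' : 'unexpected: ' + stray.join('')
  console.log(suggestion, verdict)
}
```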

tvquizphd commented 3 years ago

In a similar vein, these misspellings of Tinpot result in an unexpected capitalized "O":

  1:1-1:6  warning  `Tinpb` is misspelt; did you mean `TinpOt` ... tinpb  retext-spell
  1:1-1:6  warning  `Tinpc` is misspelt; did you mean `TinpOt` ... tinpc  retext-spell
  1:1-1:6  warning  `Tinpd` is misspelt; did you mean `TinpOt` ... tinpd  retext-spell
  1:1-1:6  warning  `Tinpf` is misspelt; did you mean `TinpOt` ... tinpf  retext-spell
  1:1-1:6  warning  `Tinph` is misspelt; did you mean `TinpOt` ... tinph  retext-spell
  1:1-1:6  warning  `Tinpj` is misspelt; did you mean `TinpOt` ... tinpj  retext-spell
  1:1-1:6  warning  `Tinpl` is misspelt; did you mean `TinpOt` ... tinpl  retext-spell
  1:1-1:6  warning  `Tinpm` is misspelt; did you mean `TinpOt` ... tinpm  retext-spell
  1:1-1:6  warning  `Tinpq` is misspelt; did you mean `TinpOt` ... tinpq  retext-spell
  1:1-1:6  warning  `Tinpv` is misspelt; did you mean `TinpOt` ... tinpv  retext-spell
  1:1-1:6  warning  `Tinpx` is misspelt; did you mean `TinpOt` ... tinpx  retext-spell

On the other hand, these misspellings of Tinpot suggest the correct capitalization:

  1:1-1:6  warning  `Tinpo` is misspelt; did you mean `Tinpot`?  tinpo  retext-spell
  1:1-1:6  warning  `Tinpp` is misspelt; did you mean `Tinpot` ... tinpp  retext-spell
  1:1-1:6  warning  `Tinpu` is misspelt; did you mean `Tinpot` ... tinpu  retext-spell
  1:1-1:6  warning  `Tinpz` is misspelt; did you mean `Tinpot` ... tinpz  retext-spell

tvquizphd commented 3 years ago

It seems this problem is broader than initially recognized. I've discovered 190 additional 5-letter words suggested by retext-spell that include a single unexpected capital letter.

I've included a JSON file in the gist with one key per result returned by retext-spell with a single unexpected capital letter. Each key maps to the list of misspellings that produce it. Each misspelling is derived by replacing the middle character of a 5-letter dictionary word.
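The middle-character replacement can be sketched roughly as follows. `middleVariants` is a hypothetical helper, and the gist itself may capitalize or filter candidates differently, for example by dropping variants that are themselves dictionary words.

```javascript
// Hypothetical helper: replace the middle character of a 5-letter word
// with every other lowercase letter.
function middleVariants(word) {
  if (word.length !== 5) throw new RangeError('expected a 5-letter word')
  const middle = word[2]
  const variants = []
  for (const letter of 'abcdefghijklmnopqrstuvwxyz') {
    // Skip the original letter so the word itself is never returned.
    if (letter !== middle) {
      variants.push(word.slice(0, 2) + letter + word.slice(3))
    }
  }
  return variants
}

console.log(middleVariants('tribe').slice(0, 3))
```

Each 5-letter word yields 25 candidates under this assumption.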

I've counted 39 unexpected uppercase "R"s, 36 unexpected "C"s, 34 unexpected "B"s, 27 unexpected "T"s, 15 unexpected "E"s, and ten or fewer each of unexpected "M"s, "V"s, "O"s, "W"s, "Y"s, "N"s, "P"s, and "U"s.

tvquizphd commented 3 years ago

TLDR: see nspell issue 37 and nspell issue 41.

wooorm commented 3 years ago

Closing as the nspell PRs are released, and I’m assuming they fixed this!