vim / vim

The official Vim repository
https://www.vim.org
Vim License

Ship UTF-8 affix files for spell checking #3747

Open tangentsoft opened 5 years ago

tangentsoft commented 5 years ago

For English, at least, Vim currently does not ship spell/en/main.aff, which means that text like the following is flagged with a spelling error:

 I don‘t like Emacs.

Note the curled Unicode quote.

Adding the following as the content of a new file called .../spell/en/main.aff fixes it:

SET utf-8
MIDWORD '-‘

I want this done in the main project because some packaging schemes cause local changes like that to be lost on updates. My immediate use case is MacVim, where /Applications/MacVim.app/ is deleted before the new one is unpacked, but this may happen with other packaging schemes.

I realize this will cause Vim to assume UTF-8, but I think that's been a safe default for years now.

EDIT: I've just discovered that I can solve the packaging problem by saving the file as ~/.vim/spell/en/main.aff, which was not clear to me from my skimming of :help spell. (I say "skim," but I must have spent half an hour on it yesterday. I just mean that I didn't read it word-for-word.) I don't consider this a complete solution to this issue, though, because I still think this should be part of the current distribution of Vim.

brammool commented 5 years ago

Warren Young wrote:

For English, at least, Vim currently does not ship spell/en/main.aff, which means that text like the following is flagged with a spelling error:

 I don‘t like Emacs.

Note the curled Unicode quote.

Adding the following as the content of a new file called .../spell/en/main.aff fixes it:

SET utf-8
MIDWORD '-‘

What is main.aff? I don't see it used anywhere. There are several .aff files, e.g. en_US.aff.

The encoding mentioned here is the encoding of the spell file. It is already utf-8: SET UTF-8

You also add the dash here which I think is incorrect. The dash already is a word character, also when it's at the start or end of a word.

I want this done in the main project because some packaging schemes cause local changes like that to be lost on updates. My immediate use case is MacVim, where /Applications/MacVim.app/ is deleted before the new one is unpacked, but this may happen with other packaging schemes.

I realize this will cause Vim to assume UTF-8, but I think that's been a safe default for years now.

No, it only specifies the encoding of the spell files. So yes, it works fine.

Perhaps you were referring to the Mac version of Vim? I would not know why it has different spell files.

-- From "know your smileys": |-P Reaction to unusually ugly C code

/// Bram Moolenaar -- Bram@Moolenaar.net -- http://www.Moolenaar.net \\ /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\ \\ an exciting new programming language -- http://www.Zimbu.org /// \\ help me help AIDS victims -- http://ICCF-Holland.org ///

tangentsoft commented 5 years ago

The original posting is based on some incorrect thinking.

The primary one is that my chosen example is bad: "don" is an English word, so I was misled into thinking my proposed fix helps. Let's use a different example text:

I couldn’t do that in Emacs.

That gets flagged as a spelling error because "couldn" isn't an English word.
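
To illustrate why the curly quote splits the word, here is a rough Python sketch of the MIDWORD idea: a character listed as "midword" counts as part of a word only when it sits between two letters. This is not Vim's actual spell code; split_words and its midword parameter are hypothetical names made up for this illustration.

```python
def split_words(text, midword=""):
    """Split text into words.

    Letters are always word characters. Characters listed in `midword`
    count as word characters only when a letter sits on both sides.
    (Hypothetical helper for illustration; not Vim's implementation.)
    """
    words = []
    current = ""
    for i, ch in enumerate(text):
        prev_is_letter = i > 0 and text[i - 1].isalpha()
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if ch.isalpha() or (ch in midword and prev_is_letter and next_is_letter):
            current += ch
        elif current:
            # A non-word character ends the word collected so far.
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


sentence = "I couldn\u2019t do that in Emacs."  # \u2019 is the curly apostrophe

# Only the ASCII apostrophe is a midword character: the word breaks apart.
print(split_words(sentence, midword="'"))

# With U+2019 added as a midword character, the contraction stays whole.
print(split_words(sentence, midword="'\u2019"))
```

With only the ASCII apostrophe listed, the checker would see "couldn" and "t" as separate words; with U+2019 added, it would see the single word "couldn’t", which then still has to exist in the word list with that same apostrophe.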

Now we're left with new problems, the primary one being that my main.aff fix is ineffective. More skimming and searching in :help spell tells me that this is because the affix file is only used by mkspell, and that "only developers need to know about it."

From that I infer that what's needed isn't for Vim to ship these affix files or for it to provide a way for normal end users to supply their own local version, but instead for the ones Vim developers use on their end to be modified to account for Unicode curly quotes in contractions and such.

This isn't about English specifically or even about English contractions. I assume it applies widely, such as to French m’aidez.

Perhaps you were referring to the Mac version of Vim? I would not know why it has different spell files.

I filed the issue first against MacVim, but they closed it and sent me here. From that I assume they're not doing any MacVim-specific customization to Vim's spell checking mechanism.

brammool commented 5 years ago

Warren Young wrote:

The original posting is based on some incorrect thinking.

The primary one is that my chosen example is bad: "don" is an English word, so I was misled into thinking my proposed fix helps. Let's use a different example text:

I couldn’t do that in Emacs.

That gets flagged as a spelling error because "couldn" isn't an English word.

Now we're left with new problems, the primary one being that my main.aff fix is ineffective. More skimming and searching in :help spell tells me that this is because the affix file is only used by mkspell, and that "only developers need to know about it."

Correct, the spell code needs to know exactly which characters make up a correct word. That is processed into a complicated data structure used to find spelling mistakes (actually it finds correct spellings, and what's left are mistakes).

From that I infer that what's needed isn't for Vim to ship these affix files or for it to provide a way for normal end users to supply their own local version, but instead for the ones Vim developers use on their end to be modified to account for Unicode curly quotes in contractions and such.

This isn't about English specifically or even about English contractions. I assume it applies widely, such as to French m’aidez.

Which quotes are valid inside which words is language specific. The normal single quote is used by most languages; this special kind of quote added by Unicode is more specific and is only valid in a number of languages.

Perhaps you were referring to the Mac version of Vim? I would not know why it has different spell files.

I filed the issue first against MacVim, but they closed it and sent me here. From that I assume they're not doing any MacVim customization to Vim's spell checking mechanism.

OK, I was thinking that MacVim doesn't have Mac-specific spell checking.

-- From "know your smileys": (X0||) Double hamburger with lettuce and tomato


dpelle commented 5 years ago

I think that the *.aff file should contain something like this (among other ICONV rules):

ICONV ’ '

At least recent French Hunspell files have this. But unfortunately, Vim's import of Hunspell files does not yet take ICONV into account.

brammool commented 5 years ago

Dominique wrote:

I think that the *.aff file should contain something like this (among other ICONV rules):

ICONV ’ '

At least recent French Hunspell files have this. But unfortunately, Vim's import of Hunspell files does not yet take ICONV into account.

Is there documentation about what ICONV does exactly?

-- Support your right to bare arms! Wear short sleeves!


dpelle commented 5 years ago

@brammool wrote:

Is there documentation about what ICONV does exactly?

From https://linux.die.net/man/4/hunspell:

ICONV pattern pattern2
    Define input conversion table.

My understanding is that Hunspell transforms input using ICONV rules before probing the dictionary, so that various apostrophes can become the regular ' apostrophe, for example.
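
As a rough illustration of that reading of ICONV, here is a small Python sketch that collects ICONV pairs from affix-file text and rewrites an input word before it would be looked up. It is based only on the one-line man page description above and is neither Hunspell's nor Vim's code; parse_iconv and apply_iconv are hypothetical names, and trying the longest pattern first is an assumption.

```python
def parse_iconv(aff_text):
    """Collect (pattern, replacement) pairs from the ICONV lines of affix text."""
    rules = []
    for line in aff_text.splitlines():
        fields = line.split()
        # "ICONV <count>" has two fields and is skipped; rules have three.
        if len(fields) == 3 and fields[0] == "ICONV":
            rules.append((fields[1], fields[2]))
    return rules


def apply_iconv(word, rules):
    """Rewrite `word`, trying the longest pattern first at each position.

    Longest-first matching is an assumption made for this sketch.
    """
    ordered = sorted(rules, key=lambda rule: len(rule[0]), reverse=True)
    out = []
    i = 0
    while i < len(word):
        for pattern, replacement in ordered:
            if word.startswith(pattern, i):
                out.append(replacement)
                i += len(pattern)
                break
        else:
            # No rule matched here: copy the character unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


# \u2019 is the curly apostrophe, \ufb01 is the "fi" ligature.
aff_text = "ICONV 2\nICONV \u2019 '\nICONV \ufb01 fi\n"
rules = parse_iconv(aff_text)

print(apply_iconv("couldn\u2019t", rules))  # -> couldn't
print(apply_iconv("m\u2019aidez", rules))   # -> m'aidez
```

Under this reading, the dictionary only needs the ASCII-apostrophe spelling, and the conversion happens on the text being checked.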

You can see many ICONV rules in this French dictionary, covering apostrophes, ligatures, and various ways of writing diacritics:

https://github.com/titoBouzout/Dictionaries/blob/master/French.aff

ICONV 38
ICONV ’ '
ICONV ﬃ ffi
ICONV ﬄ ffl
ICONV ﬀ ff
ICONV ﬁ fi
ICONV ﬂ fl
ICONV à à
ICONV â â
ICONV ä ä
ICONV é é
ICONV è è
ICONV ê ê
etc.

dpelle commented 5 years ago

The documentation of ICONV at https://linux.die.net/man/4/hunspell is short and a bit vague. Let's ask the author of Hunspell @laszlonemeth what ICONV is for exactly, and whether ICONV is the right way to recognize apostrophe ‘ or '.

laszlonemeth commented 5 years ago

Indeed, in Hunspell, you can convert Unicode or typographical apostrophe ’ (U+2019) to the ASCII one (') using ICONV in the case of UTF-8 encoded dictionary stems (there is a “SET UTF-8” in the affix file). But I don't know the ICONV support of mkspell (Vim’s version of MySpell/Hunspell developed by Bram Moolenaar).

There is no ideal solution, because it's still common to use ASCII apostrophes in plain text files, but that has already been a typographical error in document editing.

With a UTF-8 encoded dictionary, you can store the correct typographical apostrophes in the dic file, and optionally, add a

ICONV 1
ICONV ' ’

definition to the dictionary to recognize and accept the words with ASCII apostrophes automatically. Otherwise it’s worth using MAP or REP to recognize and correct the words with ASCII apostrophes.

Note: The future is to use the typographical one everywhere, but that's not easy in document editors either (for example, modifying the shortcut Shift-1 to type the typographical apostrophe instead of the ASCII one in LibreOffice resulted in some surprising problems. The last one: https://bugs.documentfoundation.org/show_bug.cgi?id=108423).

shadyalfred commented 2 months ago

Any updates on the best way to achieve this? Here is what I did:

  1. open vim
  2. :spelldump
  3. save the buffer to ~/.config/vim/en/main.aff file
  4. add the two lines SET utf-8 and MIDWORD '-’ to the beginning of the file.

It works fine.