vim / vim

The official Vim repository
https://www.vim.org
Vim License

wrap problem about combining English words and Chinese characters #2579

Open CoinCheung opened 6 years ago

CoinCheung commented 6 years ago

Hello, I believe I have run into a real issue with wrapping. When I write a mix of English and Chinese, I cannot guarantee that the line wraps at the correct position. Since Chinese characters are not separated by spaces, a run of Chinese text is treated as a single English word, and the line wraps between English words instead of between Chinese characters, like this:

line 1:  These are English words, followed by a many           
            Chinese这些是中文这些是中文这些是中文这些是中文

The correct break point should be among the Chinese characters, but here it falls before the English word "Chinese", because there is a space before it and the Chinese characters that follow are treated as letters appended to the word "Chinese". Even if I add a space after the English words, the line still wraps too early whenever the English text is too short to reach the end of the line.

I hope I have made myself clear. Is there an existing feature that fixes this, or is this something that should be added?

vim-ml commented 6 years ago

In reply to CoinCheung's message of Thu, Jan 25, 2018 at 1:34 AM:

I suppose there ought to be a space between the last Latin-script word (here, "Chinese") and the first hanzi (这); but of course this wouldn't take care of the fundamental problem, which is that wrapping rules are not the same for Latin and hanzi scripts.

I think that this falls under the heading "Vim is a plain text editor, not a WYSIWYG text processor"; in particular, AFAIK, there is no provision in Vim for breaking lines differently between Latin letters and hanzi, any more than there is for displaying Latin text LTR and Hebrew or Arabic text RTL in the same window (unless, in the latter case, Vim runs in an "intelligent terminal" such as mlterm, where the bidirectionality is handled by the terminal and not by Vim). If we knew about a terminal which could break lines "at whitespace characters within Latin text and anywhere between hanzi", a new entry might be made for this problem in the todo.txt helpfile; but until or unless such a terminal exists, I doubt that there is a solution, other than adding ZWNJ (zero-width non-joiner) characters at "likely" places between the hanzi.

Best regards, Tony.

jamessan commented 6 years ago

The todo list already has a few potentially related items in it.

7 Add plugins for formatting? Should be able to make a choice depending on the language of a file (English/Korean/Japanese/etc.). Setting the 'langformat' option to "chinese" would load the "format/chinese.vim" plugin. The plugin would set 'formatexpr' and define the function being called. Edward L. Fox explains how it should be done for most Asian languages. (2005 Nov 24) Alternative: patch for utf-8 line breaking. (Yongwei Wu, 2008 Feb 23)

Have a look at patch for utf-8 line breaking. (Yongwei Wu, 2008 Mar 1, Mar 23) Now at: http://vimgadgets.sourceforge.net/liblinebreak/

  • Support breakpoint character ? 0xb7 and ignore it? Makes it possible to use same wordlist for hyphenation.

vim-ml commented 6 years ago

In reply to James McCoy's message of Thu, Jan 25, 2018 at 2:25 AM, regarding the todo item "Support breakpoint character ? 0xb7 and ignore it?":

U+00B7 MIDDLE DOT is a printable character, it MUST NOT be ignored. Not only is it one of the characters which can be used as a bullet in unnumbered lists, it is also used in Catalan to differentiate the "geminated L", l·l as in col·lega "colleague" from the "palatalized L", ll as in lleidatà (relating to, or living in, the town Lleida, whose Castilian name used to be Lérida).

If we must use a zero-width word separator, let's use a character made for that purpose, for instance U+200B ZERO WIDTH SPACE. That character exists in UTF-8 of course, but also in GB18030, which is IIUC a "China-centered" encoding which can encode any Unicode codepoint and is (again, IIUC) required to be supported on all new software sold in mainland China.

Best regards, Tony.
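
To make the zero-width-separator idea above concrete, here is a minimal Python sketch that inserts U+200B between adjacent hanzi. It is only an illustration: the function name add_break_hints is hypothetical, "hanzi" is approximated by the CJK Unified Ideographs block (U+4E00 to U+9FFF), and nothing in this thread establishes that Vim or any particular terminal would treat U+200B as a break opportunity.

```python
import re

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Assumption: hanzi are approximated by the CJK Unified Ideographs block
# (U+4E00..U+9FFF); extension blocks and CJK punctuation are ignored.
# The pattern matches the empty position between two adjacent hanzi.
_BETWEEN_HANZI = re.compile(r"(?<=[\u4e00-\u9fff])(?=[\u4e00-\u9fff])")


def add_break_hints(text: str) -> str:
    """Return text with U+200B inserted between every pair of adjacent hanzi.

    Hypothetical helper for illustration; Latin text is left untouched.
    """
    return _BETWEEN_HANZI.sub(ZWSP, text)
```

Note that this changes the file contents, so the inserted characters would have to be stripped again before the text is used anywhere that does not expect them.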

mattn commented 6 years ago

As far as I can see, the issue is that we can't break the line at the boundary between English and non-English text. That is not related to Unicode character classes, so it may be possible to add an option for breaking the line there. But I think we would rather break lines by Unicode classes. (Sorry, off-topic.) Japanese has several character classes, such as hiragana, katakana, and kanji, so cursor motions can already stop at the boundaries between them. It should likewise be possible to break lines at the boundaries between those classes.
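
The class-based idea can be sketched in Python as follows. This is a rough illustration under stated assumptions, not how Vim classifies characters: the names char_class and split_by_class are hypothetical, and the classes are approximated by the Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF), and CJK Unified Ideographs (U+4E00 to U+9FFF) blocks.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Roughly classify one character by Unicode block (an approximation)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if ch.isspace():
        return "space"
    if ch.isascii() and ch.isalnum():
        return "latin"
    return "other"


def split_by_class(text: str) -> list[tuple[str, str]]:
    """Split text into maximal runs of one class.

    Each boundary between runs is a candidate position for a line break
    or a cursor stop.
    """
    return [(cls, "".join(run)) for cls, run in groupby(text, key=char_class)]
```

For example, split_by_class("Vimで日本語テキスト") yields a Latin run, a hiragana run, a kanji run, and a katakana run, in that order.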


yifeikong commented 3 years ago

One imperfect workaround is to set 'breakat' to an empty value; Vim will then break at the given length. However, English words are then also split across two lines, which is just as annoying.

Really hope this could be fixed.

yifeikong commented 3 years ago

When I write documentation, posts, or articles in Vim, almost all of my lines contain both English words and Chinese characters. I think the solution is not to add a new line-break option for Chinese or even for CJK languages in general, but to add an option that treats each Chinese character as a word, which also reflects how the language is actually used.

The bug here is not about line breaking as such; it is that Vim treats whole Chinese sentences as single words.

Also, this certainly does not break the "Vim is a plain text editor, not a WYSIWYG text processor" rule.
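
The difference between the two behaviours can be sketched in Python. This is a simplified simulation, not Vim's wrapping code: the names chunks and wrap and the cjk_as_words flag are hypothetical, a break is only allowed after a space unless the flag is set, Chinese characters are approximated by the block U+4E00 to U+9FFF, and a chunk wider than the line simply overflows.

```python
import unicodedata


def is_cjk(ch: str) -> bool:
    # Assumption: only the CJK Unified Ideographs block counts as Chinese.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def cell_width(s: str) -> int:
    """Display width in terminal cells: wide and fullwidth characters take 2."""
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1 for c in s)


def chunks(text: str, cjk_as_words: bool) -> list[str]:
    """Split text into unbreakable pieces; breaks may occur between pieces."""
    out, cur = [], ""
    for ch in text:
        after_space = bool(cur) and cur[-1] == " " and ch != " "
        at_cjk = bool(cur) and cjk_as_words and (is_cjk(ch) or is_cjk(cur[-1]))
        if after_space or at_cjk:
            out.append(cur)
            cur = ""
        cur += ch
    if cur:
        out.append(cur)
    return out


def wrap(text: str, width: int, cjk_as_words: bool = False) -> list[str]:
    """Greedy wrap to `width` cells; an oversized chunk is left to overflow."""
    lines, line = [], ""
    for chunk in chunks(text, cjk_as_words):
        if line and cell_width(line) + cell_width(chunk.rstrip(" ")) > width:
            lines.append(line.rstrip(" "))
            line = ""
        line += chunk
    if line:
        lines.append(line.rstrip(" "))
    return lines
```

With the text "a many Chinese这些是中文这些是中文" and a width of 30 cells, the default mode pushes "Chinese" together with all ten characters onto the second line, while cjk_as_words=True fills the first line completely and carries only "中文" over.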

k-takata commented 3 years ago

A similar issue in text formatting was fixed in patch 8.2.0901, so a similar implementation may be needed for wrapping.