microsoft / vscode

Visual Studio Code
https://code.visualstudio.com
MIT License

Feature request: Treat the Chinese text as a Chinese sequence when using `Ctrl+Left/Right` #50045

Closed imhuay closed 2 months ago

imhuay commented 6 years ago

Currently VS Code treats a long run of Chinese text as one “word”. Each time you press Ctrl+Left/Right, the cursor moves to the beginning or end of the whole run.

The feature request is to treat Chinese text as a sequence of characters, so that each Ctrl+Left/Right moves the cursor just one step. This is the default behavior of the system's text programs.

Example: (use | as the cursor )

|本文的学习公式
// Ctrl+Right
本文的学习公式|

Expected:

|本文的学习公式
// Ctrl+Right
本|文的学习公式
// Ctrl+Right
本文|的学习公式
// Ctrl+Right
本文的|学习公式

(Of course, it would be better if it could support word segmentation.)

gdh1995 commented 6 years ago

It would be better if VS Code supported word segmentation the way Chrome does, although I know that this requires a large dictionary and increases the package size a lot. But if it won't segment words, then I suggest that it keep moving the cursor one sentence at a time instead of one character - personally, I think jumping one Chinese character per <Ctrl+Right> is too slow.

imhuay commented 6 years ago

However, it doesn't move one sentence at a time. In fact, it can't recognize Chinese punctuation either.

Examples:

|output gate 会影响结果,因此该模型有两个版本,分别为是否使用
// <Ctrl+Right>
output| gate 会影响结果,因此该模型有两个版本,分别为是否使用
// <Ctrl+Right>
output gate| 会影响结果,因此该模型有两个版本,分别为是否使用
// <Ctrl+Right>
output gate 会影响结果,因此该模型有两个版本,分别为是否使用|

smikitky commented 5 years ago

This is a longstanding problem which virtually all East-Asian developers will notice once they start editing natural sentences (say, in Markdown) on vscode. I think this is fundamentally a problem of wrong word-splitting for CJK languages (and perhaps Thai, too), which use no spaces to delimit words. A similar problem happens when you double-click a word in a line (the whole line will be selected instead of the target word) and when you trigger an autocompletion using Ctrl+Space (a whole line will be shown as a candidate).

Ideally, dictionary-based word segmentation is desirable (this is available on MS Word, Google Chrome browser, etc), but it's not 100% correct, and I'm not sure if it is really necessary for a code editor. Another practical approach that works at least in Japanese is to split words based on character types, because a typical Japanese text is a mixture of kanji, hiragana and katakana (This algorithm is implemented on most domestic text editors and even MS Notepad.exe). Character types can be easily determined via Unicode code points.

Example:

(1) 吾輩は猫である。名前はまだない。
(2) 吾輩|は|猫|で|ある|。|名前|は|まだ|ない|。
(3) 吾輩|は|猫|である|。|名前|はまだない|。

(1): Natural Japanese text with two sentences. 。 is a Japanese period.
(2): Dictionary-based word boundaries (|), available on MS Word, Chrome, etc.
(3): Codepoint-based kana-kanji boundaries, available on Firefox, Notepad.exe, etc.
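
As a rough illustration of approach (3), the sketch below splits text wherever the character type changes. It assumes a JavaScript engine that supports Unicode property escapes (`\p{Script=...}`). The names `charType` and `splitByCharType` are hypothetical and are not part of VS Code or of any extension mentioned in this thread.

```javascript
// Classify one character into a coarse type (hypothetical helper).
function charType(ch) {
  if (/\p{Script=Han}/u.test(ch)) return "han";
  if (/\p{Script=Hiragana}/u.test(ch)) return "hiragana";
  if (/\p{Script=Katakana}/u.test(ch)) return "katakana";
  if (/[\p{L}\p{N}_]/u.test(ch)) return "alnum"; // Latin letters, digits, etc.
  if (/\s/u.test(ch)) return "space";
  return "punct"; // everything else, including full-width punctuation
}

// Split text into runs of characters that share the same type.
function splitByCharType(text) {
  const parts = [];
  let prevType = null;
  for (const ch of text) { // iterates by code point, not by UTF-16 unit
    const type = charType(ch);
    if (type === prevType) {
      parts[parts.length - 1] += ch;
    } else {
      parts.push(ch);
    }
    prevType = type;
  }
  return parts;
}

console.log(splitByCharType("吾輩は猫である。名前はまだない。").join("|"));
// prints: 吾輩|は|猫|である|。|名前|はまだない|。
```

This reproduces line (3) of the example without any dictionary, which is why the approach is cheap. It cannot find word boundaries inside a run of a single character type.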

There is already a popular extension that does (3) above for Japanese text. Unfortunately, it works on Ctrl+Left/Right but nowhere else. It does not work on double-clicks, Ctrl+D, autocompletion, text search, and so on.

Personally, I think (3) should be implemented as part of the basic functionality of VSCode, considering the fact that it's available in any other decent text editor. A dictionary-based solution (2) may be too costly within the main vscode repository, but I hope there is a way to allow extension developers to override the word-boundary detection algorithm or the double-click behavior.


By the way, in the meantime, you can alleviate this problem by tweaking the "editor.wordSeparators" setting and adding multibyte punctuation marks such as 。 and 、. With this, you can at least stop the cursor at (double-byte) periods and commas when using Ctrl+Left/Right.
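
A minimal sketch of that workaround in `settings.json` might look like the following. It assumes the usual default separator string and appends a handful of full-width marks. Which marks to add is an illustrative choice, not a recommendation from this thread.

```jsonc
{
  // Default separators, followed by full-width punctuation (illustrative selection).
  "editor.wordSeparators": "`~!@#$%^&*()-=+[{]}\\|;:'\",.<>/?。、,.!?:;「」()"
}
```

Overriding the setting replaces the whole string, so the default ASCII separators have to be repeated alongside the added marks.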

smikitky commented 5 years ago

So I searched for related issues regarding CJK text navigation. I learned that "selection/navigation via double-click/keyboard" and "extracting words for autocompletion" are technically two different areas, but they are conceptually related anyway.

Keyboard navigation & Double click:

Word extraction for autocompletion:

So in conclusion, IMHO vscode should (by default, regardless of the language) assume there is a word boundary when the character type changes between "Latin alphabet/number", "CJK unified ideograph (hanja/kanji)", "Punctuation marks (incl. multibyte ones)", "Japanese hiragana" and "Japanese katakana", even if there is no space. In addition, when Ctrl+Right is pressed inside a sequence of multiple "CJK unified ideographs", Chinese users (seem to) want the cursor to move by one character, whereas Japanese users usually want the cursor to move to the end of the sequence, as described by (3) above. This may have to be configurable with locale-based default values.

// Japanese
これは日|本語の文章
// Ctrl+Right
これは日本語|の文章

// Chinese
本文|的学习公式
// Ctrl+Right
本文的|学习公式
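
The configurable behavior described above could be sketched roughly as follows. This is only an illustration: `nextStop` and its `hanByChar` flag are hypothetical names, positions are counted in code points, and whitespace skipping is left out for brevity.

```javascript
// Coarse character classifier (hypothetical helper, same idea as approach (3)).
function charType(ch) {
  if (/\p{Script=Han}/u.test(ch)) return "han";
  if (/\p{Script=Hiragana}/u.test(ch)) return "hiragana";
  if (/\p{Script=Katakana}/u.test(ch)) return "katakana";
  if (/[\p{L}\p{N}_]/u.test(ch)) return "alnum";
  if (/\s/u.test(ch)) return "space";
  return "punct";
}

// Return the cursor position after one Ctrl+Right, starting from `pos`.
// hanByChar = true  : step one ideograph at a time (the Chinese preference above)
// hanByChar = false : jump to the end of the ideograph run (the Japanese preference)
function nextStop(text, pos, hanByChar) {
  const chars = Array.from(text); // index by code point
  if (pos >= chars.length) return chars.length;
  const type = charType(chars[pos]);
  if (type === "han" && hanByChar) return pos + 1;
  let end = pos + 1;
  while (end < chars.length && charType(chars[end]) === type) end++;
  return end;
}

console.log(nextStop("これは日本語の文章", 4, false)); // 6: これは日本語|の文章
console.log(nextStop("本文的学习公式", 2, true));      // 3: 本文的|学习公式
```

A locale-based default would then only need to choose the value of `hanByChar`.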

rebornix commented 5 years ago

@smikitky thanks for your detailed investigation ;) IMO word navigation should work seamlessly with CJKV, as ASCII word separators can't handle CJKV words. I do have a prototype that delegates the word segmentation to the browser instead of dealing with that ourselves, and I will work on it in the near term, stay tuned.

WangLeto commented 5 years ago

I'd like to remind you of Ctrl + Delete, which I think may share the same logic as Ctrl + Arrow, and is even more upsetting because you can easily delete too many characters by accident.

smikitky commented 4 years ago

@rebornix This issue was once included in iteration plans, but I'm seeing no recent activity. Since we're nearing the end of the housekeeping iteration, can I ask if you have any update on this?

rebornix commented 4 years ago

Let's see if we can have time for it during holiday time.

yuboona commented 3 years ago

Is there any progress now, guys?

yuboona commented 3 years ago

I found a VS Code extension, CJK word handler. Will it be officially adopted? @kieferrm @rebornix

simonmysun commented 8 months ago

Any updates?

Actually, any Chromium-based application bundles a proper word segmentation library. You can try it in, e.g., a file-renaming input box with the common key bindings. However, this does not seem to be available through the JavaScript interface: https://bugs.chromium.org/p/chromium/issues/detail?id=129706 .

While I am appreciative of japanese-word-handler by @sgryjp and CJK word handler by @SharzyL, it seems a bit redundant to include another segmentation JavaScript library, especially when we already have a much faster one in C++. I wonder if an alternative workaround would work: copy the current line to a hidden input box and synchronize the cursor movement.

yume-chan commented 8 months ago

The segmenter is available in JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter

For example:

console.log(
  JSON.stringify(
    Array.from(
      new Intl.Segmenter("en", { granularity: "word" }).segment(
        "本文的学习公式",
      ),
    ),
    undefined,
    4,
  ),
);

(In my test, it can segment any CJK language no matter which locale is specified in the constructor)

outputs

[
    {
        "segment": "本文",
        "index": 0,
        "input": "本文的学习公式",
        "isWordLike": true
    },
    {
        "segment": "的",
        "index": 2,
        "input": "本文的学习公式",
        "isWordLike": true
    },
    {
        "segment": "学习",
        "index": 3,
        "input": "本文的学习公式",
        "isWordLike": true
    },
    {
        "segment": "公式",
        "index": 5,
        "input": "本文的学习公式",
        "isWordLike": true
    }
]
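
Building on the output above, a Ctrl+Right stop could be derived from the segments roughly like this. It is a sketch only: `nextWordEnd` is a hypothetical name, not a VS Code API, and the exact boundaries for CJK text depend on the segmentation data shipped with the JavaScript engine.

```javascript
// Return the end offset (in UTF-16 code units) of the first word-like
// segment that ends after `pos`, or text.length if there is none.
function nextWordEnd(text, pos, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  for (const { segment, index, isWordLike } of segmenter.segment(text)) {
    const end = index + segment.length;
    if (isWordLike && end > pos) return end; // skips spaces and punctuation
  }
  return text.length;
}

console.log(nextWordEnd("hello world", 0)); // 5
console.log(nextWordEnd("hello world", 5)); // 11
console.log(nextWordEnd("本文的学习公式", 0)); // end of the first segment the engine reports
```

Creating a new `Intl.Segmenter` on every call keeps the sketch short. A real implementation would likely cache one segmenter per locale.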

yutotnh commented 7 months ago

The segmenter is available in JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter

Just recently I developed an extension that takes advantage of it.

I really wanted to make a pull request to the main body of VS Code, but my technical capabilities were limited to releasing it as an extension.

yutotnh commented 7 months ago

I was able to add functionality to VS Code itself, so I created a pull request. The pull request created is #203605.

Now that it can be integrated within a process, it can do more than just be an extension, such as being able to select words with a double-click.

rinzwind5 commented 6 months ago

I was surprised, since VS Code is based on browser technology and browsers handle this stuff great. There is also an issue with Thai: คนไทยที่นับถือศาสนาพุทธเกินห้าสิบเปอร์เซ็นต์ should be broken down as คน ไทย ที่ นับถือ ศาสนา พุทธ เกิน ห้า สิบ เปอร์เซ็นต์

https://fuqua.io/thai-word-split/browser/

This makes VS Code, at the very least, unusable as a generic text editor for these languages. Notepad does work fine btw (so Windows has native support, as expected).

alexdima commented 2 months ago

Thanks to https://github.com/microsoft/vscode/pull/203605, it is now possible to configure editor.wordSegmenterLocales to define the locales to be used for word segmenting.
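
For illustration, a `settings.json` entry might look like the fragment below. The value shape (a list of locale tags) and the specific locales are assumptions made for this example. Check the setting's description in the editor for the accepted format.

```jsonc
{
  // Assumed value shape: a list of BCP 47 locale tags (illustrative choice).
  "editor.wordSegmenterLocales": ["ja", "zh-CN"]
}
```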