mawww / kakoune

mawww's experiment for a better code editor
http://kakoune.org
The Unlicense
9.94k stars 716 forks

[REQUEST] Support unicode category selectors in regex for syntax highlighting #4411

Open Seelengrab opened 2 years ago

Seelengrab commented 2 years ago

Feature

Currently, the regex syntax available to syntax highlighters does not support Unicode category selectors such as \p{Lu}, which matches any uppercase letter, or its inverse \P{Lu}, which matches everything except uppercase letters. This is a request to add support for such selectors, since they would make working with Unicode codepoints in regexes much friendlier.
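To illustrate what is being asked for, here is a minimal sketch of the matching semantics in Python, using the standard `unicodedata` module. It is not how Kakoune's regex engine works; the helper names `in_category` and `not_in_category` are made up for this example, and the result depends on the Unicode version bundled with the Python build.

```python
import unicodedata


def in_category(ch: str, category: str) -> bool:
    # Rough stand-in for \p{...}: true if the general category of `ch`
    # starts with `category`, so "Lu" means uppercase letters and "L"
    # means any letter category.
    return unicodedata.category(ch).startswith(category)


def not_in_category(ch: str, category: str) -> bool:
    # Rough stand-in for the inverse selector \P{...}.
    return not in_category(ch, category)


# a, É, ß, Ω, ж, 9, _
sample = "a\u00c9\u00df\u03a9\u04369_"
upper = [c for c in sample if in_category(c, "Lu")]
not_upper = [c for c in sample if not_in_category(c, "Lu")]
print(upper, not_upper)
```

A highlighter regex could then say "anything except these categories" instead of enumerating codepoints.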

Usecase

The Julia programming language allows (almost) arbitrary Unicode in identifiers, with only a few Unicode categories disallowed for various reasons. Using Unicode categories to exclude rather than include makes the syntax highlighter quite a bit smaller and easier to maintain. Without them, the highlighter would have to spell out every single allowed character, effectively carrying around a full Unicode database on a per-syntax-highlighter basis.

Screwtapello commented 2 years ago

For other examples, Go defines an identifier as "a sequence of one or more letters and digits", where a "letter" is an underscore or a "Unicode letter", and the definition of a "Unicode letter" is "all characters in any of the Letter categories Lu, Ll, Lt, Lm, or Lo".
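As a rough sketch of how compact that definition becomes once categories are available, here it is in Python with the standard `unicodedata` module. The function names are hypothetical. Two details come from my reading of the Go spec rather than from the quote above: a "digit" is a character in category Nd, and the first character must be a letter. Keywords are not excluded.

```python
import unicodedata

# The five Letter categories named in the Go definition.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def is_go_letter(ch: str) -> bool:
    # A "letter" is an underscore or a character in a Letter category.
    return ch == "_" or unicodedata.category(ch) in LETTER_CATEGORIES


def is_go_digit(ch: str) -> bool:
    # Assumption: "digit" means Unicode category Nd (decimal digit).
    return unicodedata.category(ch) == "Nd"


def is_go_identifier(s: str) -> bool:
    # Assumption: the first character must be a letter; the rest may be
    # letters or digits. Reserved keywords are not checked here.
    if not s or not is_go_letter(s[0]):
        return False
    return all(is_go_letter(c) or is_go_digit(c) for c in s[1:])


for candidate in ["_x9", "\u03b1\u03b2", "9x", "a-b"]:
    print(repr(candidate), is_go_identifier(candidate))
```

With category selectors in the regex engine, the same rule would fit in a single short character class rather than a lookup table.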

Meanwhile, Rust defers to the identifier definition from UAX #31, where identifiers must start with a character having the "XID_Start" category, and all the remaining characters must have the "XID_Continue" category.

Of course, it's technically possible to just hard-code the list of characters with a given property into the regex, but that would get bulky and impractical to maintain quite quickly. Looking at the Unicode Character Database, there are 1831 characters in the "Uppercase Letter" category, and even if you merge adjacent codepoints into ranges (like \u000041-\u00005A for ASCII uppercase) that's still 646 singletons and ranges.
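For anyone who wants to reproduce that kind of count, here is a small sketch in Python that collects every codepoint in a category and merges adjacent codepoints into ranges. The helper name `category_ranges` is made up for this example, and the totals it prints depend on the Unicode version shipped with the Python build, so they may not match the figures quoted above.

```python
import sys
import unicodedata


def category_ranges(category: str) -> list[tuple[int, int]]:
    # Walk every codepoint and merge runs of adjacent codepoints that
    # share the given general category into inclusive (first, last) pairs.
    ranges = []
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) != category:
            continue
        if prev is not None and cp == prev + 1:
            prev = cp  # extend the current run
            continue
        if start is not None:
            ranges.append((start, prev))  # close the previous run
        start = prev = cp
    if start is not None:
        ranges.append((start, prev))
    return ranges


lu_ranges = category_ranges("Lu")
lu_total = sum(last - first + 1 for first, last in lu_ranges)
print("Unicode version:", unicodedata.unidata_version)
print("Lu characters:", lu_total)
print("Singletons and ranges:", len(lu_ranges))
```

The first entry is the ASCII uppercase range (0x41 to 0x5A); the hundreds of entries after it are what a hard-coded regex would have to carry.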

On the other hand, tools generally only support one version of Unicode at a time. If Kakoune were built with Unicode version X and used to edit scripts for a Julia interpreter that expects Unicode version Y, that mismatch could cause problems. It's pretty unlikely given that Unicode is extremely backwards compatible, and all the most useful characters have already been added, but it's possible.

#3059 is another issue that would require Kakoune to bundle a Unicode character database.

lenormf commented 2 years ago

Related #1447.

Seelengrab commented 2 years ago

As long as Kakoune doesn't mangle the input provided by the user, a mismatch in Unicode version shouldn't matter much for syntax highlighting or code searching purposes, imo. At worst, some characters will get the wrong highlight. I'm not sure how extensively or deeply the regex part of Kakoune is used internally, or for what, so I can't comment on that.

Julia tends to be pretty up to date with the latest version of Unicode (currently using version 13, since 14 was only released about a month ago and utf8proc hasn't been updated yet), so this would definitely be a boon.