Open Seelengrab opened 2 years ago
For other examples, Go defines an identifier as "a sequence of one or more letters and digits", where a "letter" is an underscore or a "Unicode letter", and the definition of a "Unicode letter" is "all characters in any of the Letter categories Lu, Ll, Lt, Lm, or Lo".
Meanwhile, Rust defers to the identifier definition from UAX #31, where identifiers must start with a character having the "XID_Start" property, and all the remaining characters must have the "XID_Continue" property.
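To make the difference between the two definitions concrete, here is a rough Python sketch. The helper names `is_go_letter` and `is_go_identifier` are made up for illustration; they follow the Go wording quoted above, plus the Go spec's rule that digits are category Nd and that the first character must be a letter. For the UAX #31 side, Python's own `str.isidentifier()` is used as a stand-in, on the assumption that it is close enough to XID_Start/XID_Continue for a demonstration; it is not Rust's actual lexer.

```python
import unicodedata

# Categories Go counts as a "Unicode letter" (per the quote above).
GO_LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def is_go_letter(ch):
    """Hypothetical helper: underscore or a character in a Letter category."""
    return ch == "_" or unicodedata.category(ch) in GO_LETTER_CATEGORIES


def is_go_identifier(s):
    """Hypothetical helper: a letter followed by letters or Nd digits."""
    if not s or not is_go_letter(s[0]):
        return False
    return all(
        is_go_letter(ch) or unicodedata.category(ch) == "Nd" for ch in s[1:]
    )


# "e" followed by U+0301 COMBINING ACUTE ACCENT (category Mn).
decomposed = "e\u0301"

# Mn is not a Letter category, so the Go-style check rejects it ...
print(is_go_identifier(decomposed))
# ... while a UAX #31 style check accepts combining marks after the start.
print(decomposed.isidentifier())
```

Both approaches are written in terms of character properties rather than literal character lists, which is exactly what a regex engine without property support cannot express compactly.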
Of course, it's technically possible to just hard-code the list of characters with a given property into the regex, but that would get bulky and impractical to maintain quite quickly. Looking at the Unicode Character Database, there are 1831 characters in the "Uppercase Letter" category, and even if you merge adjacent codepoints into ranges (like \u000041-\u00005A for ASCII uppercase), that's still 646 singletons and ranges.
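For anyone who wants to reproduce that kind of count, a minimal sketch using Python's `unicodedata` module follows. The function names are invented for this example, and the numbers it prints depend on the Unicode version bundled with the Python build (see `unicodedata.unidata_version`), so they may not match the figures above exactly.

```python
import sys
import unicodedata


def codepoints_in_category(category):
    """Return every codepoint whose general category equals `category`."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    ]


def merge_ranges(codepoints):
    """Merge a sorted list of codepoints into inclusive (first, last) ranges."""
    ranges = []
    for cp in codepoints:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp  # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run (possibly a singleton)
    return [tuple(r) for r in ranges]


upper = codepoints_in_category("Lu")
ranges = merge_ranges(upper)
print("Unicode", unicodedata.unidata_version)
print(len(upper), "characters,", len(ranges), "singletons and ranges")
print("first range: U+%04X-U+%04X" % ranges[0])
```

Even after merging, the list is long enough that pasting it into every highlighter's regex is not appealing.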
On the other hand, tools generally only support one version of Unicode at a time. If Kakoune were built with Unicode version X and used to edit scripts for a Julia interpreter that expects Unicode version Y, that mismatch could cause problems. It's pretty unlikely given that Unicode is extremely backwards compatible, and all the most useful characters have already been added, but it's possible.
Related #1447.
As long as Kakoune doesn't mangle the input provided by the user, a mismatch in Unicode version shouldn't matter much for syntax highlighting or code searching purposes, imo. Some characters will get the wrong highlight, but other than that it shouldn't matter. I'm not sure how extensively/deeply the regex part of Kakoune is used internally, or for what, so I can't comment on that.
Julia tends to be pretty up to date with the latest version of Unicode (currently using version 13, since 14 was only released about a month ago and utf8proc hasn't been updated yet), so this would definitely be a boon.
Feature
Currently, the regex selection for syntax highlighters does not support selectors for Unicode categories, such as \p{Lu}, which matches all uppercase letters. The same is true for its inverse (\P{Lu}), which matches everything except uppercase letters. This is a request to add support for such selectors, as it would make working with Unicode codepoints in regex much friendlier and easier.
Usecase
The Julia programming language allows (almost) arbitrary Unicode in identifiers, with only a few Unicode categories disallowed for various reasons. Using Unicode categories to exclude rather than include makes the syntax highlighter quite a bit smaller and easier to maintain. Without that support, the highlighter would have to write out every single allowed Unicode character, effectively carrying around a full Unicode database on a per-syntax-highlighter basis.
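As a rough illustration of the exclude-by-category idea, here is a small Python sketch. Python's built-in `re` module has no \p{...} support either, so the selectors are emulated with `unicodedata`. The helper names and the `EXCLUDED` list are assumptions made up for this example; they are not Julia's actual identifier rules and not Kakoune's regex API.

```python
import unicodedata


def matches_p(ch, category):
    """Emulate \\p{category}; a one-letter category such as "L" matches its whole group."""
    return unicodedata.category(ch).startswith(category)


def matches_P(ch, category):
    """Emulate the inverse selector \\P{category}."""
    return not matches_p(ch, category)


# Hypothetical exclusion list (separators and control/other characters).
# This is NOT the real Julia rule set, only a stand-in for the idea.
EXCLUDED = ("Z", "C")


def allowed_in_identifier(ch):
    """Accept anything that is not in one of the excluded category groups."""
    return all(matches_P(ch, category) for category in EXCLUDED)


for ch in ["Ä", "ä", "α", " ", "\n"]:
    print(repr(ch), unicodedata.category(ch), allowed_in_identifier(ch))
```

With category selectors in the regex engine, the equivalent highlighter pattern would shrink to a short negated class instead of a hand-maintained list of ranges.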