osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License

Question about unicode categories #67

Open nalply opened 10 months ago

nalply commented 10 months ago

I need to match a token which contains unicode scalar values in the categories L, M, N, P, S and Cf.

I see three different ways to solve this:

  1. I wrote a small tool to generate the character set (about 15 kB of text) and include it
  2. I use a simpler character set and verify with a regex in the semantic action
  3. I hack lexgen

@osa1, what do you think?

osa1 commented 10 months ago

> I wrote a small tool to generate the character set (about 15 kB of text) and include it

Is this text full of unique characters, or do you extract the characters from the text?

15 KB of characters is just huge, compile times will be terrible.

How many characters are there in each of these categories?

> I use a simpler character set and verify with a regex in the semantic action

This may be a bit error prone, because you can't jump to the next semantic action from within a semantic action; once you're in a semantic action, you can only go back to the beginning of a state. Example:

```
// A two-character sequence.
_ _ => |lexer| {
    // If you don't like the matched characters here, there's no way to run the
    // next semantic action below.
    ...
},

// There's no way to run this semantic action because of the rule above.
"ab" => |lexer| {
    ...
},
```

If you're OK with this limitation, then I think this should work.

> I hack lexgen

This works too, and we could even consider including these Unicode categories as built-ins, similar to the $$XID_Start and $$XID_Continue built-in character sets. See here for the implementation of built-in character sets. The character ranges are generated by this program.
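The generator program itself isn't reproduced here, but the core idea behind such tables — collapsing a sorted character set into inclusive ranges — can be sketched in a few lines (the function name is mine, not lexgen's):

```rust
/// Merge a sorted, deduplicated list of scalar values into inclusive
/// ranges, the way a built-in character-set table might be generated.
fn to_ranges(chars: &[u32]) -> Vec<(u32, u32)> {
    let mut ranges: Vec<(u32, u32)> = Vec::new();
    for &c in chars {
        match ranges.last_mut() {
            // Extend the current range when the next scalar is adjacent.
            Some((_, end)) if *end + 1 == c => *end = c,
            // Otherwise start a new single-scalar range.
            _ => ranges.push((c, c)),
        }
    }
    ranges
}

fn main() {
    // 'a'..='z' plus a lone 'é' collapses into just two ranges.
    let mut set: Vec<u32> = ('a'..='z').map(|c| c as u32).collect();
    set.push('é' as u32);
    println!("{:?}", to_ranges(&set)); // [(97, 122), (233, 233)]
}
```

Range tables like this keep the generated matcher small even when the underlying set has thousands of members.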

I've never heard of these categories before and I don't know how useful they are to an average user though. I don't know if it's worth including them in the library.

nalply commented 10 months ago

> Is this text full of unique characters, or do you extract the characters from the text?

15 kB is just the source code of the character set.

> I've never heard of these categories before and I don't know how useful they are to an average user though. I don't know if it's worth including them in the library.

I understand.

Perhaps you'll find the following explanation interesting.

  - L is the category of all letters and letter-like scalars.
  - M is the category of all marks, for example combining accents; some languages even have surrounding marks.
  - N is the category of all digits: besides the Arabic digits, some languages in India use different digits, and there are super- and subscript digits.
  - P is the category of all punctuation.
  - S is the category of all symbols (like the dollar symbol).
  - Cf is a subcategory of control codes containing formatting codes. I include it because some combining emoji use the zero-width joiner, which is in category Cf; the soft hyphen is also in Cf. In other words, this is a token which can contain emoji, even complicated ones!

Another way to look at this: any Unicode scalar is allowed except separators (like space, carriage return, line feed, form feed) and control characters (like NUL, Ctrl-G aka the bell, CSI (0x9b), which is used for ANSI colors as an equivalent of ESC [, and others). Formatting control characters are allowed for the reason given above.
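Phrased as a complement like that, the set is easy to approximate with std alone. A rough sketch (it over-accepts unassigned and private-use scalars, so it is not exactly the category union above):

```rust
/// Rough approximation of "anything except separators and Cc control
/// characters": is_whitespace covers the separator categories (Zs, Zl,
/// Zp) plus the whitespace controls, and is_control covers exactly Cc,
/// which leaves Cf scalars (ZWJ, soft hyphen) accepted, as intended.
fn allowed(c: char) -> bool {
    !c.is_whitespace() && !c.is_control()
}

fn main() {
    assert!(allowed('a'));
    assert!(allowed('$'));
    assert!(allowed('\u{200D}'));  // zero-width joiner (Cf): kept
    assert!(!allowed(' '));        // space (Zs): rejected
    assert!(!allowed('\u{0007}')); // BEL (Cc): rejected
    assert!(!allowed('\u{009B}')); // CSI (Cc): rejected
    println!("ok");
}
```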

This said, I am very grateful for your lexer, and I don't expect you to do anything.

nalply commented 10 months ago

It looks like there's a fourth possibility: just use the built-ins, because some built-ins are the same as the mentioned Unicode categories! For example, $$alphabetic might correspond to category L.
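One caveat worth checking first: if $$alphabetic tracks the Unicode Alphabetic property (which is what Rust's own char::is_alphabetic tests), it is close to category L but slightly wider. A std-only probe:

```rust
fn main() {
    // Alphabetic covers ASCII and non-ASCII letters alike.
    assert!('a'.is_alphabetic());
    assert!('é'.is_alphabetic());
    assert!('日'.is_alphabetic());
    // But Alphabetic is not exactly category L: the Roman numeral Ⅻ
    // (U+216B, category Nl) is Alphabetic too.
    assert!('Ⅻ'.is_alphabetic());
    // Digits in other scripts (here Arabic-Indic ٣, category Nd)
    // are numeric, not alphabetic.
    assert!('٣'.is_numeric());
    assert!(!'٣'.is_alphabetic());
    println!("ok");
}
```

So the built-in may admit a few scalars outside L, which may or may not matter for this token.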

Yay!!