Unicode character classes

terpstra commented 5 years ago

Firstly, thanks a lot for this tool. It saved me a lot of time! I am using re2c to create a parser for an as-yet unpublished build tool. The input files are utf-8 encoded. Everything works fine for the ascii character set.

However, I'd like to expand my identifier space to include/allow unicode letters in addition to [a-zA-Z]. Currently the only way to do this that I can see is to write a parser for UnicodeData.txt that grabs all of the letter category code points and dumps them into a giant character class. That's fine, but now I have a generator for a generator for C++. It seems like this sort of Unicode character class functionality would be more naturally supported directly in re2c itself.
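A minimal sketch of the generator approach described above, assuming the standard semicolon-separated layout of UnicodeData.txt (code point, name, general category, ...). The function names parse_record, format_cp and emit_class are hypothetical and not part of re2c; the output uses the \xHH, \uHHHH and \UHHHHHHHH escapes that re2c accepts in character classes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse one UnicodeData.txt record: "<hex code point>;<name>;<category>;...".
   Sets *last when the name closes a "<..., First>" / "<..., Last>" pair.
   Returns 1 on success, 0 on a malformed line. */
int parse_record(const char *line, unsigned long *cp, char cat[3], int *last)
{
    char *end;
    const char *name, *sep;

    *cp = strtoul(line, &end, 16);
    if (end == line || *end != ';') return 0;
    name = end + 1;
    sep = strchr(name, ';');
    if (!sep || !sep[1] || !sep[2] || sep[3] != ';') return 0;
    cat[0] = sep[1];
    cat[1] = sep[2];
    cat[2] = '\0';
    *last = (sep - name >= 7) && memcmp(sep - 7, ", Last>", 7) == 0;
    return 1;
}

/* Write one code point as an re2c escape: \xHH, \uHHHH or \UHHHHHHHH. */
int format_cp(char *buf, size_t size, unsigned long cp)
{
    if (cp <= 0xff)   return snprintf(buf, size, "\\x%02lx", cp);
    if (cp <= 0xffff) return snprintf(buf, size, "\\u%04lx", cp);
    return snprintf(buf, size, "\\U%08lx", cp);
}

/* Print one "lo-hi" range of a character class. */
void flush_range(FILE *out, unsigned long lo, unsigned long hi)
{
    char a[16], b[16];
    format_cp(a, sizeof a, lo);
    format_cp(b, sizeof b, hi);
    fprintf(out, "%s-%s", a, b);
}

/* Read UnicodeData.txt from `in` and print "PREFIX = [ranges];" for every
   code point whose category starts with `prefix` ("L" covers Lu, Ll, ...). */
void emit_class(FILE *in, FILE *out, const char *prefix)
{
    char line[512], cat[3];
    unsigned long cp, lo = 0, hi = 0;
    int last, open = 0;

    fprintf(out, "%s = [", prefix);
    while (fgets(line, sizeof line, in)) {
        if (!parse_record(line, &cp, cat, &last)) continue;
        int match = strncmp(cat, prefix, strlen(prefix)) == 0;
        if (match && open && (cp == hi + 1 || last)) {
            hi = cp;                  /* extend the current range */
            continue;
        }
        if (open) {                   /* the current range has ended */
            flush_range(out, lo, hi);
            open = 0;
        }
        if (match) {                  /* start a new range */
            lo = hi = cp;
            open = 1;
        }
    }
    if (open) flush_range(out, lo, hi);
    fprintf(out, "];\n");
}
```

Calling emit_class(f, stdout, "L") on an opened UnicodeData.txt would print a single definition for all letter categories; passing "Lu" instead restricts it to upper-case letters.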

I was somewhat surprised this was not already supported, so I went looking for these classes in re2c and could not find them. Apologies if this is already supported and my grep-powers were insufficient.

Thanks!

skvadrik commented 5 years ago

Hi @terpstra , your grep was correct: re2c doesn't support syntactic aliases for Unicode character classes yet. There is no technical reason it can't do that, but you are the first to ask.

As a temporary quick workaround, I can generate and distribute together with the re2c source code an "official" file with re2c definitions of Unicode categories: unicode_categories.re.txt. It is meant to be included verbatim into your .re files; the name L can then be used in subsequent re2c blocks to denote Unicode letters. The definitions are generated by the same scripts that generate re2c tests, so they are consistent with what re2c is able to handle at the moment. The generator doesn't use UnicodeData.txt directly (though it should); it uses the Haskell Data.CharSet library.

terpstra commented 5 years ago

Thanks a lot for this! Does re2c support some form of 'include'? Dumping tables this large into a source file whose main focus is parsing distracts the reader.

Ultimately, I think users will want all the classes and subclasses in Unicode. For example, also the Lu class for upper-case letters / etc. Do you think this is a good candidate for future inclusion?

skvadrik commented 5 years ago

Does re2c support some form of 'include'?

No, but it would be useful. An initial implementation might only allow including files from the current directory (the one re2c is run from); otherwise we'd also need to support include paths.

Ultimately, I think users will want all the classes and subclasses in Unicode.

Agreed.

Do you think this is a good candidate for future inclusion?

Yes. Don't close the issue. :)

terpstra commented 5 years ago

I've noticed that "L \ Lu" in re2c v1.1.1 reports: re2c: error: line 359, column 12: can only difference char sets

It seems that the inclusion of any value above 0x80 in a character class renders it no longer a character class.

skvadrik commented 5 years ago

@terpstra I opened #236: this is a known limitation, but worth a separate issue.

skvadrik commented 5 years ago

@terpstra Meanwhile, re2c learnt to handle include files https://github.com/skvadrik/re2c/commit/b94c5af9a2d150c9421ca3148baa3a625ecce682:

/*!include:re2c "x.re" */ works the same way as #include "x.re" in C/C++, as if x.re were pasted verbatim in place of the directive.
The -I <path> option specifies search paths for included files. The default search path is the directory of the source file, e.g. if you run re2c x/y/z.re, the default include path will be x/y/.
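The default lookup rule can be illustrated with a small sketch. This is not re2c's implementation, only a model of the rule stated above; the helper names default_include_dir and resolve_include are hypothetical, and returning "./" for a source path without a directory part is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the directory part of `source` (up to and including the last '/')
   into `dir`. Assumption: a path without '/' maps to "./". */
void default_include_dir(const char *source, char *dir, size_t size)
{
    const char *slash = strrchr(source, '/');
    if (!slash) {
        snprintf(dir, size, "./");
        return;
    }
    snprintf(dir, size, "%.*s", (int)(slash - source + 1), source);
}

/* Build the path where an included file would be looked up by default. */
void resolve_include(const char *source, const char *included,
                     char *out, size_t size)
{
    char dir[256];
    default_include_dir(source, dir, sizeof dir);
    snprintf(out, size, "%s%s", dir, included);
}
```

Under this model, a directive /*!include:re2c "unicode_categories.re" */ inside x/y/z.re would be looked up as x/y/unicode_categories.re unless -I adds other search paths.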

terpstra commented 5 years ago

Nice!

Do you plan to put unicode_categories.re somewhere in the include path? For now I'm just copy-pasting it into my own symbol.re as you suggested.

skvadrik commented 5 years ago

For now I think the best option is to copy unicode_categories.re in your source tree and then put /*!include:re2c "path/unicode_categories.re" */ in your .re file. If unicode_categories.re gets updated, at least you won't have to modify the including .re file and glue it together from pieces.

Perhaps later re2c will install these definition files in some default locations, or at least default relative to re2c root directory, and we'll have a "standard library" of useful regular expressions.

fletcher commented 5 years ago

FYI -- this precompiled set of unicode definitions is fantastic -- I needed to add support for unicode strings to a project I started today, and found this. Made short work of an otherwise complicated problem. Thanks!

(PS-- Thanks for asking about this Brett!)

(PPS -- It goes without saying, but also to second Brett's thanks for re2c. I've been using it for a few years now and am always impressed with how easy it is to use!)

skvadrik commented 5 years ago

Glad to hear that it works for you! I wish the original re2c author and long-time contributors like @dnuffer read the above comment.

terpstra commented 5 years ago

Who is Brett? From the context, it sounds like you meant me.

fletcher commented 5 years ago

@terpstra My apologies, you are right. I saw terpstra and immediately thought of Brett Terpstra since our software projects intersect at times. But that isn't you, so while my comment stands in its intent (I appreciate your asking about this!) it doesn't mean quite as much since we've never met and your name is not Brett.....

Move along.... Nothing to see here.... Just another person making an idiot of themselves on the internet... ;)

mingodad commented 4 years ago

Can someone give an example of a character class that handles Unicode?

Here is what I have now. I want to allow IDENTIFIER to contain Unicode (UTF-8) characters and WS to contain Unicode white space.

I can see that there is now (1.3) an include "unicode_categories.re" but no example of usage and it's not clear to me how to use it.

/*!re2c
  //re2c:flags:utf-8 = 1;
  re2c:yyfill:enable = 0;

  D        = [0-9] ;
  E        = [Ee] [+-]? D+ ;
  L        = [a-zA-Z_] ;

  INTSUFFIX   = ( "LL" | "ULL" | "ll" | "ull") ;

  INTNUMBER   = ( D+ ) INTSUFFIX? ;
  FLOATNUMBER   = ( D+ | D* "." D+ | D+ "." D* ) E? ;
  CPLXNUMBER   = ( D+ "." D+ ) "i" ;

  HEX_P    = [Pp] [+-]? D+ ;
  HEXNUM = ('0' [xX] [0-9a-fA-F]+) (HEX_P | INTSUFFIX)? ;

  WS       = [ \t\r\v\f] ;
  LF       = [\n] ;
  END      = [\000] ;
  ANY      = [\000-\377] \ END ;

  ESC      = [\\] ;
  SQ       = ['] ;
  DQ       = ["] ;

  STRING1  = SQ ( ANY \ SQ \ ESC | ESC ANY )* SQ ;
  STRING2  = DQ ( ANY \ DQ \ ESC | ESC ANY )* DQ ;

  IDENTIFIER = L ( L | D )* ;

*/

skvadrik commented 4 years ago

You are right, the documentation is lacking. Here is a working example:

#include <assert.h>
#include <stdio.h>

int lex(const char *YYCURSOR)
{
    const char *YYMARKER, *s = YYCURSOR;
    /*!include:re2c "re2c-1.3/include/unicode_categories.re" */
    /*!re2c

    re2c:define:YYCTYPE = 'unsigned char';
    re2c:flags:utf-8 = 1;
    re2c:yyfill:enable = 0;

    D = [0-9] ;
    E = [Ee] [+-]? D+ ;

    INTSUFFIX   = ( "LL" | "ULL" | "ll" | "ull") ;

    INTNUMBER   = ( D+ ) INTSUFFIX? ;
    FLOATNUMBER = ( D+ | D* "." D+ | D+ "." D* ) E? ;
    CPLXNUMBER  = ( D+ "." D+ ) "i" ;

    HEX_P       = [Pp] [+-]? D+ ;
    HEXNUMBER   = ('0' [xX] [0-9a-fA-F]+) (HEX_P | INTSUFFIX)? ;

    WS       = [ \t\r\v\f] ;
    LF       = [\n] ;
    END      = [\000] ;
    ANY      = [^] \ END ;

    ESC      = [\\] ;
    SQ       = ['] ;
    DQ       = ["] ;

    STRING1  = SQ ( ANY \ SQ \ ESC | ESC ANY )* SQ ;
    STRING2  = DQ ( ANY \ DQ \ ESC | ESC ANY )* DQ ;

    IDENTIFIER = L ( L | D )* ;

    "ХЫ!"       { printf("special:    %.*s\n", (int)(YYCURSOR - s), s); return 0; }
    IDENTIFIER  { printf("identifier: %.*s\n", (int)(YYCURSOR - s), s); return 1; }
    STRING1     { printf("string-1:   %.*s\n", (int)(YYCURSOR - s), s); return 2; }
    STRING2     { printf("string-2:   %.*s\n", (int)(YYCURSOR - s), s); return 3; }
    HEXNUMBER   { printf("hex:        %.*s\n", (int)(YYCURSOR - s), s); return 4; }
    INTNUMBER   { printf("integer:    %.*s\n", (int)(YYCURSOR - s), s); return 5; }
    FLOATNUMBER { printf("floating:   %.*s\n", (int)(YYCURSOR - s), s); return 6; }
    CPLXNUMBER  { printf("complex:    %.*s\n", (int)(YYCURSOR - s), s); return 7; }
    *           { printf("error\n"); return -1; }

    */
}

int main()
{
    assert(lex("ХЫ!") == 0);
    assert(lex("хыхы") == 1);
    assert(lex("'хыхы'") == 2);
    assert(lex("\"хыхы\"") == 3);
    assert(lex("0x3ff") == 4);
    assert(lex("123") == 5);
    assert(lex("123.45e-6") == 6);
    assert(lex("123.45i") == 7);
    return 0;
}

I assumed that unicode_categories.re is in a subdirectory re2c-1.3/include, but it may be in a different place depending on your system and re2c installation (you can always use -I). Build:

$ re2c unicode_example.re -W \
     --input-encoding utf8 \
     -ounicode_example.c \
   && cc unicode_example.c -ounicode_example

Here --input-encoding utf8 is only needed if you plan to use Unicode literals like ХЫ! in this example (it's a feature orthogonal to unicode_categories.re). Output:

$ ./unicode_example
special:    ХЫ!
identifier: хыхы
string-1:   'хыхы'
string-2:   "хыхы"
hex:        0x3ff
integer:    123
floating:   123.45e-6
complex:    123.45i
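With re2c:flags:utf-8 = 1 and YYCTYPE set to unsigned char, the generated lexer consumes the input one UTF-8 code unit (byte) at a time, so a single letter from L can span several bytes. The sketch below is independent of re2c and only illustrates that byte layout; utf8_seq_len and utf8_count are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes in the UTF-8 sequence that starts with `lead`,
   or 0 if `lead` cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation or invalid byte */
}

/* Count code points in a NUL-terminated UTF-8 string.
   Returns (size_t)-1 on a malformed or truncated sequence. */
size_t utf8_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p) {
        size_t len = utf8_seq_len(*p);
        if (len == 0) return (size_t)-1;
        /* Every following byte must look like 10xxxxxx. */
        for (size_t i = 1; i < len; ++i)
            if ((p[i] & 0xC0) != 0x80) return (size_t)-1;
        p += len;
        ++n;
    }
    return n;
}
```

For instance, the identifier хыхы from the example is four code points stored in eight bytes, while 0x3ff is five code points in five bytes.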

mingodad commented 4 years ago

Thank you for the example! Here is the same program, slightly modified to handle Unicode white space and underscores in identifiers:

#include <assert.h>
#include <stdio.h>

enum {
    TK_LEXERROR=-1,
    TK_SPECIAL,
    TK_WS,
    TK_IDENT,
    TK_STR_SQ,
    TK_STR_DQ,
    TK_HEXNUM,
    TK_INTNUM,
    TK_FLOATNUM,
    TK_COMPLEXNUM,
};

int lex(const char *YYCURSOR)
{
    const char *YYMARKER, *s = YYCURSOR;
    /*!include:re2c "re2c-1.3/include/unicode_categories.re" */
    /*!re2c

    re2c:define:YYCTYPE = 'unsigned char';
    re2c:flags:utf-8 = 1;
    re2c:yyfill:enable = 0;

    D = [0-9] ;
    E = [Ee] [+-]? D+ ;

    INTSUFFIX   = ( "LL" | "ULL" | "ll" | "ull") ;

    INTNUMBER   = ( D+ ) INTSUFFIX? ;
    FLOATNUMBER = ( D+ | D* "." D+ | D+ "." D* ) E? ;
    CPLXNUMBER  = ( D+ "." D+ ) "i" ;

    HEX_P       = [Pp] [+-]? D+ ;
    HEXNUMBER   = ('0' [xX] [0-9a-fA-F]+) (HEX_P | INTSUFFIX)? ;

    WS       = ([ \t\r\v\f] | Zs | Zp);
    LF       = [\n] ;
    END      = [\000] ;
    ANY      = [^] \ END ;

    ESC      = [\\] ;
    SQ       = ['] ;
    DQ       = ["] ;

    STRING1  = SQ ( ANY \ SQ \ ESC | ESC ANY )* SQ ;
    STRING2  = DQ ( ANY \ DQ \ ESC | ESC ANY )* DQ ;

    IDENTIFIER = ('_' | L) ( '_' | L | D )* ;

    "ХЫ!"       { printf("special:    %.*s\n", (int)(YYCURSOR - s), s); return TK_SPECIAL; }
    WS          { printf("white space: >%.*s<\n", (int)(YYCURSOR - s), s); return TK_WS; }
    IDENTIFIER  { printf("identifier: %.*s\n", (int)(YYCURSOR - s), s); return TK_IDENT; }
    STRING1     { printf("string-1:   %.*s\n", (int)(YYCURSOR - s), s); return TK_STR_SQ; }
    STRING2     { printf("string-2:   %.*s\n", (int)(YYCURSOR - s), s); return TK_STR_DQ; }
    HEXNUMBER   { printf("hex:        %.*s\n", (int)(YYCURSOR - s), s); return TK_HEXNUM; }
    INTNUMBER   { printf("integer:    %.*s\n", (int)(YYCURSOR - s), s); return TK_INTNUM; }
    FLOATNUMBER { printf("floating:   %.*s\n", (int)(YYCURSOR - s), s); return TK_FLOATNUM; }
    CPLXNUMBER  { printf("complex:    %.*s\n", (int)(YYCURSOR - s), s); return TK_COMPLEXNUM; }
    *           { printf("error\n"); return TK_LEXERROR; }

    */
}

int main()
{
    assert(lex("ХЫ!") == TK_SPECIAL);
    assert(lex("хыхы") == TK_IDENT);
    assert(lex("見る") == TK_IDENT);
    assert(lex("_見_る") == TK_IDENT);
    assert(lex("見_る") == TK_IDENT);
    assert(lex("_見_る_") == TK_IDENT);
    assert(lex(" ") == TK_WS);
    assert(lex("    ") == TK_WS);
    assert(lex("\r") == TK_WS);
    assert(lex("\v") == TK_WS);
    assert(lex("\f") == TK_WS);
    assert(lex("　") == TK_WS);
    assert(lex("'хыхы'") == TK_STR_SQ);
    assert(lex("'見る'") == TK_STR_SQ);
    assert(lex("\"хыхы\"") == TK_STR_DQ);
    assert(lex("\"見る\"") == TK_STR_DQ);
    assert(lex("0x3ff") == TK_HEXNUM);
    assert(lex("123") == TK_INTNUM);
    assert(lex("123.45e-6") == TK_FLOATNUM);
    assert(lex("123.45i") == TK_COMPLEXNUM);
    return 0;
}

Also, looking at https://www.fileformat.info/info/unicode/category/index.htm I could see the descriptions of the character classes, and looking at unicode_categories.re I could see that there are literal repetitions of several characters, for example:

Z = [\x20-\x20\xa0-\xa0\u1680-\u1680\u2000-\u200a\u2028-\u2029\u202f-\u202f\u205f-\u205f\u3000-\u3000];
Zs = [\x20-\x20\xa0-\xa0\u1680-\u1680\u2000-\u200a\u202f-\u202f\u205f-\u205f\u3000-\u3000];
Zl = [\u2028-\u2028];
Zp = [\u2029-\u2029];

Is there any disadvantage in using something like the rewrite below?

/*Separator, Space*/
Zs = [\x20-\x20\xa0-\xa0\u1680-\u1680\u2000-\u200a\u202f-\u202f\u205f-\u205f\u3000-\u3000];
/*Separator, Line*/
Zl = [\u2028-\u2028];
/*Separator, Paragraph*/
Zp = [\u2029-\u2029];
/*Separators*/
Z = (Zs | Zl | Zp) ;

Cheers !

skvadrik commented 4 years ago

@mingodad Thanks for the extended program!

Is there any disadvantage in using something like the rewrite below?

No, absolutely not, and I would write it that way if I wrote it by hand. As it happens though, the file is autogenerated by a script https://github.com/skvadrik/re2c/blob/master/test/encodings/unicode_groups.hs#L149.

The script can be fixed to generate shorter output. That would probably not affect the time spent by re2c on compilation by much (the bottleneck is usually large size of the DFA caused by the complexity of Unicode character classes). It certainly shouldn't affect the generated DFA.
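The two spellings should denote the same set of code points, which is why the generated DFA is not expected to change. A minimal sketch that checks this for the separator classes, using the ranges quoted earlier in this thread (the struct and function names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

struct range { unsigned long lo, hi; };

/* Ranges copied from the definitions quoted in this thread. */
static const struct range Z[] = {
    {0x20, 0x20}, {0xa0, 0xa0}, {0x1680, 0x1680}, {0x2000, 0x200a},
    {0x2028, 0x2029}, {0x202f, 0x202f}, {0x205f, 0x205f}, {0x3000, 0x3000}
};
static const struct range Zs[] = {
    {0x20, 0x20}, {0xa0, 0xa0}, {0x1680, 0x1680}, {0x2000, 0x200a},
    {0x202f, 0x202f}, {0x205f, 0x205f}, {0x3000, 0x3000}
};
static const struct range Zl[] = { {0x2028, 0x2028} };
static const struct range Zp[] = { {0x2029, 0x2029} };

#define COUNT(a) (sizeof (a) / sizeof (a)[0])

/* Returns 1 if `cp` falls inside any of the `n` ranges. */
int in_class(const struct range *r, size_t n, unsigned long cp)
{
    for (size_t i = 0; i < n; ++i)
        if (cp >= r[i].lo && cp <= r[i].hi) return 1;
    return 0;
}

/* Returns 1 if Z and (Zs | Zl | Zp) agree on every Unicode code point. */
int z_equals_union(void)
{
    for (unsigned long cp = 0; cp <= 0x10FFFF; ++cp) {
        int direct = in_class(Z, COUNT(Z), cp);
        int merged = in_class(Zs, COUNT(Zs), cp)
                  || in_class(Zl, COUNT(Zl), cp)
                  || in_class(Zp, COUNT(Zp), cp);
        if (direct != merged) return 0;
    }
    return 1;
}
```

The same kind of check could be run over the other top-level categories before switching the generator to the shorter form.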

NickStrupat commented 4 years ago

Hi folks,

Just an FYI, I wrote a small C++ program to generate the Unicode 13.0 category definitions for re2c.

https://github.com/NickStrupat/re2c-unicode-categories

Thank you for your hard work building and maintaining re2c!

skvadrik commented 4 years ago

@NickStrupat Awesome, thank you! Do you mind if I add your repo as a submodule and update include/unicode_categories.txt with the output of your program?

NickStrupat commented 4 years ago

Don't mind at all :)

I'm just giving it a test now, so maybe hold off until I make sure it's all working.

skvadrik commented 4 years ago

Sure, just give me a shout when you are done.

skvadrik commented 4 years ago

Hi @NickStrupat, any update on this? Did you have the time to test your program?

NickStrupat commented 4 years ago

Not definitively. I think it works, but I'm not sure how to test it well, given my current time allowance.

skvadrik commented 4 years ago

Ok, thanks.

skvadrik commented 5 days ago

We should probably parse https://unicode.org/Public/UCD/latest/ucd/UnicodeData.txt directly rather than rely on other language libraries.

skvadrik / re2c

Unicode character classes #235