skvadrik / re2c

Lexer generator for C, C++, Go and Rust.
https://re2c.org

case sensitivity in unicode #118

Open skvadrik opened 9 years ago

skvadrik commented 9 years ago

For non-ASCII character sets (encodings UTF-8, UTF-16, UTF-32, UCS-2 and EBCDIC), re2c should recognize uppercase/lowercase letter pairs in case-insensitive (single-quoted) strings.

The following example should match both the uppercase and the lowercase letter 'ы' (U+042B and U+044B):

/*!re2c
    '\u044b' {}
*/

However, re2c generates code that matches only the lowercase letter:

 $ re2c -ix 1.re 
/* Generated by re2c 0.14.3 on Mon Aug 10 22:41:51 2015 */

{
        YYCTYPE yych;

        if (YYLIMIT <= YYCURSOR) YYFILL(1);
        yych = *YYCURSOR;
        if (yych == 0x044B) goto yy3;
yy3:
        ++YYCURSOR;
        {}
}
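
For comparison, a case-insensitive match would need to accept both code points. The following is a minimal hand-written sketch, not actual re2c output; the helper name `matches_yeru_caseless` is hypothetical, and it assumes 16-bit code units as in the `-x` (UTF-16) run above.

```c
#include <stdint.h>

/* Hypothetical helper, NOT generated by re2c: returns 1 if the UTF-16
 * code unit is either case of the Cyrillic letter 'ы', 0 otherwise. */
static int matches_yeru_caseless(uint16_t yych)
{
    return yych == 0x042B   /* 'Ы' CYRILLIC CAPITAL LETTER YERU */
        || yych == 0x044B;  /* 'ы' CYRILLIC SMALL LETTER YERU */
}
```

In generated code this would presumably amount to an extra comparison against 0x042B next to the existing check for 0x044B.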

skvadrik commented 9 years ago

Note: we cannot use the standard 'towlower'/'towupper' functions, as they depend on the locale in use. The only standard locales are "C" and "" (the environment's default), and users may not have an appropriate locale installed. We need to do the case mapping manually.
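
A rough idea of what "manually" could mean: a locale-independent mapping hardcoded from the Unicode character data. The sketch below is only an illustration covering ASCII and the basic Cyrillic block; the function names `sketch_toupper` and `sketch_tolower` are hypothetical, and a real implementation would presumably be generated from the full Unicode data files (such as UnicodeData.txt) rather than written by hand.

```c
#include <stdint.h>

/* Illustrative subset only: ASCII and basic Cyrillic (U+0400..U+045F).
 * Does not call towupper/towlower, so the result is locale-independent. */
static uint32_t sketch_toupper(uint32_t c)
{
    if (c >= 0x0061 && c <= 0x007A) return c - 0x20; /* a..z -> A..Z */
    if (c >= 0x0430 && c <= 0x044F) return c - 0x20; /* U+0430..044F -> U+0410..042F */
    if (c >= 0x0450 && c <= 0x045F) return c - 0x50; /* U+0450..045F -> U+0400..040F */
    return c; /* no mapping known to this sketch */
}

static uint32_t sketch_tolower(uint32_t c)
{
    if (c >= 0x0041 && c <= 0x005A) return c + 0x20; /* A..Z -> a..z */
    if (c >= 0x0410 && c <= 0x042F) return c + 0x20; /* U+0410..042F -> U+0430..044F */
    if (c >= 0x0400 && c <= 0x040F) return c + 0x50; /* U+0400..040F -> U+0450..045F */
    return c; /* no mapping known to this sketch */
}
```

With such a mapping, a case-insensitive string could be expanded at generation time into the set { c, sketch_toupper(c), sketch_tolower(c) } for each character, independent of the locale on the machine running re2c.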

sirzooro commented 8 years ago

You could implement a workaround: allow the locale to be specified via a new command-line parameter, and generate code for that locale only, with all character codes hardcoded.

skvadrik commented 8 years ago

A good workaround in case people start complaining about the issue.

Or should we wait some two decades until everyone has a Unicode locale? :D