phorward / unicc

LALR parser generator targeting C, C++, Python, JavaScript, JSON and XML
MIT License

add unicode support to grammar definition #5

Closed wwall closed 3 years ago

wwall commented 6 years ago

When I write the following in xpl.par:

@identifier         'A-Za-zА-Яа-я_' 'A-Za-z0-9_А-Яа-я'* 

I get this error:

~/src/c/unicc $ ./unicc -x ./examples/xpl.par 
./unicc: error: ./examples/xpl.par(31):
    Parse error: Invalid input '//'

How can I add Cyrillic letters to the identifier definition?

phorward commented 6 years ago

Hello @wwall,

I'm sorry, but Unicode input in grammar files is currently not possible in UniCC, and supporting it would require major changes. To add Cyrillic letters to the regular expression, you have to escape them.

@identifier         'A-Za-z\u0410-\u042F\u0430-\u044F_' 'A-Za-z0-9_\u0410-\u042F\u0430-\u044F'*  ;

is parsed correctly by UniCC, and the resulting parser works as expected with UTF-8 input.
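
For longer character classes, the escapes can be generated instead of typed by hand. The following is a minimal Python sketch, not part of UniCC: `escape_non_ascii` is a hypothetical helper name, and it assumes every character lies in the Basic Multilingual Plane, because the `\uXXXX` form used above carries four hex digits.

```python
def escape_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a \\uXXXX escape.

    Hypothetical helper for preparing UniCC 1.x character classes;
    assumes all characters fit into four hex digits (<= U+FFFF).
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # Plain ASCII is copied unchanged.
            out.append(ch)
        elif cp <= 0xFFFF:
            # Four uppercase hex digits, e.g. U+0410 -> \u0410.
            out.append("\\u{:04X}".format(cp))
        else:
            raise ValueError("character outside the BMP: {!r}".format(ch))
    return "".join(out)


print(escape_non_ascii("A-Za-zА-Яа-я_"))
# prints: A-Za-z\u0410-\u042F\u0430-\u044F_
```

The printed class is the same one used in the definition above.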

A future version 2 of UniCC may contain full Unicode support in grammars as well. I think this is necessary to keep up with current developments, but it is too much work to introduce this feature in UniCC 1.x for now.

wwall commented 6 years ago

Thanks for the answer.

phorward commented 5 years ago

Hello @wwall, I recently tried to get Unicode support into UniCC 1.x, but it would be too big a change to the entire existing parser. UniCC v2, whose draft sources you can find in the v2 branch, supports Unicode well, so the definition

@Identifier:=         /[A-Za-zА-Яа-я_][A-Za-z0-9_А-Яа-я]*/
@start$ := Identifier

compiles and runs well with the pparse utility and an input test file, parsing successfully:

$ ./pparse test.bnf  rita.txt
start
 Identifier (Яita)
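
To see which inputs this character class accepts without building the v2 branch, the same pattern can be tried with Python's `re` module. This is only a sketch of the pattern's behavior under Python's regex semantics, not of UniCC itself; `is_identifier` is a hypothetical name.

```python
import re

# Same character classes as in the grammar definition above.
IDENTIFIER = re.compile(r"[A-Za-zА-Яа-я_][A-Za-z0-9_А-Яа-я]*")


def is_identifier(text: str) -> bool:
    """Return True if the whole string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None


print(is_identifier("Яita"))    # True
print(is_identifier("9lives"))  # False: must not start with a digit
print(is_identifier("ёж"))      # False: ё is U+0451, outside а-я (U+0430-U+044F)
```

Note that the ranges А-Я and а-я do not include Ё (U+0401) and ё (U+0451); they would have to be listed separately if needed.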

I will re-open this issue now so that the feature is targeted at the next major release.

phorward commented 3 years ago

Will close this now. UniCC will be abandoned.