Closed wwall closed 3 years ago
Hello @wwall,
I'm sorry, but Unicode input in UniCC grammars is currently not possible and would require major changes. To add Cyrillic letters to the regular expression, you have to escape them.
@identifier 'A-Za-z\u0410-\u042F\u0430-\u044F_' 'A-Za-z0-9_\u0410-\u042F\u0430-\u044F'* ;
is parsed correctly by UniCC, and the resulting parser works as expected with UTF-8 input.
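To check which characters the escaped ranges actually cover, here is a minimal Python sketch of the same two character classes. It is only an illustration using Python's `re` module, not UniCC itself, and the helper name `is_identifier` is made up for this example.

```python
import re

# Python equivalent of the two escaped character classes from the grammar line:
# first character: Latin letters, Cyrillic U+0410..U+042F and U+0430..U+044F, underscore
# following characters: the same set plus digits
IDENTIFIER = re.compile(
    "[A-Za-z\u0410-\u042F\u0430-\u044F_]"
    "[A-Za-z0-9_\u0410-\u042F\u0430-\u044F]*"
)


def is_identifier(text: str) -> bool:
    """Hypothetical helper: True if the whole string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None


print(is_identifier("Яita"))   # True
print(is_identifier("имя_1"))  # True
print(is_identifier("1abc"))   # False, identifiers may not start with a digit
print(is_identifier("ёлка"))   # False, see note below
```

Note that the ranges U+0410..U+042F and U+0430..U+044F do not include Ё (U+0401) or ё (U+0451). If those letters are needed, they would presumably have to be added to the character classes as separate escapes.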
A future version 2 of UniCC may include full Unicode support in grammars as well. I think this is necessary to keep up with current developments, but introducing this feature in UniCC 1.x would be too much effort for now.
Thanks for the answer.
Hello @wwall, I recently tried to add Unicode support to UniCC 1.x, but it would be too big a change to the entire existing parser. UniCC v2, whose draft sources you can find in the v2 branch, supports Unicode well, so the definition
@Identifier:= /[A-Za-zА-Яа-я_][A-Za-z0-9_А-Яа-я]*/
@start$ := Identifier
compiles and runs well with the pparse utility and an input test file, parsing successfully:
$ ./pparse test.bnf rita.txt
start
Identifier (Яita)
I will re-open this issue now so that the feature can be targeted for the next major release.
Will close this now. UniCC will be abandoned.
When I write in xpl.par
I get an error on
How can I add Cyrillic letters to the identifier definition?