waxeye-org / waxeye

Waxeye is a parser generator based on parsing expression grammars (PEGs). It supports C, Java, JavaScript, Python, Racket, and Ruby.
https://waxeye-org.github.io/waxeye/index.html

Unicode support in C-runtime according to issue #31? #82

Closed hhv closed 6 years ago

hhv commented 6 years ago

Hi,

I tried to get our Waxeye parser working on Docker/Ubuntu and ran into unsupported Unicode characters. To solve that, I tried the syntax from issue #31 ( \x{hhhh} ), using the latest code from GitHub, but I get an error.

The error is reported on this part:
—————————
getal <- +[0-9] ( ?[.,] +[0-9] )
woord <- +[A-Za-z0-9\x{C0}-\x{D6}\x{D9}-\x{FF}]
sp <- +[\t\n\r \x{00}-\x{20}\x{A0}]
lt <- +([(),.:;"-_+] | apostrophe)
apostrophe <- [\x{2018}`'\x{2032}\x{B4}\x{2019}] # \x2032 is "prime"
—————————

and the error is:
—————————
string-append: contract violation
  expected: string?
  given: #<path:/lx/tmp/jetty-0.0.0.0-80-cocoon.war-_-any-1677620253370402457.dir/webapp/linkextractor/links/grammars/document.waxeye>
  argument position: 2nd
  other arguments...:
   "syntax error in grammar "
   "\n"
   "33:21 expected: [hex, char] received: x\nwoord <- +[A-Za-z0-9\x{C0}-\x{D6}...
  context...:
   /lx/waxeye/src/waxeye/load.rkt:55:0: resolve-modular
   /usr/share/racket/collects/racket/list.rkt:563:2: append-map
   /lx/waxeye/src/waxeye/load.rkt:24:0: load-grammar
   /lx/waxeye/src/waxeye/main.rkt:33:0: main
   %mzc:waxeye: [running body]
   loop
—————————

So I have two questions.

Is the issue #31 syntax implemented?

If so, is there an error in my source document?

Kind regards,

Huib Verweij.

glebm commented 6 years ago

Hi Huib,

Issue #31 was a draft, and some of its ideas were implemented differently. The implemented syntax is \u{1F381}. The \t, \n, and \r escapes are not supported.

Waxeye's own grammar is implemented in Waxeye: https://github.com/orlandohill/waxeye/blob/master/grammars/waxeye.waxeye

However, the C-runtime still operates on individual bytes, not Unicode characters, so I'm afraid you'll have to specify UTF-8 bytes instead of Unicode codepoints for the C runtime at the moment.

Currently, only the JavaScript, Python, and Ruby runtimes support Unicode.
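
To make the "UTF-8 bytes instead of Unicode codepoints" point concrete, here is a minimal Python sketch that prints the byte sequence behind a few of the code points from the grammar above. The helper name `utf8_bytes` is made up for this illustration; it only relies on Python's built-in `str.encode`.

```python
def utf8_bytes(code_point):
    """Return the UTF-8 encoding of one code point as two-digit hex strings."""
    return [f"{b:02X}" for b in chr(code_point).encode("utf-8")]


# Code points taken from the grammar in this thread.
for cp in (0xC0, 0xD6, 0xA0, 0x2018, 0x2032):
    print(f"U+{cp:04X} -> {' '.join(utf8_bytes(cp))}")

# Prints:
# U+00C0 -> C3 80
# U+00D6 -> C3 96
# U+00A0 -> C2 A0
# U+2018 -> E2 80 98
# U+2032 -> E2 80 B2
```

Every code point above U+007F becomes two or more bytes, so a byte-oriented runtime cannot match it with a single character-class entry; the grammar has to match the byte sequence instead.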

hhv commented 6 years ago

Hi,

Thanks for your reply. The implemented \u syntax does indeed look much more to the point. It's a pity it isn't implemented in C yet.

Using the individual UTF-8 bytes is what caused the errors in the Linux deployment. Perhaps I can find a way to work around or fix it until proper Unicode support arrives in the C runtime.

Thank you for your time.

Huib Verweij.

glebm commented 6 years ago

You could work around it by transforming the grammar file before passing it to waxeye.
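
A rough sketch of what such a transformation could compute, in Python. It expands code point ranges into byte-level expressions by enumerating each code point, grouping the encodings by their leading bytes, and collapsing consecutive final bytes into ranges. The function names are hypothetical, and the `\<HH>` byte escape is an assumption about the grammar syntax (the error message above mentions a `hex` rule); check it against the Waxeye grammar linked earlier in this thread before relying on it.

```python
from itertools import groupby


def group_by_prefix(ranges):
    """Map inclusive code point ranges to {leading bytes: [(first, last), ...]}.

    Enumerates every code point, so it is only meant for small ranges.
    Surrogates (U+D800-U+DFFF) cannot be encoded and would raise an error.
    """
    encoded = sorted({chr(cp).encode("utf-8")
                      for lo, hi in ranges
                      for cp in range(lo, hi + 1)})
    grouped = {}
    for prefix, group in groupby(encoded, key=lambda enc: enc[:-1]):
        lasts = [enc[-1] for enc in group]
        runs, start, prev = [], lasts[0], lasts[0]
        for value in lasts[1:]:
            if value != prev + 1:        # gap: close the current run
                runs.append((start, prev))
                start = value
            prev = value
        runs.append((start, prev))
        grouped[prefix] = runs
    return grouped


def to_waxeye(grouped, esc="\\<{:02X}>"):
    """Render the groups as a byte-level expression (escape format is assumed)."""
    alts = []
    for prefix, runs in grouped.items():
        lead = ["[" + esc.format(byte) + "]" for byte in prefix]
        tail = "".join(
            esc.format(a) if a == z else esc.format(a) + "-" + esc.format(z)
            for a, z in runs)
        alts.append(" ".join(lead + ["[" + tail + "]"]))
    return alts[0] if len(alts) == 1 else "(" + " | ".join(alts) + ")"


# The non-ASCII part of the "woord" class: \u{C0}-\u{D6} and \u{D9}-\u{FF}
print(to_waxeye(group_by_prefix([(0xC0, 0xD6), (0xD9, 0xFF)])))
# Prints: [\<C3>] [\<80>-\<96>\<99>-\<BF>]
```

A full preprocessor would still need to find the `\u{...}` escapes inside each character class and splice the generated expression back into the rule, which this sketch leaves out.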

By the way, you might find this useful for developing the grammar:

https://glebm.github.io/waxeye/demo.html

getal <- +[0-9] ( ?[.,] +[0-9] )
woord <- +[A-Za-z0-9\u{C0}-\u{D6}\u{D9}-\u{FF}]
sp <- +[\u{09}\u{0A}\u{0D} \u{00}-\u{20}\u{A0}]
lt <- +([(),.:;"-_+] | apostrophe)
apostrophe <- [\u{2018}`'\u{2032}\u{B4}\u{2019}]