Closed by hhv 6 years ago
Hi Huib,
Issue #31 was a draft; some of its ideas were implemented differently.
The implemented syntax is \u{1F381}. The escapes \t, \n, and \r are not supported.
Waxeye's own grammar is implemented in Waxeye: https://github.com/orlandohill/waxeye/blob/master/grammars/waxeye.waxeye
However, the C runtime still operates on individual bytes, not Unicode characters, so I'm afraid that, for now, you'll have to specify UTF-8 bytes instead of Unicode code points when targeting the C runtime.
Currently, only the JavaScript, Python, and Ruby runtimes support Unicode.
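To make the bytes-versus-code-points point concrete, here is a minimal Python sketch that prints the UTF-8 byte sequence for a few of the code points mentioned in this thread. The helper name utf8_bytes is hypothetical and not part of Waxeye; how those bytes would then be written in a grammar for the C runtime is not shown here.

```python
def utf8_bytes(code_point):
    """Return the UTF-8 encoding of one code point as uppercase hex byte strings."""
    return [f"{b:02X}" for b in chr(code_point).encode("utf-8")]


# Code points taken from the grammar and syntax examples in this thread.
for cp in (0xC0, 0xA0, 0x2018, 0x1F381):
    print(f"U+{cp:04X} -> {' '.join(utf8_bytes(cp))}")
# U+00C0 -> C3 80
# U+00A0 -> C2 A0
# U+2018 -> E2 80 98
# U+1F381 -> F0 9F 8E 81
```

Every code point above U+007F takes two to four bytes, which is why a single character-class entry such as \u{C0} cannot be treated as one unit by a byte-oriented matcher.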
Hi,
thanks for your reply. The implemented syntax \u looks much more to the point indeed. Pity it isn't implemented in C yet.
Using the individual UTF-8 bytes is what caused the errors in the Linux deployment. Perhaps I can find a way to work around or fix it until proper Unicode support is available in C.
Thank you for your time.
Huib Verweij.
You could work around it by transforming the grammar file before passing it to waxeye.
By the way, you might find this useful for developing the grammar:
https://glebm.github.io/waxeye/demo.html
getal <- +[0-9] ( ?[.,] +[0-9] )
woord <- +[A-Za-z0-9\u{C0}-\u{D6}\u{D9}-\u{FF}]
sp <- +[\u{09}\u{0A}\u{0D} \u{00}-\u{20}\u{A0}]
lt <- +([(),.:;"-_+] | apostrophe)
apostrophe <- [\u{2018}`'“\u{B4}\u{2019}]
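One way to read the "transform the grammar file" suggestion is as a small text preprocessing step. The following Python sketch, under that assumption, rewrites the draft \x{...} escapes and the unsupported \t, \n, \r escapes into the implemented \u{...} form. The function name to_u_escapes is hypothetical and not part of Waxeye.

```python
import re

# Code points (hex) for the simple escapes that the grammar syntax does not accept.
_SIMPLE = {"t": "09", "n": "0A", "r": "0D"}

# Matches an escaped backslash (left untouched), a draft \x{HEX} escape,
# or one of \t, \n, \r.
_ESCAPE = re.compile(r"\\(?:(\\)|x\{([0-9A-Fa-f]+)\}|([tnr]))")


def to_u_escapes(grammar):
    """Rewrite \\x{HEX}, \\t, \\n and \\r to \\u{HEX} throughout the grammar text."""
    def repl(match):
        if match.group(1) is not None:
            return match.group(0)  # "\\" stays as it is
        if match.group(2) is not None:
            return "\\u{" + match.group(2).upper() + "}"
        return "\\u{" + _SIMPLE[match.group(3)] + "}"

    return _ESCAPE.sub(repl, grammar)


print(to_u_escapes(r"sp <- +[\t\n\r \x{00}-\x{20}\x{A0}]"))
# sp <- +[\u{09}\u{0A}\u{0D} \u{00}-\u{20}\u{A0}]
```

The sketch rewrites escapes everywhere in the file, not only inside character classes, so a grammar that uses \n inside quoted literals should be checked by hand afterwards. It also does not convert code points to UTF-8 bytes, so it does not by itself address the C runtime limitation.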
Hi,
I tried to make our Waxeye parser work on Docker/Ubuntu and ran into unsupported Unicode characters. To solve that, I tried the issue #31 syntax ( \x{hhhh} ) with the latest code from GitHub, but I get an error.
The error is reported on this part:
—————————
getal <- +[0-9] ( ?[.,] +[0-9] )
woord <- +[A-Za-z0-9\x{C0}-\x{D6}\x{D9}-\x{FF}]
sp <- +[\t\n\r \x{00}-\x{20}\x{A0}]
lt <- +([(),.:;"-_+] | apostrophe)
apostrophe <- [\x{2018}`'\x{2032}\x{B4}\x{2019}] # \x2032 is "prime"
—————————
and the error is:
—————————
string-append: contract violation
  expected: string?
  given: #<path:/lx/tmp/jetty-0.0.0.0-80-cocoon.war-_-any-1677620253370402457.dir/webapp/linkextractor/links/grammars/document.waxeye>
  argument position: 2nd
  other arguments...:
   "syntax error in grammar "
   "\n"
   "33:21 expected: [hex, char] received: x\nwoord <- +[A-Za-z0-9\x{C0}-\x{D6}...
  context...:
   /lx/waxeye/src/waxeye/load.rkt:55:0: resolve-modular
   /usr/share/racket/collects/racket/list.rkt:563:2: append-map
   /lx/waxeye/src/waxeye/load.rkt:24:0: load-grammar
   /lx/waxeye/src/waxeye/main.rkt:33:0: main
   %mzc:waxeye: [running body]
   loop
—————————
So I have two questions.
Is the issue #31 syntax implemented?
If so, is there an error in my source document?
Kind regards,
Huib Verweij.