Unicode character range

scymtym / esrap

Common Lisp packrat parser

https://scymtym.github.io/esrap/

Unicode character range #13

Closed by cage2 3 years ago

cage2 commented 3 years ago

Hi!

First, I want to say that this library is wonderful; I do not want to use any other parser generator now. ;-)

I am trying to parse a sequence that can contain Unicode characters (for an IRI parser), and I wrote a rule like this:

(defrule ucschar (or (character-ranges #\UA0  #\UD7FF))
  (:text t))

but:

(esrap:parse 'ucschar "ì")

fails with:

At

  ì
  ^ (Line 1, Column 0, Position 0)

In context UCSCHAR:

While parsing UCSCHAR. Expected:

     a character in [ ] or [퟿]

I even tried a rule like:

(defrule ucschar (character-ranges #\à #\ò))

but this also fails when trying to parse "ì".

Maybe I am using the library in the wrong way?

Can you please help me?

Thank you very much. C.

scymtym commented 3 years ago

character-ranges has a slightly subtle syntax to allow ranges as well as individual characters to be specified:

(esrap:character-ranges #\UA0 #\UD7FF) denotes the set of characters consisting of exactly #\UA0 and #\UD7FF.
(esrap:character-ranges (#\UA0 #\UD7FF)) denotes the set of characters consisting of the range starting at #\UA0 and ending at #\UD7FF.

This allows specifying multiple ranges as well as individual characters not contained in any range at the same time: (esrap:character-ranges (START₁ END₁) (START₂ END₂) … INDIVIDUAL-CHARACTER₁ INDIVIDUAL-CHARACTER₂ …).
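To make the difference concrete, here is a minimal sketch that defines both forms side by side. It assumes Quicklisp is installed and can load esrap, and that the Lisp reader accepts the `#\UA0` and `#\UD7FF` character names used in this thread (they are implementation-dependent). The rule names `ucschar-range` and `ucschar-pair` are made up for this illustration.

```lisp
;; Assumption: Quicklisp is available and provides esrap.
(ql:quickload :esrap)

;; One range: every character from U+00A0 through U+D7FF.
;; Note the extra pair of parentheses around the two characters.
(esrap:defrule ucschar-range (esrap:character-ranges (#\UA0 #\UD7FF))
  (:text t))

;; Two individual characters: exactly U+00A0 and U+D7FF, nothing in between.
(esrap:defrule ucschar-pair (esrap:character-ranges #\UA0 #\UD7FF)
  (:text t))

;; "ì" is U+00EC. It is built with CODE-CHAR here so the example does not
;; depend on the source file encoding.
(defparameter *i-grave* (string (code-char #xEC)))

;; Inside the range, so this parse should succeed.
(esrap:parse 'ucschar-range *i-grave*)

;; Not one of the two listed characters, so this parse signals an error,
;; which is caught here and turned into a keyword.
(handler-case (esrap:parse 'ucschar-pair *i-grave*)
  (error () :no-match))
```

The second rule is equivalent to the one in the original report, which is why parsing "ì" failed there.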

I hope this solves your concrete problem and also explains the rationale behind the syntax.

cage2 commented 3 years ago

This makes total sense to me, and the modified parser (following your suggestions) now works like a charm!

Moreover, I was actually using the second syntax in other parts of my code, so this issue was (as I suspected) entirely a mistake on my part.

Sorry if I wasted your time with this trivial mistake; sometimes I cannot find errors without talking the issue over with other people.

Thank you for your kind reply! Bye! C.