Matching surrogate pair characters

picoe / Eto.Parse

Recursive descent LL(k) parser for .NET with Fluent API, BNF, EBNF and Gold Grammars

Matching surrogate pair characters #10

Closed: tpluscode closed this issue 10 years ago

tpluscode commented 10 years ago

To fully replicate the property path grammar I partially described in issue #9, I need a parser that matches characters outside the Basic Multilingual Plane, i.e. code points above U+FFFF. The original rule contains the range

[#x10000-#xEFFFF]

Unfortunately, a .NET char is a 16-bit UTF-16 code unit, so char constants cannot exceed 65535 (0xFFFF). Does Eto.Parse support matching such characters?

tpluscode commented 10 years ago

Okay, I'm answering my own question here.

Given that U+10000 and U+EFFFF are encoded in UTF-16 as the surrogate pairs D800 DC00 and DB7F DFFF respectively, I figure it's possible to match the high and the low surrogate as two separate ranges in sequence.

new CharRangeTerminal('\xD800', '\xDB7F') & new CharRangeTerminal('\xDC00', '\xDFFF')

Any reason why this should be a bad idea?
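
The surrogate values quoted above can be checked with a few lines of C#. This is only a sketch and not part of Eto.Parse: the class and method names (`SurrogateBounds`, `ToSurrogates`) are hypothetical, and the only framework call used is `char.ConvertFromUtf32`.

```csharp
using System;

// Hypothetical helper, not part of Eto.Parse.
public static class SurrogateBounds
{
    // Splits a supplementary code point (U+10000..U+10FFFF) into the
    // UTF-16 high and low surrogate that encode it.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        // ConvertFromUtf32 returns a one-char string for BMP code points
        // and a two-char surrogate pair for supplementary ones.
        string utf16 = char.ConvertFromUtf32(codePoint);
        if (utf16.Length != 2)
            throw new ArgumentOutOfRangeException(
                nameof(codePoint), "Expected a code point above U+FFFF.");
        return (utf16[0], utf16[1]);
    }
}

// SurrogateBounds.ToSurrogates(0x10000) yields ('\uD800', '\uDC00')
// SurrogateBounds.ToSurrogates(0xEFFFF) yields ('\uDB7F', '\uDFFF')
```

Because U+10000 and U+EFFFF both sit on a 1024-code-point boundary (the first and last low surrogate respectively), every pair whose high surrogate lies in D800..DB7F and whose low surrogate lies in DC00..DFFF falls inside the original range, so the two-range sequence matches exactly that range.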

cwensley commented 10 years ago

Eto.Parse doesn't directly support this, no. Your approach isn't a bad one, though it would be more efficient (and easier to use) to create a new parser class that knows how to read high Unicode characters.
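
As a rough illustration of what such a parser would have to do at its core, here is a standalone sketch of matching one code point against a range. It deliberately does not use or imitate the Eto.Parse API; `CodePointRange` and `Match` are hypothetical names, and how this logic would be wired into an Eto.Parse parser class is not shown.

```csharp
using System;

// Hypothetical standalone matcher, independent of Eto.Parse.
public static class CodePointRange
{
    // Returns the number of UTF-16 code units consumed (1 or 2) when the
    // code point starting at `index` lies within [first, last]; otherwise 0.
    public static int Match(string input, int index, int first, int last)
    {
        if (input == null || index < 0 || index >= input.Length)
            return 0;

        char c = input[index];
        int codePoint;
        int length;

        if (char.IsHighSurrogate(c)
            && index + 1 < input.Length
            && char.IsLowSurrogate(input[index + 1]))
        {
            // A well-formed surrogate pair: decode it to a single code point.
            codePoint = char.ConvertToUtf32(c, input[index + 1]);
            length = 2;
        }
        else if (char.IsSurrogate(c))
        {
            // An unpaired surrogate is not a valid character; no match.
            return 0;
        }
        else
        {
            // An ordinary BMP character occupies one code unit.
            codePoint = c;
            length = 1;
        }

        return codePoint >= first && codePoint <= last ? length : 0;
    }
}
```

Working on decoded code points means a range such as U+10000 to U+EFFFF can be written directly, without splitting it into surrogate ranges by hand.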

tpluscode commented 10 years ago

I opened pull request #11. Please follow my progress there.