picoe / Eto.Parse

Recursive descent LL(k) parser for .NET with Fluent API, BNF, EBNF and Gold Grammars
MIT License
148 stars 30 forks source link

Parsers for surrogate pair UTF characters #11

Closed tpluscode closed 10 years ago

tpluscode commented 10 years ago

This is a work in progress. When done it will implement matching UTF-16 characters, which currently don't have a specific parser.

To-Do:

cwensley commented 10 years ago

Excellent start! Should the GetValue implementation return an int or a string? Or could that be configurable?

tpluscode commented 10 years ago

I think it should return an int. A pair of chars would be inconvenient (Tuple<,> ?) and most string representations aren't printable anyway I guess.

tpluscode commented 10 years ago

Hm, I noticed that CharTerminal has its dedicated RepeatCharTerminal. I'm obviously closely imitating the CharTerminal classes and thus I'm wondering whether I should also add a specialized repeat parser...

cwensley commented 10 years ago

The RepeatCharTerminal is used for more performance vs. using many alternate CharTerminals inside a RepeatParser.. I don't think that it will be necessary for the surrogate pair characters, as it isn't as if an entire document will be consisted of such characters.

Ideally, one would be able to test for surrogate characters using the existing RepeatCharParser instead.. though I'm not sure of the performance impact there..

tpluscode commented 10 years ago

Using existing char parsers for surrogate pairs may not be a good idea, because they are quite different. Each one is actually two chars, each from a specific range and must come in correct order.

Or maybe did I misunderstand you?

cwensley commented 10 years ago

I agree that we shouldn't change the existing char parsers. The RepeatCharTerminal uses RepeatCharItem with a test function and doesn't use CharTerminal at all. I was suggesting that only the RepeatCharTerminal could be enhanced to deal with surrogate pairs. I don't know at this point whether that would be a good idea or not, so if you think it's best to leave it or create a RepeatSurrogateCharTerminal I'm good with either way.

tpluscode commented 10 years ago

I see. It would be possible. However this probably is esoteric enough that no one will ever notice that the generic repeat parser isn't as performant.

cwensley commented 10 years ago

Yeah, I believe you are right. (;

tpluscode commented 10 years ago

I guess I'm pretty much done. Please review the code

tpluscode commented 10 years ago

Ah yes, I know. I wanted to overload that GetValue methods and Description property

cwensley commented 10 years ago

Wow! looks awesome! I'm still in the process of setting up the CI builds, but it seems that two of the new tests aren't passing? Getting "Unable to cast object of type 'System.Int32' to type 'System.String'." This might be the CI being weird though..

tpluscode commented 10 years ago

Oops, probably not. I hastily overrided the GetValue method probably broke the casting in tests.

cwensley commented 10 years ago

Awesome, thanks for the contribution! Are you okay with contributing under MIT, and assigning copyright?

tpluscode commented 10 years ago

Sure, no problem there

tpluscode commented 9 years ago

Hi. I see that this pull reuqest hasn't been released. Any chance for a NuGet update?