mazira / rtf-stream-parser

Contains native Node classes for transforming an RTF byte stream into tokens, and de-encapsulating HTML
MIT License

currently ignores escaped characters. #3

Closed JaredCE closed 5 years ago

JaredCE commented 6 years ago

Playing around with this library further, it seems that escaped backslashes (\) and curly braces ({ and }) aren't being handled correctly: they don't come out as text, but as either control or group types.

TheElementalOfDestruction commented 6 years ago

Do they not appear in the output at all?

rossj commented 6 years ago

At the tokenization level, escaped backslashes and curly braces are treated like any other control symbol or word, so \{, \par, \ldblquote, and \someRandomThing are all emitted as type CONTROL, which technically they are. It's up to the higher-level stream (DeEncapsulate, for example) to interpret these special control symbols / words as text.
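
To illustrate the syntax-only behaviour described above, here is a minimal standalone sketch of such a tokenizer. It is not this library's code: the `SketchToken` shape and the `tokenizeSketch` function are hypothetical names made up for the example, and it deliberately skips details such as `\'hh` hex escapes, binary data, and newline handling.

```typescript
// Hypothetical token shape for illustration only; not the library's actual types.
type SketchToken =
  | { type: "GROUP_START" }
  | { type: "GROUP_END" }
  | { type: "CONTROL"; word: string; param?: number }
  | { type: "TEXT"; data: string };

// Syntax-only tokenizer: it knows what a control word or symbol looks like,
// but nothing about what any particular one means.
function tokenizeSketch(rtf: string): SketchToken[] {
  const tokens: SketchToken[] = [];
  let i = 0;
  while (i < rtf.length) {
    const ch = rtf[i];
    if (ch === "{") {
      tokens.push({ type: "GROUP_START" });
      i += 1;
    } else if (ch === "}") {
      tokens.push({ type: "GROUP_END" });
      i += 1;
    } else if (ch === "\\") {
      if (i + 1 >= rtf.length) break; // dangling backslash at end of input
      // Control word: letters, optional numeric parameter, optional one-space delimiter.
      const m = /^([a-zA-Z]+)(-?\d+)? ?/.exec(rtf.slice(i + 1));
      if (m) {
        tokens.push(
          m[2] !== undefined
            ? { type: "CONTROL", word: m[1], param: Number(m[2]) }
            : { type: "CONTROL", word: m[1] }
        );
        i += 1 + m[0].length;
      } else {
        // Control symbol: backslash plus one non-letter, e.g. \{ \} \\
        tokens.push({ type: "CONTROL", word: rtf[i + 1] });
        i += 2;
      }
    } else {
      // Plain text runs until the next brace or backslash.
      let j = i;
      while (j < rtf.length && rtf[j] !== "{" && rtf[j] !== "}" && rtf[j] !== "\\") {
        j += 1;
      }
      tokens.push({ type: "TEXT", data: rtf.slice(i, j) });
      i = j;
    }
  }
  return tokens;
}
```

With this sketch, the input `{\b a\{b\par}` yields the escaped brace as a CONTROL token whose word is `{`, sitting between the TEXT tokens `a` and `b`, which is the behaviour reported in this issue.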

It would be possible for the tokenizer to emit these special control words as text, but then it would need knowledge about the meanings / semantics of specific control words. I wanted to keep the tokenizer focused just on the syntax, so I left this semantic interpretation as the responsibility of a higher level.

I'm willing to listen to any arguments against this, however. If you're using just the tokenizer and not the de-encapsulator, it may be a bit too low-level... perhaps there is room for another layer in between that does more semantic interpretation but isn't focused solely on de-encapsulation.
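
One possible shape for such an in-between layer is sketched below, purely as an assumption about how it could work and not as anything the library provides. The `LayerToken` type and `interpretEscapes` function are hypothetical; the only semantic knowledge the layer has is that the control symbols `\\`, `\{`, and `\}` stand for literal characters.

```typescript
// Hypothetical token shape for illustration only; not the library's actual types.
type LayerToken =
  | { type: "GROUP_START" }
  | { type: "GROUP_END" }
  | { type: "CONTROL"; word: string; param?: number }
  | { type: "TEXT"; data: string };

// Control symbols that represent a literal character in the document text.
const ESCAPED_LITERALS = new Set(["\\", "{", "}"]);

// Rewrites escaped-literal CONTROL tokens as TEXT and merges adjacent TEXT
// tokens. Every other token passes through unchanged.
function interpretEscapes(tokens: LayerToken[]): LayerToken[] {
  const out: LayerToken[] = [];
  for (const token of tokens) {
    let text: string | undefined;
    if (token.type === "TEXT") {
      text = token.data;
    } else if (token.type === "CONTROL" && ESCAPED_LITERALS.has(token.word)) {
      text = token.word;
    }
    if (text === undefined) {
      out.push(token); // groups and ordinary control words are left alone
      continue;
    }
    const last = out[out.length - 1];
    if (last !== undefined && last.type === "TEXT") {
      // Merge into the preceding text run instead of emitting a new token.
      out[out.length - 1] = { type: "TEXT", data: last.data + text };
    } else {
      out.push({ type: "TEXT", data: text });
    }
  }
  return out;
}
```

In a streaming setting the same mapping could live in an object-mode transform placed between the tokenizer and the consumer; the array version here just keeps the example short.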

Please let me know if you have an example of slashes or curly braces not coming out as text through the DeEncapsulate layer.