'LWSP' followed by 'CRLF'

sliekens commented 9 years ago

Consider the grammar rule for LWSP (Linear White Space) tokens:

LWSP =  *(WSP / CRLF WSP)

Does this grammar allow derivations that contain LWSP tokens immediately followed by CRLF tokens?

Token =  LWSP CRLF

If yes, how should the parser behave when a string contains only CRLF?

// Token =  LWSP CRLF
string input = "\r\n";

Two options:

1. Assume CRLF is not LWSP, because CRLF is not followed by WSP
2. Return a syntax error: expected WSP; found EOF

If option (1), how should the parser behave when the LWSP token is the last token in the grammar?

// Token =  LWSP
string input = "\r\n";

Two options:

1. Return a syntax error: expected WSP; found EOF
2. Return a syntax error: expected EOF; found CRLF
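
For illustration, here is a minimal Python sketch of what option (1) could look like, with a CRLF treated as part of LWSP only when a WSP character follows it. It is not taken from the library: the names `match_lwsp`, `parse_lwsp_crlf`, `parse_lwsp_only` and `ParseError` are made up for this example. Under that assumption, `Token = LWSP CRLF` accepts "\r\n" with an empty LWSP, and `Token = LWSP` rejects it with the "expected EOF; found CRLF" error.

```python
WSP = " \t"
CRLF = "\r\n"


class ParseError(Exception):
    """Hypothetical error type used only in this sketch."""


def match_lwsp(text: str, pos: int = 0) -> int:
    """Return the index just past the longest LWSP match starting at pos.

    A CRLF is consumed only when a WSP character follows it (option 1).
    """
    while pos < len(text):
        if text[pos] in WSP:
            pos += 1
        elif (text.startswith(CRLF, pos)
              and pos + 2 < len(text)
              and text[pos + 2] in WSP):
            pos += 3  # CRLF plus the WSP that follows it
        else:
            break
    return pos


def parse_lwsp_crlf(text: str) -> tuple:
    """Parse `Token = LWSP CRLF` against the whole input."""
    end = match_lwsp(text)
    if not text.startswith(CRLF, end):
        raise ParseError("expected CRLF")
    if end + len(CRLF) != len(text):
        raise ParseError("expected EOF")
    return text[:end], CRLF


def parse_lwsp_only(text: str) -> str:
    """Parse `Token = LWSP` against the whole input."""
    end = match_lwsp(text)
    if end != len(text):
        found = "CRLF" if text.startswith(CRLF, end) else repr(text[end])
        raise ParseError(f"expected EOF; found {found}")
    return text[:end]
```

With this approach the lone CRLF is never claimed by LWSP, so the error for `Token = LWSP` comes from the end-of-input check and not from the LWSP rule itself.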

sliekens commented 9 years ago

The current behavior for parsing linear white space is to assume that any CRLF token announces that the next token is WSP. An error is reported whenever this assumption is not fulfilled.

This implies that the parser cannot parse grammars that contain the LWSP CRLF combination. Expanded form: *(WSP / CRLF WSP) CRLF.

https://github.com/StevenLiekens/text-parser/blob/master/src/Text/src/Core/LWspLexer.cs

When LWspLexer.Read() is called for "\r\n", an exception is thrown.

When LWspLexer.TryRead() is called for "\r\n", the method returns false and the output parameter is set to invalid token data. Although the token data is invalid, it is still useful for analysing the error and recovering from it.
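
The strict behavior described above can be modeled roughly as follows. This is a Python analog written for this discussion, not the actual LWspLexer code: the function names, the `LexError` type, and the choice to return the partially consumed text as the "invalid token data" are all assumptions.

```python
WSP = " \t"
CRLF = "\r\n"


class LexError(Exception):
    """Hypothetical error type used only in this sketch."""


def try_read_lwsp(text: str, pos: int = 0) -> tuple:
    """Strict LWSP reader: every CRLF commits the reader to a following WSP.

    Returns (ok, consumed). When ok is False, `consumed` is the text read
    up to the failure, including the CRLF that was not followed by WSP.
    """
    start = pos
    while pos < len(text):
        if text[pos] in WSP:
            pos += 1
        elif text.startswith(CRLF, pos):
            pos += len(CRLF)
            if pos >= len(text) or text[pos] not in WSP:
                return False, text[start:pos]  # assumption violated
            pos += 1
        else:
            break
    return True, text[start:pos]


def read_lwsp(text: str, pos: int = 0) -> str:
    """Like try_read_lwsp, but raise instead of returning a failure flag."""
    ok, consumed = try_read_lwsp(text, pos)
    if not ok:
        raise LexError(f"expected WSP after CRLF; partial token {consumed!r}")
    return consumed
```

Because the reader never gives the CRLF back, an input such as " \r\n" fails here even though it is a valid `LWSP CRLF` sequence.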

If this behavior is insufficient, I'm afraid that we'll have to implement backtracking, so that the parser can be rewound to the first CRLF token that isn't linear white space.

sliekens commented 9 years ago

There is one more corner case that I can think of. Consider the following grammar:

Token = LWSP CRLF WSP

In this case, given an input such as "\r\n " (a CRLF followed by a single WSP), the LWSP parser happily consumes the entire input string. The CRLF parser then reports that the end of input was reached unexpectedly, and the WSP parser is never reached.
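
One way to handle this case is to let the sequence retry shorter LWSP matches when the rest of the rule fails. The Python sketch below shows the idea and is not necessarily how any real implementation solves it: `lwsp_ends` and `match_token` are hypothetical names, and the sketch assumes the whole input must match `Token`.

```python
from typing import Iterator, Optional, Tuple

WSP = " \t"
CRLF = "\r\n"


def lwsp_ends(text: str, pos: int) -> Iterator[int]:
    """Yield every index where an LWSP match starting at pos can end.

    The longest match comes first and the empty match comes last.
    """
    ends = [pos]
    while pos < len(text):
        if text[pos] in WSP:
            pos += 1
        elif (text.startswith(CRLF, pos)
              and pos + 2 < len(text)
              and text[pos + 2] in WSP):
            pos += 3  # CRLF plus the WSP that follows it
        else:
            break
        ends.append(pos)
    yield from reversed(ends)


def match_token(text: str) -> Optional[Tuple[str, str, str]]:
    """Match `Token = LWSP CRLF WSP` against the whole input."""
    for end in lwsp_ends(text, 0):
        rest = text[end:]
        # The remainder must be exactly CRLF followed by one WSP.
        if len(rest) == 3 and rest.startswith(CRLF) and rest[2] in WSP:
            return text[:end], CRLF, rest[2]
    return None  # no split of the input satisfies the rule
```

For the input "\r\n ", the greedy LWSP match (all three characters) is tried first and fails, after which the empty LWSP match lets CRLF and WSP consume the input.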

sliekens commented 9 years ago

Fixed all of these issues in a recent reimplementation.
