sliekens closed this issue 9 years ago
The current behavior for parsing linear white space is to assume that any CRLF token announces that the next token is WSP. An error is reported whenever this assumption is not fulfilled.
This implies that the parser cannot parse grammars that contain the LWSP CRLF combination. Expanded form: *(WSP / CRLF WSP) CRLF.
https://github.com/StevenLiekens/text-parser/blob/master/src/Text/src/Core/LWspLexer.cs
When LWspLexer.Read() is called for "\r\n", an exception is thrown.
When LWspLexer.TryRead() is called for "\r\n", the method returns false and the output parameter is set to invalid token data. Although the token data is invalid, it is still useful for analysing the error and recovering from it.
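To make the Read()/TryRead() behavior described above concrete, here is a minimal sketch in Python (not the project's C#). The names `Token`, `try_read` and `read` are hypothetical stand-ins and not the actual LWspLexer API. The sketch only assumes what is stated above: a CRLF must be followed by WSP, and a failed read still hands back the data consumed so far.

```python
from __future__ import annotations

from dataclasses import dataclass

CRLF = "\r\n"
WSP = (" ", "\t")


@dataclass
class Token:
    text: str    # characters consumed so far
    valid: bool  # False when a CRLF was not followed by WSP


def try_read(source: str, pos: int = 0) -> tuple[bool, Token]:
    """Greedy LWSP read that assumes every CRLF is followed by WSP."""
    start = pos
    while pos < len(source):
        if source[pos] in WSP:
            pos += 1
        elif source.startswith(CRLF, pos):
            pos += len(CRLF)
            if source[pos:pos + 1] not in WSP:
                # Assumption violated: fail, but keep what was consumed
                # so the caller can analyse the error and recover.
                return False, Token(source[start:pos], valid=False)
            pos += 1
        else:
            break
    return True, Token(source[start:pos], valid=True)


def read(source: str, pos: int = 0) -> Token:
    """Throwing variant built on top of try_read."""
    ok, token = try_read(source, pos)
    if not ok:
        raise ValueError(f"CRLF not followed by WSP after {token.text!r}")
    return token
```

With this sketch, `try_read("\r\n")` returns `False` together with a token whose text is the bare CRLF, while `read("\r\n")` raises.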
If this behavior is insufficient, I'm afraid that we'll have to implement backtracking, so that the parser can be rewound to the first CRLF token that isn't linear white space.
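A rough sketch of what that could look like, in Python rather than the project's C#. The function `read_lwsp` is hypothetical. It uses lookahead instead of an explicit rewind, which has the same effect here: a CRLF that is not followed by WSP is never consumed, so it stays available for whatever rule comes next.

```python
CRLF = "\r\n"
WSP = (" ", "\t")


def read_lwsp(source: str, pos: int = 0) -> int:
    """Return the end offset of LWSP = *(WSP / CRLF WSP) starting at pos."""
    while pos < len(source):
        if source[pos] in WSP:
            pos += 1
            continue
        after_crlf = pos + len(CRLF)
        if source.startswith(CRLF, pos) and source[after_crlf:after_crlf + 1] in WSP:
            # CRLF followed by WSP is part of the linear white space.
            pos = after_crlf + 1
            continue
        # Not white space, or a CRLF that does not continue the white space:
        # stop without consuming it.
        break
    return pos
```

For the input "\r\n" this returns offset 0 (an empty LWSP match) instead of reporting an error, and for " \r\n \r\n" it stops at offset 4, just before the trailing CRLF.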
There is one more corner case that I can think of. Consider the following grammar:
Token = LWSP CRLF WSP
In this case, the LWSP parser happily parses the entire input string. The problem is that the CRLF parser will now report that we have unexpectedly reached the end of input. Also, the WSP parser is never even reached.
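To illustrate this corner case, the sketch below (Python, hypothetical names, not the project's code) matches Token = LWSP CRLF WSP by trying every offset where LWSP could legally stop, longest first, and falling back to a shorter match when the rest of the rule fails. It assumes the whole input must be consumed by Token.

```python
from __future__ import annotations

CRLF = "\r\n"
WSP = (" ", "\t")


def lwsp_ends(source: str, pos: int) -> list[int]:
    """Offsets where LWSP = *(WSP / CRLF WSP) may stop, longest first."""
    ends = [pos]  # the empty match is always allowed
    while pos < len(source):
        if source[pos] in WSP:
            pos += 1
        elif source.startswith(CRLF, pos) and source[pos + 2:pos + 3] in WSP:
            pos += 3
        else:
            break
        ends.append(pos)
    return list(reversed(ends))


def match_token(source: str) -> int | None:
    """Match Token = LWSP CRLF WSP; return where LWSP ends, or None."""
    for end in lwsp_ends(source, 0):
        rest = source[end:]
        # The remainder must be exactly CRLF followed by one WSP.
        if len(rest) == 3 and rest.startswith(CRLF) and rest[2] in WSP:
            return end
    return None
```

For the input " \r\n " the greedy LWSP match ends at offset 4 and leaves nothing for CRLF WSP. The fallback to offset 1 lets the remaining "\r\n " satisfy the rest of the rule.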
Fixed all these things in a recent reimplementation.
Consider the grammar rule for LWSP (Linear White Space) tokens:
LWSP = *(WSP / CRLF WSP)
Does this grammar allow derivations that contain LWSP tokens immediately followed by CRLF tokens?
If yes, how should the parser behave when a string contains only CRLF?
Two options:
If option (1), how should the parser behave when the LWSP token is the last token in the grammar?
Two options: