I was a bit suspicious about the XWiki grammar file so I analyzed it and
changed it to make the grammar strictly LL(1) (no
lookaheads in the parser). The result can be summarized as:
+ Strictly LL(1). This means that it is safe to control the token
manager from the Java code. (The parser is also faster, but I do
not expect the difference to be noticeable.)
+ The actual grammar part is more compact and easier to follow.
- In the lexical scanner, switches between states are made from the
Java code, which makes the scanner harder to follow.
- It relies on controlling the lexical scanner from the Java code,
thus it is no longer safe to use lookaheads in the parser.
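
To illustrate why state switching from the Java code and parser lookahead do not mix, here is a minimal sketch. It is a hand-written toy scanner, not the generated JavaCC token manager; the class names, the two states and the token names are all made up for the example. The `switchTo` method plays the role that `SwitchTo` plays in a JavaCC-generated token manager.

```java
import java.util.ArrayList;
import java.util.List;

// Toy example: every name here is hypothetical and not taken from the XWiki grammar.
class LookaheadDemo {
    enum State { DEFAULT, VERBATIM }

    static class StatefulScanner {
        private final String input;
        private int pos = 0;
        private State state = State.DEFAULT;

        StatefulScanner(String input) { this.input = input; }

        // Called by the parser; affects only tokens that have not been scanned yet.
        void switchTo(State next) { state = next; }

        // Returns the next token, or null at end of input.
        String next() {
            if (pos >= input.length()) return null;
            if (state == State.VERBATIM) {
                // In VERBATIM, everything up to "}}}" is one text token.
                int end = input.indexOf("}}}", pos);
                if (end < 0) end = input.length();
                if (end > pos) {
                    String text = input.substring(pos, end);
                    pos = end;
                    return "TEXT(" + text + ")";
                }
            }
            if (input.startsWith("{{{", pos)) { pos += 3; return "OPEN"; }
            if (input.startsWith("}}}", pos)) { pos += 3; return "CLOSE"; }
            if (input.startsWith("**", pos)) { pos += 2; return "BOLD"; }
            return "CHAR(" + input.charAt(pos++) + ")";
        }
    }

    // Simulates a parser that switches scanner state when it consumes a token.
    // With peekAhead, the following token is fetched before the switch happens,
    // which is what a one-token lookahead amounts to.
    static List<String> tokens(String input, boolean peekAhead) {
        StatefulScanner scanner = new StatefulScanner(input);
        List<String> out = new ArrayList<>();
        String tok = scanner.next();
        while (tok != null) {
            String peeked = peekAhead ? scanner.next() : null;
            if (tok.equals("OPEN")) scanner.switchTo(State.VERBATIM);
            if (tok.equals("CLOSE")) scanner.switchTo(State.DEFAULT);
            out.add(tok);
            tok = peekAhead ? peeked : scanner.next();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("{{{**}}}", false)); // [OPEN, TEXT(**), CLOSE]
        System.out.println(tokens("{{{**}}}", true));  // [OPEN, BOLD, CLOSE]
    }
}
```

Without lookahead the `**` inside the block is scanned in the VERBATIM state and comes out as text. With lookahead it has already been scanned in the DEFAULT state by the time the parser switches, so it is reported as a BOLD token.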
Personally, I think that the new version is much easier to work with.
I am very frustrated with JavaCC and I find it too primitive to be
convenient. I think that various tricks with the scanner simplify
matters somewhat, and therefore it is important to use parser
lookahead conservatively. Also, I think that it is more important
that the parser is clear and concise than that the scanner is.
I have taken care not to change the behavior of the parser, with two
exceptions. First, dangling ))) tokens are treated as special
characters. (A comment in the grammar file suggested that this was the
desired behavior.) Second, I explicitly open a new paragraph for
content after macro and verbatim blocks, rather than relying on the
listener to infer such an opening.
The new version passes all unit tests except the dangling ))) test,
and successfully parses the page XWiki.XWikiSyntax. Thus, it is
likely that switching to the new version would work, but I still
wouldn't bet my life that nothing would break.
Below is a summary of things that I find surprising in the XWiki
grammar and that we should consider changing:
Unexpected: Parameters at the beginning of a line terminate a block or a
paragraph, even if they are inline within that line.
Expected: Only empty lines or appropriate end-tokens may terminate a
block or a paragraph. Alternatively, the parameters may
terminate a block or paragraph if they are indeed block
parameters.
Unexpected: No empty-lines event is generated if there are exactly two
newlines before something that is not a paragraph.
Expected: An empty-lines event should always be generated on
sequences of two or more newline characters.
Unexpected: Header tokens can be arbitrarily long sequences of equal
signs. If there are more than six, header level 6 is
chosen. The optional end token does not need to match
the start token.
Expected: Sequences of more than 6 equal signs should be special
characters. Mismatching end tokens should be special
characters.
Unexpected: If there are block parameters before an empty line, an
empty paragraph is generated and the parameters are
applied to it.
Expected: Block parameters should be applied to the following
header, table, paragraph etc.
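
Two of the expectations above (header tokens and empty lines) can be pinned down as small rules. The following is only a sketch of how I read them: the helper names are made up, "matching" end token is assumed to mean "same number of equal signs", and reporting the length of each newline run is an assumption about what an empty-lines event would carry.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: these helpers are hypothetical and not part of the XWiki parser.
class ExpectedRules {
    static final int MAX_HEADER_LEVEL = 6;
    static final int NOT_A_HEADER = -1;

    // Behavior reported as unexpected: any run of '=' is a header, clamped to level 6.
    static int clampedHeaderLevel(int equalSigns) {
        return Math.min(equalSigns, MAX_HEADER_LEVEL);
    }

    // Proposed: runs of more than six '=' are not header tokens at all.
    static int strictHeaderLevel(int equalSigns) {
        return (equalSigns >= 1 && equalSigns <= MAX_HEADER_LEVEL) ? equalSigns : NOT_A_HEADER;
    }

    // Proposed: an end token closes the header only if it has the same length
    // as the start token (assumed meaning of "match").
    static boolean endTokenMatches(int startSigns, int endSigns) {
        return startSigns == endSigns;
    }

    // Proposed: report every run of two or more '\n', whatever follows it.
    // Returns the length of each such run, in order of appearance.
    static List<Integer> newlineRuns(String text) {
        List<Integer> runs = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int start = i;
            while (i < text.length() && text.charAt(i) == '\n') i++;
            if (i - start >= 2) runs.add(i - start);
            if (i == start) i++; // not a newline: skip one character
        }
        return runs;
    }
}
```

For example, a run of eight equal signs gives level 6 under the clamped rule but `NOT_A_HEADER` under the strict rule, and `newlineRuns` reports the two-newline run in `"a\n\n= h ="` even though a header, not a paragraph, follows it.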
There are also some things I think are strangely implemented in the
old version:
* A line of text that follows a macro block or a verbatim block
implicitly opens a new paragraph but closes it explicitly. Why
not also open it explicitly?
* The first cell in a table row is parsed completely differently from
the following cells. Sometimes the function that parses cell
contents returns before reaching the end of the cell, and the rest
is parsed as ordinary inline content. Since the listener API
provides onTableCell and the end of the cell is inferred, the
result is as expected anyway. Looks funny, though. Example:
|a(((b)))c|a(((b)))c
Original issue reported on code.google.com by AndreasZ...@gmail.com on 18 Jan 2010 at 6:33