LL(1) grammar for XWiki/2.0 syntax

GoogleCodeExporter commented 8 years ago

I was a bit suspicious about the XWiki grammar file so I analyzed it and
changed it to make the grammar strictly LL(1) (no
lookaheads in the parser).  The result can be summarized as:

+ Strictly LL(1).  This means that it is safe to control the token
  manager from the Java code.  (The parser is also faster, but I do
  not expect the difference to be noticeable.)
+ The actual grammar part is more compact and easier to follow.
- Switches between lexical states in the scanner are made from the
  Java code, so the scanner is harder to follow.
- It relies on controlling the lexical scanner from the Java code,
  thus it is no longer safe to use lookaheads in the parser.
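
To illustrate why state switching from Java code and parser lookahead do not mix, here is a minimal toy model. It is not JavaCC's token manager; the class names, the token names and the `buffered` parameter are all made up for this sketch. The scanner tokenizes according to whatever state is current when a token is scanned, and the parser changes that state as a side effect of consuming a token. If more than one token has already been scanned ahead, the switch comes too late.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class StateSwitchSketch {
    enum State { DEFAULT, VERBATIM }

    /** Toy scanner whose lexical state is set from outside, by the parser. */
    static final class ToyScanner {
        private final String input;
        private int pos = 0;
        State state = State.DEFAULT;

        ToyScanner(String input) { this.input = input; }

        boolean atEnd() { return pos >= input.length(); }

        /** Scans one token according to the state that is current right now. */
        String next() {
            if (state == State.VERBATIM) {
                // In verbatim state everything up to "}}}" is plain text.
                int end = input.indexOf("}}}", pos);
                if (end < 0) end = input.length();
                String text = input.substring(pos, end);
                pos = end;
                return "TEXT(" + text + ")";
            }
            if (input.startsWith("{{{", pos)) { pos += 3; return "OPEN"; }
            if (input.startsWith("}}}", pos)) { pos += 3; return "CLOSE"; }
            if (input.startsWith("**", pos)) { pos += 2; return "BOLD"; }
            return "CHAR(" + input.charAt(pos++) + ")";
        }
    }

    /**
     * Consumes tokens one by one and switches the scanner state on the way.
     * 'buffered' is the number of tokens scanned before the parser acts on
     * the first of them (hypothetical knob: 1 models no extra lookahead).
     */
    static List<String> parse(String input, int buffered) {
        ToyScanner scanner = new ToyScanner(input);
        Deque<String> buffer = new ArrayDeque<>();
        List<String> consumed = new ArrayList<>();
        while (true) {
            while (buffer.size() < buffered && !scanner.atEnd()) {
                buffer.add(scanner.next());
            }
            if (buffer.isEmpty()) break;
            String token = buffer.poll();
            consumed.add(token);
            // State switches happen in parser code, after the token is consumed.
            if (token.equals("OPEN")) scanner.state = State.VERBATIM;
            else if (token.startsWith("TEXT(")) scanner.state = State.DEFAULT;
        }
        return consumed;
    }

    public static void main(String[] args) {
        System.out.println(parse("{{{**}}}", 1));
        System.out.println(parse("{{{**}}}", 2));
    }
}
```

With `buffered = 1` the input `{{{**}}}` comes out as `[OPEN, TEXT(**), CLOSE]`. With `buffered = 2` the `**` has already been scanned in the default state before the switch takes effect, so it shows up as a `BOLD` token inside the verbatim block.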

Personally, I think that the new version is much easier to work with.
I am very frustrated with JavaCC and I find it too primitive to be
convenient. I think that various tricks with the scanner simplify
matters somewhat, and therefore it is important to use parser
lookahead conservatively.  Also, I think that it is more important
that the parser is clear and concise than that the scanner is.

I have taken care to not change the behavior of the parser with two
exceptions: dangling ))) tokens are treated as special characters.  (A
comment in the grammar file suggested that this was the desired
behavior.)  Also, I explicitly open up a new paragraph for content
after macro and verbatim blocks, rather than relying on the
listener to infer such an opening.

The new version passes all unit tests, except the dangling )))-test,
and successfully parses the page XWiki.XWikiSyntax.  Thus, it is
likely that switching to the new version would work, but I still
wouldn't bet my life on nothing breaking.

Below is a summary of things that I find surprising in the XWiki
grammar and that we should consider changing:

Unexpected: Parameters at the beginning of a line terminate a block or
            a paragraph, even if they belong to inline content on
            that line.
Expected:   Only empty lines or appropriate end-tokens may terminate a
            block or a paragraph.  Alternatively, the parameters may
            terminate a block or paragraph if they are indeed block
            parameters.

Unexpected: No empty-lines event is generated if there are exactly two
            newlines before something that is not a paragraph.
Expected:   An empty-lines event should always be generated on
            sequences of two or more newline characters.

Unexpected: Header tokens can be arbitrarily long sequences of equal
            signs.  If there are more than six, header level 6 is
            chosen.  The optional end token does not need to match
            the start token.
Expected:   Sequences of more than 6 equal signs should be special
            characters.  Mismatching end tokens should be special
            characters.

Unexpected: If there are block parameters before an empty line, an
            empty paragraph is generated and the parameters are
            applied to it.
Expected:   Block parameters should be applied to the following
            header, table, paragraph etc.
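
As a concrete reading of the header rule proposed above, here is a small sketch of the "Expected" behavior. It is not taken from the grammar file; the class and method names are made up, and it only looks at a single line in isolation. A leading run of one to six equal signs is a header token, a longer run is not, and a trailing run is stripped as the end token only when its length matches the start token.

```java
class HeaderRuleSketch {
    /** Number of '=' characters at the start of s. */
    static int leadingEquals(String s) {
        int n = 0;
        while (n < s.length() && s.charAt(n) == '=') n++;
        return n;
    }

    /** Number of '=' characters at the end of s. */
    static int trailingEquals(String s) {
        int n = 0;
        while (n < s.length() && s.charAt(s.length() - 1 - n) == '=') n++;
        return n;
    }

    /** Header level 1..6, or 0 when the leading run is not a header token. */
    static int headerLevel(String line) {
        int n = leadingEquals(line);
        return (n >= 1 && n <= 6) ? n : 0;
    }

    /**
     * Header content. A matching end token is stripped; a mismatching one
     * stays in the content as ordinary characters. Non-header lines are
     * returned unchanged.
     */
    static String headerContent(String line) {
        int level = headerLevel(line);
        if (level == 0) return line;
        String rest = line.substring(level).trim();
        if (trailingEquals(rest) == level) {
            rest = rest.substring(0, rest.length() - level);
        }
        return rest.trim();
    }

    public static void main(String[] args) {
        System.out.println(headerLevel("=== Title ===") + " " + headerContent("=== Title ==="));
        System.out.println(headerLevel("== Title ===") + " " + headerContent("== Title ==="));
        System.out.println(headerLevel("======= Title"));
    }
}
```

Under this reading, `== Title ===` is a level 2 header whose content is `Title ===`, and a line starting with seven equal signs is not a header at all.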

There are also some things I think are strangely implemented in the
old version:

* A line of text that follows a macro block or a verbatim block
  implicitly opens up a new paragraph and closes it explicitly.  Why
  not also open it explicitly?

* The first cell in a table row is parsed completely differently from
  the following cells.  Sometimes the function that parses cell
  contents returns before reaching the end of the cell, and the rest
  is parsed as ordinary inline content.  Since the listener API
  provides onTableCell and the end of the cell is inferred, the
  result is as expected anyway.  It looks odd, though.  Example:

|a(((b)))c|a(((b)))c

Original issue reported on code.google.com by AndreasZ...@gmail.com on 18 Jan 2010 at 6:33

GoogleCodeExporter commented 8 years ago

Original comment by thomas.m...@gmail.com on 25 Mar 2010 at 4:58

Changed state: Fixed

GoogleCodeExporter commented 8 years ago

Issue 176 has been merged into this issue.

Original comment by thomas.m...@gmail.com on 29 Mar 2010 at 9:26

quetzai / wikimodel

LL(1) grammar for XWiki/2.0 syntax #169