sirthias / pegdown

A pure-Java Markdown processor based on a parboiled PEG parser supporting a number of extensions
http://pegdown.org
Apache License 2.0

Blank Line and Node Source Position handling is off in the parser and some current tests expect this. #173

Closed vsch closed 9 years ago

vsch commented 9 years ago

I have issues with ListItem parsing that I am addressing. However, the fix results in failing regression tests, and those tests are wrong: in some cases the expectation does not match the markdown source; in others the expected node source range is based on the current parser's incorrect implementation.

Wrong expectation from markdown source

Here is one test that expects wrong HTML output, based on existing errors in list parsing, from resources/pegdown/Bug_in_1.0.0.md:

* List Item A

        This is a verbatim line
        and another one

        all of these should become
        part of the

        same verbatim block

    only these lines here

        should interrupt
        the

        verbatims blocks

* another List Items

and something completely different here

It is expecting the `* another List Items` item to be tightly spaced, but this is wrong because a blank line precedes it. To match the expected HTML, the markdown must have no blank line before the second list item, as in:

* List Item A

        This is a verbatim line
        and another one

        all of these should become
        part of the

        same verbatim block

    only these lines here

        should interrupt
        the

        verbatims blocks
* another List Items

and something completely different here
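The tight/loose rule at issue here can be sketched as a simple predicate (illustrative only; names are hypothetical and this is not pegdown's actual implementation): an item preceded by a blank line makes the list loose, so its text must be wrapped in `<p>` tags.

```java
// Sketch of the tight/loose list-item rule the fix enforces.
// NOTE: hypothetical helper, not part of pegdown.
public class LooseItemCheck {
    // Returns true if the line at index 'item' is preceded by a blank line,
    // i.e. the item should render loose: <li><p>text</p></li> instead of <li>text</li>.
    static boolean isLoose(String[] lines, int item) {
        return item > 0 && lines[item - 1].trim().isEmpty();
    }

    public static void main(String[] args) {
        String[] withBlank = {"* List Item A", "", "* another List Items"};
        String[] noBlank   = {"* List Item A", "* another List Items"};
        System.out.println(isLoose(withBlank, 2)); // true  -> loose
        System.out.println(isLoose(noBlank, 1));   // false -> tight
    }
}
```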

Wrong Expectation of Node source pos range

The current parser erroneously appends "\n\n" to a block, if it is loose, before recursively parsing it. This extends the item's node range two characters into the next line. The block of text should contain only text actually recognized from the source; otherwise the source ranges fall out of sync. It gets worse for nested lists, which are recursively parsed again, adding another two characters per level, and so on.

Manual checking is a pain, but in one case, checking character offsets from the beginning of the source text confirmed that the node is erroneously extended by two characters. The AST tests, however, expect this erroneous behaviour. The same problem exists for Definitions, where the parser rule also appends "\n\n" to the block before recursively parsing it.
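The offset drift can be modelled in a few lines (a toy model; the names are hypothetical and this is not pegdown's parser): appending "\n\n" before the recursive parse overshoots the real end offset by two characters per nesting level.

```java
// Toy model of the source-range bug: the buggy path appends "\n\n" to the
// item text before each recursive parse, inflating the reported end offset.
// NOTE: hypothetical sketch, not pegdown code.
public class RangeDrift {
    // End offset reported for an item starting at 'start' whose real source
    // text is 'text', recursively parsed through 'depth' loose levels.
    static int buggyEnd(int start, String text, int depth) {
        int len = text.length();
        for (int i = 0; i < depth; i++) len += 2; // each recursion appends "\n\n"
        return start + len;
    }

    static int fixedEnd(int start, String text) {
        return start + text.length(); // only text actually parsed from the source
    }

    public static void main(String[] args) {
        String item = "nothing more\n"; // 13 chars starting at offset 144
        System.out.println(buggyEnd(144, item, 1)); // 159, the old (wrong) AST expectation
        System.out.println(fixedEnd(144, item));    // 157, the corrected range
    }
}
```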

Here is one that is off in the node's source position, from resources/pegdown/AstText.md:

# Ast Test

This _is_ a **simple** & small text to

> test
> yes, test

the functionality of

* AST index creation
    - one more level

* nothing more

        Some code!

The expectation for the `* nothing more` list item is:

ListItemNode [144-177]
  RootNode [144-159]
    ParaNode [144-159]
      SuperNode [144-156]
        TextNode [144-156] 'nothing more'
  RootNode [162-177]
    VerbatimNode [162-177] 'Some code!\n'

The corrected expectation for above should read:

ListItemNode [144-177]
  RootNode [144-157]
    SuperNode [144-157]
      TextNode [144-156] 'nothing more'
  RootNode [157-177]
    VerbatimNode [162-177] 'Some code!\n'

Highlighting the md source range 144-159 produces the following source highlight:

(screenshot, 2015-08-15: highlighted source for range 144-159)

With the parser fixed so that it no longer appends characters that were never parsed from the source, the resulting node source position changes to 144-157, which is correct and corresponds to the highlighted text:

(screenshot, 2015-08-15: highlighted source for range 144-157)

Automated testing is not a magic wand

These are examples of automated testing's Achilles' heel: a large passing test suite does not mean the product works correctly, only that it works consistently. If tests have not themselves been validated for correctness, they get in the way of correcting the code, because the burden falls on whoever rocks the boat to make the erroneous tests pass.

Now I have to weigh the benefit of a correctly operating parser against a lot of extra work validating the existing failing tests and correcting their flawed expectations.

I did say that doing detailed validation of parsing and the tests looked like "A career decision". It certainly is starting to pan out that way.

Does anyone have a fast way of tracing test results back to test source/expectations?

Anyone with more experience working with sbt and the pegdown source, or with validating its tests: please advise on anything that would make this more efficient. What I am looking for:

  1. Is there an sbt option to output a diff of actual vs. expected, instead of printing both in full and leaving it to me to compare?
  2. If not, is there a way to dump actual/expected automatically to files so that I can use a diff tool?
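Absent a built-in sbt option, a small helper along these lines could dump both strings to files for an external diff tool (a hypothetical helper, not part of the pegdown test suite; names and paths are made up):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DumpOnMismatch {
    // Writes <dir>/<name>.expected and <dir>/<name>.actual when the two
    // strings differ, so an external diff tool can compare them.
    // Returns true if a mismatch was dumped. NOTE: hypothetical helper.
    static boolean dumpIfDifferent(String name, String expected, String actual, Path dir)
            throws IOException {
        if (expected.equals(actual)) return false;
        Files.createDirectories(dir);
        Files.write(dir.resolve(name + ".expected"), expected.getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve(name + ".actual"), actual.getBytes(StandardCharsets.UTF_8));
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "pegdown-diffs");
        boolean dumped = dumpIfDifferent("AstText", "<li>a</li>", "<li><p>a</p></li>", dir);
        System.out.println(dumped); // true: now run a diff tool on the two files
    }
}
```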

Otherwise, this fix will take a while to land as I muddle through, finding the failed tests, validating them manually, and correcting the flawed ones so that the fix passes.

Cheers.

vsch commented 9 years ago

Done. It was like pulling on a loose thread of a sweater.

sirthias commented 9 years ago

Sorry, Vladimir, for not being able to jump in and support you in your endeavour here. I was (and still am) on family vacation this month.

It's amazing to see you pull through nevertheless. Hats off to you!