sirthias / pegdown

A pure-Java Markdown processor based on a parboiled PEG parser supporting a number of extensions
http://pegdown.org
Apache License 2.0
1.29k stars 218 forks source link

Empty code block #222

Open AntoniBertel opened 8 years ago

AntoniBertel commented 8 years ago

Hi,

```javascript\n\n```

Github will be display this code as empty code block, but parser will be display it as plain markdown.

Before fix:

<p>```javascript</p>
<p>```</p>

After fix:

<p><code>javascript

</code></p>

Please see pull request - https://github.com/sirthias/pegdown/pull/221

AntoniBertel commented 8 years ago

Hi,

I can't use pegdown in live with this issue. Please look at the pull request.

vsch commented 8 years ago

@AntoniBertel, your proposed change does address the real issue which is in the CodeFenceBlock rule not the inline Code rule.

GitHub renders your code sample as a code fence with a blank line <pre><code>....</code></pre>, your fix renders it as <p><code>...</code></p>, which is inline code with a blank line. The latter is not allowed under Markdown rules for inlines.

To get pegdown to accept code fenced blocks with blank lines you need to make changes to Parser.java and to DefaultVerbatimSerializer.java

Parser.java need the following function to be changed to:

    public Rule FencedCodeBlock() {
        StringBuilderVar text = new StringBuilderVar();
        Var<Integer> markerLength = new Var<Integer>();
        return NodeSequence(
                // vsch: test to see if what appears to be a code fence is just inline code
                // vsch: change to accept code fence with all blank lines
                CodeFence(markerLength),
                OneOrMore(
                        TestNot(CodeFence(markerLength)),
                        ZeroOrMore(TestNot(Newline()), ANY, text.append(matchedChar())),
                        Sequence(Newline(), text.append(matchedChar()))
                ),
                push(new VerbatimNode(text.getString(), popAsString())),
                CodeFence(markerLength), drop()
        );
    }

DefaultVerbatimSerializer is not expecting all blank lines in the text so it needs test for end of text:

    @Override
    public void serialize(final VerbatimNode node, final Printer printer) {
        Attributes attributes = new Attributes();

        if (!StringUtils.isEmpty(node.getType())) {
            attributes.add("class", node.getType());
        }

        printer.println().print("<pre><code").print(printer.preview(node, "code", attributes, false)).print('>');
        String text = node.getText();
        // print HTML breaks for all initial newlines
        while (!text.isEmpty() && text.charAt(0) == '\n') {
            printer.print("<br/>");
            text = text.substring(1);
        }
        printer.printEncoded(text);
        printer.print("</code></pre>");

    }
AntoniBertel commented 8 years ago

@vsch, Thank you very much for the answer, explanation and solution! I fully agree that semantically correct to print 'pre' tag, as it does github. When you are going to release a new pegdown version? I ask this because it would be great to just increase version in dependency manager :).

vsch commented 7 years ago

@AntoniBertel, I am no longer using pegdown in my project because of the difficulty in maintaining and modifying pegdown parser, its performance issues and long or infinite loop parsing on some pathological input like [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[.

I rewrote commonmark-java to replace pegdown in my Markdown Navigator plugin for IntelliJ IDEs: https://github.com/vsch/idea-multimarkdown. The parser project is https://github.com/vsch/flexmark-java has very detailed source based AST with source offset for every part of the element. I need that for syntax highlighting and other plugin source reliant features.

It is CommonMark 0.27 (GitHub Comments) compliant but In the last release I added list parsing options to emulate list indentation rules used by: markdown.pl, MultiMarkdown (like pegdown 4 space indents) and kramdown (GitHub Docs). The only extensions that pegdown has that I did not yet implement are: typographic quotes, smarts and definition lists. The rest of the extensions are available, with some extra ones that pegdown does not have.

As an added bonus and what motivated me to write my own parser, the parsing is 30-50x faster than pegdown on average documents and several thousand times faster on pegdown's pathological input.

The AST is very regular, part of all the tests not just ones geared for AST testing and source offsets are free of idiosyncrasies and easily adjusted. The AST is fully modifiable unlike pegdown's with next, prev and parent links. There are many ways to extend the parser. I geared it for extensibility and made sure handling of AST nodes is uniform whether it is part of the core or part of an extension. So extensions can and do add their own nodes to the AST.

I also have code to convert from pegdown Extensions flags to my uniform configuration API that I used in the initial migration from pegdown to flexmark-java. So if you are interested in trying it out let me know and I will give you a hand in migrating to it.