johannhof / markdown.rs

Rust Markdown parsing library
Apache License 2.0

Code blocks with dashed header elements don't parse correctly #35

Open schell opened 4 years ago

schell commented 4 years ago

First of all, thank you for this great crate!

It seems that if a code block contains something that looks like a Markdown element, such as a header, the code block is cut short. Here is an example. Given this input Markdown (with the triple backticks "escaped" for this GitHub comment):

Here is a test:

\```haskell
--------------------------------------------------------------------------------
-- Big crazy comment
--------------------------------------------------------------------------------
data MyType =
    Variant1
  | Variant2
  | Variant3
  deriving (Show)
\```

That was the test.

We would expect that the output tokens would be something like:

[
    Paragraph(
        [
            Text(
                "Here is a test:",
            ),
        ],
    ),
    CodeBlock(
        Some(
            "haskell",
        ),
        "----------------------------------------\n-- Big crazy comment\n-------------------------------------\ndata MyType =\n    Variant1\n  | Variant2\n  | Variant3\n  deriving (Show)",
    ),
    Paragraph(
        [
            Text(
                "That was the test.",
            ),
        ],
    ),
]

but instead we see:

[
    Paragraph(
        [
            Text(
                "Here is a test:",
            ),
        ],
    ),
    Header(
        [
            Code(
                "`",
            ),
            Text(
                "haskell",
            ),
        ],
        2,
    ),
    Header(
        [
            Text(
                "-- Big crazy comment",
            ),
        ],
        2,
    ),
    Paragraph(
        [
            Text(
                "data MyType =",
            ),
        ],
    ),
    CodeBlock(
        None,
        "Variant1",
    ),
    Paragraph(
        [
            Text(
                "| Variant2",
            ),
            Text(
                "\n",
            ),
            Text(
                "| Variant3",
            ),
            Text(
                "\n",
            ),
            Text(
                "deriving (Show)",
            ),
            Text(
                "\n",
            ),
            Code(
                "`",
            ),
        ],
    ),
    Paragraph(
        [
            Text(
                "That was the test.",
            ),
        ],
    ),
]

gennyble commented 4 years ago

This happened because we checked for a setext header before checking for a code fence, so the dashed lines inside the fence were matched as setext underlines, which is obviously not right. The CommonMark spec says: "The lines of text must be such that, were they not followed by the setext heading underline, they would be interpreted as a paragraph: they cannot be interpretable as a code fence, ATX heading, block quote, thematic break, list item, or HTML block."

I've moved the setext match to be the last one we check and pushed it to the master branch. I can't update on crates.io, so I'll gently ping @johannhof. I'll leave this issue open until word from them.
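
To illustrate the ordering change, here is a minimal, self-contained sketch. It is not the crate's actual code: the `Block` enum, `setext_level`, and `parse_blocks` are hypothetical names made up for this comment, inline spans are ignored, and every paragraph is assumed to be a single line. The only point is that the code fence check runs before the setext check.

```rust
/// Simplified block types, loosely modeled on the tokens shown above.
/// (Hypothetical: the real crate's types carry inline spans, not plain strings.)
#[derive(Debug, PartialEq)]
enum Block {
    Paragraph(String),
    Header(String, usize),
    CodeBlock(Option<String>, String),
}

/// Returns the header level if `line` is a setext underline:
/// only `=` characters (level 1) or only `-` characters (level 2).
fn setext_level(line: &str) -> Option<usize> {
    let t = line.trim();
    if t.is_empty() {
        None
    } else if t.chars().all(|c| c == '=') {
        Some(1)
    } else if t.chars().all(|c| c == '-') {
        Some(2)
    } else {
        None
    }
}

/// Splits `input` into blocks, trying the code fence before the setext header.
fn parse_blocks(input: &str) -> Vec<Block> {
    let lines: Vec<&str> = input.lines().collect();
    let mut blocks = Vec::new();
    let mut i = 0;
    while i < lines.len() {
        let line = lines[i];
        if line.trim().is_empty() {
            i += 1;
            continue;
        }
        // 1. Code fence first: everything up to the closing fence is taken
        //    verbatim, so dashed lines inside it are never seen by later checks.
        if let Some(info) = line.strip_prefix("```") {
            let info = info.trim();
            let lang = if info.is_empty() { None } else { Some(info.to_string()) };
            let mut body = Vec::new();
            i += 1;
            while i < lines.len() && !lines[i].starts_with("```") {
                body.push(lines[i]);
                i += 1;
            }
            i += 1; // step past the closing fence (or past the end of input)
            blocks.push(Block::CodeBlock(lang, body.join("\n")));
            continue;
        }
        // 2. Setext header last: only reached when the line is not a fence.
        if i + 1 < lines.len() {
            if let Some(level) = setext_level(lines[i + 1]) {
                blocks.push(Block::Header(line.trim().to_string(), level));
                i += 2;
                continue;
            }
        }
        // 3. Fallback: a one-line paragraph (a simplification for this sketch).
        blocks.push(Block::Paragraph(line.trim().to_string()));
        i += 1;
    }
    blocks
}
```

With this ordering, the example from the report comes out as a paragraph, a single `CodeBlock` tagged `haskell` with the dashed comment lines intact, and a final paragraph, while a plain `Title` followed by `-----` outside a fence is still treated as a level 2 header.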

While you're pinged, I'd like to ask whether your goal with the library has been to follow CommonMark. I don't believe the current implementation of ordered lists complies with the spec. I'd be happy to get it there if you think we should. (Sorry for being freakishly inactive, and thank you for adding me as a maintainer. I'll start doing that now.)

hoijui commented 3 years ago

@gennyble Following CommonMark would be really great! We would need that too. In the end, we would also need some features from GFM and pandoc's Markdown, but those are tiny additions we could hack on top, just for our case. Can we help in any way?

Isn't there already some parser description for CommonMark, which would just have to be ported to the syntax of a Rust parser library?