miyuchina / mistletoe

A fast, extensible and spec-compliant Markdown parser in pure Python.
MIT License

Blocks in between footlink definitions are skipped when parsing #132

Closed by pbodnar 1 year ago

pbodnar commented 2 years ago

Once again, this is a problem found in the footnotes parsing. Such input is probably pretty rare, but it would be good to have this fixed.

An example is given within a comment of test/test_block_token.py:

    # this tests an edge case, it shouldn't occur in normal documents
    def test_parse_with_para_right_after(self):
        lines = ['[key 1]: value1\n',
                 # 'something1\n', # if uncommented,
                 #     this and the next line should be treated as a paragraph
                 #     - this line gets skipped instead now
                 '[key 2]: value2\n',
                 'something2\n',
                 '\n',
                 '[key 3]: value3\r\n', # '\r', or any other whitespace
                 'something3\n']
        token = block_token.Document(lines)
        self.assertEqual(token.footnotes, {"key 1": ("value1", ""),
                                           "key 2": ("value2", ""),
                                           "key 3": ("value3", "")})
        self.assertEqual(len(token.children), 2)
        self.assertIsInstance(token.children[0], block_token.Paragraph)
        self.assertEqual(token.children[0].children[0].content, "something2")
        self.assertEqual(token.children[1].children[0].content, "something3")

The cause of this problem seems to be that, while searching for the start of the next footlink definition (i.e. [) within a block of adjacent lines, newlines are not considered at all. Lines containing anything else (like the paragraph something1) are therefore simply skipped during parsing.
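
To make the suspected cause concrete, here is a minimal sketch of a scanner that only looks for the next definition and ignores line boundaries. It is an illustration only, not mistletoe's actual code: the `DEFINITION` pattern and the `scan_ignoring_lines` helper are hypothetical and heavily simplified (one-line definitions, no titles).

```python
import re

# Hypothetical, simplified pattern for a one-line definition like "[key]: value".
DEFINITION = re.compile(r"\[([^\]]+)\]:[ \t]*(\S+)")

def scan_ignoring_lines(block):
    """Collect definitions by jumping from one match to the next.

    Text between two matches is never inspected, so any line that is
    not a definition is silently dropped.
    """
    found = []
    consumed_until = 0
    for match in DEFINITION.finditer(block):
        found.append((match.group(1), match.group(2)))
        consumed_until = match.end()
    # Only the text after the last definition is handed on for further parsing.
    return found, block[consumed_until:]

block = "[key 1]: value1\nsomething1\n[key 2]: value2\nsomething2\n"
definitions, rest = scan_ignoring_lines(block)
print(definitions)
print(repr(rest))
```

With this input, both definitions are collected and only something2 is left over for further parsing; something1 ends up in neither result, which matches the behaviour described above.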

Moreover, the CommonMark spec says:

A link reference definition cannot interrupt a paragraph.

So in the example test above (with the something1 line uncommented), only the first line should be treated as a link reference definition, and the following lines up to the blank line should be treated as a single paragraph, i.e. as:

<p>something1
[key 2]: value2
something2</p>
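
A line-based approach that follows the quoted rule could look roughly like the sketch below. It is only an assumption about how a fix might be shaped, not the logic that ended up in mistletoe: `DEFINITION_LINE` and `split_definitions` are hypothetical names, and the sketch handles only one-line definitions, paragraphs and blank lines (no titles, no label normalisation, no other block types).

```python
import re

# Hypothetical, simplified pattern: the whole line must be "[key]: value".
DEFINITION_LINE = re.compile(r"^\[([^\]]+)\]:[ \t]*(\S+)[ \t]*$")

def split_definitions(lines):
    """Return (definitions, paragraphs) for a list of input lines.

    A definition is only recognised while no paragraph is open, so it
    can never interrupt a paragraph.
    """
    definitions = {}
    paragraphs = []
    current = []  # lines of the paragraph currently being collected
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the open paragraph, if any.
            if current:
                paragraphs.append(current)
                current = []
            continue
        match = DEFINITION_LINE.match(line)
        if match and not current:
            # Keep the first definition seen for a given key.
            definitions.setdefault(match.group(1), match.group(2))
        else:
            current.append(line)
    if current:
        paragraphs.append(current)
    return definitions, paragraphs

lines = ['[key 1]: value1\n',
         'something1\n',
         '[key 2]: value2\n',
         'something2\n',
         '\n',
         '[key 3]: value3\r\n',
         'something3\n']
definitions, paragraphs = split_definitions(lines)
print(definitions)
print(paragraphs)
```

For the lines from the test, this yields definitions for key 1 and key 3 only, and the first paragraph consists of something1, [key 2]: value2 and something2.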

pbodnar commented 2 years ago

So I've done some analysis, feel free to comment / correct me / suggest a solution. ;) I think I can come up with a fix, but I guess it will require a bit more coding than the previous fixes, since a good part of the current parsing logic will probably have to change...

pbodnar commented 2 years ago

Also found another problem: the number of spaces before the opening [ is not checked in the lines that directly follow a link reference definition. So the following is incorrectly parsed as 2 definitions instead of 1 definition + 1 code block:

[link]: /bla
    [i-am-block-actually]: /foo
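
The missing check could be expressed along these lines. This is a sketch that looks at a single line in isolation, based on the CommonMark rule that a link reference definition may be preceded by at most three spaces of indentation. `DEFINITION_START` and `looks_like_definition` are hypothetical names, and how an indented line interacts with the preceding block is deliberately not modelled here.

```python
import re

# Hypothetical, simplified pattern: at most three leading spaces,
# then a label, a colon and a destination.
DEFINITION_START = re.compile(r"^ {0,3}\[([^\]]+)\]:[ \t]*(\S+)")

def looks_like_definition(line):
    """Return True if the line may start a link reference definition."""
    return DEFINITION_START.match(line) is not None

print(looks_like_definition("[link]: /bla\n"))
print(looks_like_definition("    [i-am-block-actually]: /foo\n"))
```

The first line is accepted, while the second one is rejected because of its four leading spaces.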

pbodnar commented 1 year ago

Resolved by #160.