miyuchina / mistletoe

A fast, extensible and spec-compliant Markdown parser in pure Python.
MIT License
841 stars 119 forks source link

Fix penultimate sections of multiline fragments being ignored by the Markdown renderer #201

Closed Lordfirespeed closed 1 year ago

Lordfirespeed commented 1 year ago

When using the Markdown renderer, if a fragment containing newline characters is encountered, it is split by \n and each line is yielded in turn.

 for fragment in fragments:
    if "\n" in fragment.text:
        lines = fragment.text.split("\n")
        yield current_line + lines[0]
        for inner_line in lines[1:-2]:
            yield inner_line
        current_line = lines[-1]
    else:
        current_line += fragment.text

However, with a quick check in the python REPL we find:

>>> "abcde"[0] 
'a'
>>> "abcde"[1:-2]
'bc'
>>> "abcde"[-1]
'e'

Notably, d, the penultimate element, is being skipped. This can be fixed by changing the bounds of the 'inner' slice from [1:-2] to [1:-1].

coveralls commented 1 year ago

Coverage Status

coverage: 94.113% (+0.04%) from 94.073% when pulling 7aaad0d115800e9894e163a6876debe1841c87cd on Lordfirespeed:patch-1 into 70e8e8dbc24b836e554f46e7f23c2cabc7dbd77a on miyuchina:master.

pbodnar commented 1 year ago

@Lordfirespeed, good catch, thank you. Would you mind to also add a unit test on this? I'm a little bit struggling to find a use case where a Fragment would be created with text including \n, i.e. when the problematic part of code would be called "in real life". (Or maybe @anderskaplan as the author of MarkdownRenderer will know? :))

Lordfirespeed commented 1 year ago

My case is a bit of an odd one, apologies!

I wanted to parse LaTeX so that I could use regex inside latex blocks to replace command sequences e.g.

Because I was trying to migrate my notes from KaTeX to MathJax.

I 'wrote' this very basic class:

from mistletoe.markdown_renderer import MarkdownRenderer
from mistletoe.latex_token import Math
from itertools import chain

class MathMarkdownRenderer(MarkdownRenderer):
    def __init__(self, *extras, **kwargs):
        super().__init__(
            *chain(
                (Math,),
                extras,
            ),
            **kwargs
        )

    def render_math(self, token):
        return self.render_raw_text(token)

And used it like so:

from mistletoe import Document as MarkdownDocument
from math_markdown_renderer import MathMarkdownRenderer

with MathMarkdownRenderer() as renderer:
    with open(file, "r") as infile:
        document = MarkdownDocument(infile)
    # I made changes to the parsed document here
    render = renderer.render(document)

I found that multiline LaTeX would almost always contain newline characters, e.g. the following:

$$
a=b
$$

would yield a Math token with the following content:

"$$\na=b\n$$"

So, in short, yes, I can add a unit test; but not without also adding functionality to the library 😅

anderskaplan commented 1 year ago

Good catch!

The easiest way to trigger the bug in a unit test would be to modify one of the existing tests. For example:

    def test_images_and_links(self):
        input = [
            "[a link](#url (title))\n",
            "[another link](<url-in-angle-brackets> '*emphasized\n",
            "multi-line\n",
            "title*')\n",
            '![an \\[*image*\\], escapes and emphasis](#url "title")\n',
            "<http://auto.link>\n",
        ]
        output = self.roundtrip(input)
        self.assertEqual(output, "".join(input))

(The only change is that the line "multi-line" has been added. The bug will only trigger when there are more than two rows.)

pbodnar commented 1 year ago

@anderskaplan, thanks a lot, great to have you back! :)

@Lordfirespeed, so I guess you can just add "multi-line\n", to the existing unit test as Anders suggests, and that's it. I would merge the PR afterwards.

Lordfirespeed commented 1 year ago

@anderskaplan, thanks a lot, great to have you back! :)

@Lordfirespeed, so I guess you can just add "multi-line\n", to the existing unit test as Anders suggests, and that's it. I would merge the PR afterwards.

I'm not sure that particular example would cause the bug to happen as [0] and [-1] are both handled specifically, so the multiline fragment needs to contain at least two newlines - but I get what you mean! I'll get on that.

Edit: Oh! sorry, never mind. I was confused.

Lordfirespeed commented 1 year ago

All done! I have not confirmed the unit test functions as intended as I don't have a codespace open at the moment.

pbodnar commented 1 year ago

Thank you!