miyuchina / mistletoe

A fast, extensible and spec-compliant Markdown parser in pure Python.
MIT License
811 stars 113 forks source link

Markdown renderer #162

Closed anderskaplan closed 1 year ago

anderskaplan commented 1 year ago

This is a pull request for a Markdown-to-Markdown renderer (#4).

A few notes on the implementation:

  1. Since the aim is to preserve as much of the formatting as possible, and also steer clear of issues like if the sequence _**Hello**_ would be rendered as ***Hello*** (which changes its interpretation), I had to save some formatting information from the parsing in the tokens. The formatting info will not appear in the output from the ASTRenderer or from repr().

  2. The Markdown renderer can also do word wrapping. (Because I needed that for a use case.) But that's optional, and when it's not enabled, the goal is still to make the roundtrip Markdown -> parse -> render -> Markdown with as few changes as possible.

  3. The test cases in TestMarkdownRenderer give a pretty good idea of what the renderer is capable of. In particular, note that it handles nested blocks: you can put a code block inside a list inside a block quote if you like. (This is explicitly allowed by the CommonMark spec.)

  4. Three new tokens! Well actually, only one which is completely new and that's the BlankLine. The other two represent link reference definitions (a.k.a. footnotes) and inherit from theFootnote token. These tokens are used to represent content that is otherwise not included in the AST.

    An open question is whether they belong in the same module as the markdown renderer class, or if they should be moved somewhere else. Maybe create a new module to host all optional tokens?

  5. The markdown renderer handles span tokens and block tokens differently. Block tokens are rendered into chunks of lines. Span tokens are first rendered into sequences of "fragments", which are essentially strings with some additional metadata attached, and then into chunks of lines.

  6. Tables are supported.

  7. I put the MarkdownRenderer in the main package, not in contrib, because that's what @miyuchina wrote in a comment on #4.

Hope this can be useful and looking forward to some review comments. As it's a fairly large PR, I'd recommend to review it commit by commit. All changes to existing code come first, and then finally the renderer itself.

pbodnar commented 1 year ago

@anderskaplan, thank you for another contribution. :)

I will try to check it out soon, but of course, I cannot promise anything. Hopefully we'll also get feedback from some other contributors this time. (I think it would be ideal to firstly finish all the current smaller changes (when compared to this one) and finally release version 1.0.0 and this one could go for example into 1.1.0, but we'll see...)

7. The markdown renderer has a main method. It can be convenient, since the renderers in contrib aren't easily run from the regular mistletoe cli. Not sure if it should be kept. (Or whether the markdown renderer should stay in the contrib directory or be part of core?)

Yeah, I'm not a big fan of the "contrib" folder as well, I would probably vote for getting rid of it ASAP, see #101 for more details & discussion.

karlicoss commented 1 year ago

hey, any updates on this? :)

anderskaplan commented 1 year ago

Hi @karlicoss, this branch is dependent on #172. As soon as that PR has been merged, I'll be able to prepare this one for showtime.

coveralls commented 1 year ago

Coverage Status

coverage: 90.359% (+1.2%) from 89.174% when pulling 61cec4a5d5c12d237afd9b6f9bc32c0d63096dbb on anderskaplan:markdown-renderer into e853a0def73f4d4c2be5bf16ef4ef4135df0d556 on miyuchina:master.

geolta commented 1 year ago

Hi, I had no deep look into the code but I hope that someone else knows the answer: Is implementing a (live) syntax highlighter based on mistletoe possible after these changes? As far as I have seen, it is not the positions that are saved (which would be best for my specific case), but the syntax elements like **. But that shouldn't be a problem anyway, since the Markdown renderer should be able to be modified so that the positions can be pulled out of there again, right?

And: Really thank you all for this great Markdown parser!

anderskaplan commented 1 year ago

Hi @geolta , it depends a lot what you mean by "(live) syntax highlighter". I suppose "live" means that it highlights while the user is typing. But what is it you want to highlight -- is it for example headings, emphasized text, and the like? And is this for a user interface written in python? If the answer is yes in both cases, then I think mistletoe should be a good match.

pbodnar commented 1 year ago

It's been a while since the beginning of this PR. I think that we could finish just the minor cleanups and put this long awaited feature into the very next release version, with a warning that the renderer's API is in a "preview" state and it might change in a future version. What do you think?

anderskaplan commented 1 year ago

Sorry for the slow responses lately @pbodnar -- It's been a very busy May.

I fully agree that it would be very nice to get this into the next major release, and I'll try and resolve the remaining comments as soon as possible.

anderskaplan commented 1 year ago

This question should also be addressed before merging, imho:

Four new tokens! An open question is whether they belong in the same module as the markdown renderer class, or if they should be moved somewhere else. Maybe create a new module to host all optional tokens?

I think moving all optional tokens into a module could be pretty neat. What do you say @pbodnar ?

pbodnar commented 1 year ago

This question should also be addressed before merging, imho:

Four new tokens! An open question is whether they belong in the same module as the markdown renderer class, or if they should be moved somewhere else. Maybe create a new module to host all optional tokens?

I think moving all optional tokens into a module could be pretty neat. What do you say @pbodnar ?

Or maybe 3 new tokens currently if I look correctly (Fragment class not included)? I think I like them where they are now, but maybe if you give me a better idea of for example their reuse from other (future) modules, I could change my mind? :)

anderskaplan commented 1 year ago

This question should also be addressed before merging, imho:

Four new tokens! An open question is whether they belong in the same module as the markdown renderer class, or if they should be moved somewhere else. Maybe create a new module to host all optional tokens?

I think moving all optional tokens into a module could be pretty neat. What do you say @pbodnar ?

Or maybe 3 new tokens currently if I look correctly (Fragment class not included)? I think I like them where they are now, but maybe if you give me a better idea of for example their reuse from other (future) modules, I could change my mind? :)

Three they are. I think they were four at some point, but the last one turned out not to be necessary.

In any case, if you think it's ok that they stay where they are, then I'll just leave them there.

anderskaplan commented 1 year ago

There, I believe all comments are addressed now. 😄

anderskaplan commented 1 year ago

Here we go!