trentm / python-markdown2

markdown2: A fast and complete implementation of Markdown in Python
Other
2.66k stars 433 forks source link

Custom extras #519

Closed Crozzers closed 8 months ago

Crozzers commented 1 year ago

This PR closes #382 by adding support for custom extras. @JeffAbrahamson, @Einenlum and @kjaymiller, you've all expressed interest in this feature, so I'd love to hear any thoughts from you all.

How it works

Extras

Extras must inherit from the new Extra class. See the Strike extra for a basic example of the implementation. Each extra must define the following properties:

Extra options

The new extras support passing in extra parameters, like with existing extras. These are passed to the initialiser of the extra via the extras dict in the Markdown class (see _setup_extras)

Stages

Converting markdown to HTML happens in distinct stages, such as preprocessing, hashing HTML, processing lists, etc. These distinct stages are given numeric IDs in the Stage class which denote the order they run in.

The Markdown class methods that correspond to each stage get decorated with @Stage.mark. For example:

class Markdown:
    ...
    @Stage.mark(Stage.PREPROCESS)
    def preprocess(self, text):
        ...

Now, when preprocess is called, a wrapper function will find all registered extras that are in the same band as this stage and execute them.

Order of execution

As mentioned previously, each stage has its own numeric ID, each 100 apart from the last. Calling Stage.before/after(Stage.WHATEVER) will return that stage's ID minus/plus 5. Conescutive calls increment an internal counter and will continue subtracting/adding 5. For example:

Stage.PREPROCESS  # 50
Stage.after(Stage.PREPROCESS)  # [55]
Stage.after(Stage.PREPROCESS)  # [60]

This means that multiple extras can be registered for the same stage and will be executed in the order in which they were registered.

Known flaws

Crozzers commented 1 year ago
  • Registering extras is a bit of a mess at the moment
    • All subclasses of Extra are auto-detected and registered during _setup_extras. I intend to change so extras must explicitly register themselves

Changed in 5478960 so that extras must manually registered by calling the Extra.register function. During Markdown._setup_extras it will now iterate over all registered extras and instantiate any that are found in self.extras using the parameters from self.extras

I have also dropped Python 3.5 support since it was throwing errors in the test suite due to some type hints. Python 3.5 was EOL 2 years and 9 months ago so this shouldn't be too much of an issue

nicholasserra commented 1 year ago

Somehow this slipped through the cracks for me. I'll get it into my queue

Crozzers commented 1 year ago

No worries! Might also be worth pulling in some downstream project maintainers for feedback. Off the top of my head, @mhils (pdoc) and @rouilj (roundup) come to mind. Since both of these projects use the library, I imagine there will be some ideas for what the feature should look like.

rouilj commented 1 year ago

Thanks for the ping.

Markdown2 is just one of three supported markdown engines. Each is configured a little differently with different capabilities. The things I would use this for (recognizing autolinked phrases like issue1) etc are done by changing the text seen by markdown2 so it includes the proper markdown syntax for the links.

So I try to do all customization via markdown syntax. I do enable options (e.g. fenced blocks, line breaks on newline) that are supported on all the engines.

That said, I do have custom code for markdown and mistune that handle setting the rel attributes on links or try to make embedded html safer.

Not sure if the customization methods of other engines may be useful feedback. But you can see what other contributers have done at:

https://github.com/roundup-tracker/roundup/blob/master/roundup/cgi/templating.py#L122-L235

mhils commented 1 year ago

Thanks for the heads-up! This looks like a sensible refactor, no complaints really. :)

From an API standpoint I'm not sure if having a dedicated test method is particularly useful (extras could just do this check as the first part of run if they want to improve performance?`), but overall this all looks good.

Crozzers commented 1 year ago

Apologies for the slow reply, Thanks all for the feedback!


@rouilj I took a look at the examples you linked, and tried my hand at implementing a few of them. All worked out pretty well, although a bit more manual work was needed since this library lacks facilities like the TreeProcessor base class. Instead of iterating over a list of elements, I parsed the output text for <a href... and replaced the hashed links. Not nearly as elegant but definitely achievable.

The developer experience could be made even better by perhaps implementing a LinkPostProcessor base class that finds all <a> tags, parses the HREF and REL attrs and passes it to an internal sub method that can edit the text for final output.

class LinkPostProcessor(Extra, ABC):
  order = Stage.after(Stage.LINKS)
  def test(self, text):
    return True
  @abstractmethod
  def sub(self, match):
    ...
  def run(self, text):
    return re.sub(r'(<a.*?)(?P<href>href="md5-[a-z0-9]{32}")(.*>)', self.sub, text)

class MyExtra(LinkPostProcessor):
  name = 'my-extra'
  def sub(self, text):
    return f'[REMOVED A LINK WITH WITH HREF {match.group("href"}]'

I could see such convenience classes being quite useful for quickly creating extras. Would likely implement as and when needed, probably to simplify existing extras that function in similar ways. It's not a direct replacement for TreeProcessor and the like but hopefully that sort of idea will do.


@mhils The separated test method was something that I saw in Markdown's extension API docs and replicated here. Personally I think separating this functionality from the run function is cleaner and the explicit method is a good reminder to think about performance. That said, if one didn't care for performance, including a def text(*_): return True method for each extra would be tedious so perhaps having the base implementation always return True with the option for overriding it would be a good compromise?

mhils commented 11 months ago

Thanks @Crozzers! 🍰

Personally I think separating this functionality from the run function is cleaner and the explicit method is a good reminder to think about performance.

This sounds reasonable to me - I think an explicit reminder may be a good idea. :) In that case it should probably not have a default implementation, otherwise that defeats that purpose. 😄

We'd have our first real use case for custom extras over in https://github.com/mitmproxy/pdoc/issues/639, so while no hurries please I'm looking forward to this landing at some point. 😃

Crozzers commented 10 months ago

@nicholasserra would it be alright to revisit this PR? With all the recent changes to different extras it's getting a bit cumbersome to keep porting them back to this branch. There are also a couple of issues that would benefit from this:

Many thanks

nicholasserra commented 10 months ago

Sure thing i'll get this on my priority list thank you

nicholasserra commented 10 months ago

This is looking pretty fancy to me. I like it. Anythings better than the free for all we have right now. Still wrapping my head around the numerical aspect of the stages though. What's that getting us that something like an ordered list or dict isn't? I need to read through the code a bit longer to properly grasp what's happening.

But overall I think this should be great and we'll land it as-is or with minor tweaks. Just give me a little more time to think on it.

Crozzers commented 9 months ago

Still wrapping my head around the numerical aspect of the stages though. What's that getting us that something like an ordered list or dict isn't?

An ordered list would be simpler and easier and would dodge this flaw in the number system:

Calling Stage.before 10 times on the same stage will result in extras overflowing into the next stage

The only issue is: if an extra is tied to a particular stage and that stage gets skipped for some reason, do we still want to run that extra?

With the numbering system, we can say that anything where 100 <= order <= 200 is tied to HASH_HTML and we can skip them. With an ordered list, it would look like this:

[Stage.PREPROCESS, FencedCodeBlocks, Stage.HASH_HTML, FencedCodeBlocks, Stage.LINK_DEFS]

FencedCodeBlocks is registered as before LINK_DEFS and after PREPROCESS, but with an ordered list we can't tell and therefore can't skip it.

We could instead have ordered lists for each extra. For example:

# I've manually filled out the lists here but this would be filled by calling `Stage.before/after`
class Stage:
    # tuple containing the `before` and `after` lists respectively. Could be done as a namedtuple
    PREPROCESS = ([], [FencedCodeBlocks])
    HASH_HTML = ([], [])
    LINK_DEFS = ([FencedCodeBlocks], [])

This solves the issue of numbers overflowing whilst still allowing us to tie extras to particular stages. The only downside is if an extra sets its order as Stage.before(FencedCodeBlocks), it's a bit more tedious to insert it in all the right lists, but this only happens at startup so isn't a huge concern.

I'll play around with this and see what happens

Crozzers commented 9 months ago

@nicholasserra, I've refactored the ordering system for extras. I've moved most of the ordering logic into the Extra class and made Stage into a proper enum. The big change is that the numbering system is now gone.

The type of the Extra.order attribute is now a tuple of two iterables. This represents "before" and "after". EG:

class MyExtra(Extra):
  # run before preprocessing, after HASH_HTML and after the FencedCodeBlocks extra is run
  order = [Stage.PREPROCESS], [Stage.HASH_HTML, FencedCodeBlocks]

When you call MyExtra.register, it will calculate when this extra should be triggered and store this info in Extra._exec_order. When all the extras are registered, it looks something like this:

{<Stage.PREPROCESS: 1>: (
                         [],  # before
                         [<class 'markdown2.Mermaid'>,  # after
                          <class 'markdown2.Wavedrom'>,
                          <class 'markdown2.FencedCodeBlocks'>]),
 <Stage.HASH_HTML: 2>: (
                         [],  # before
                         [<class 'markdown2.MarkdownInHTML'>]),  # after
[...]
 <Stage.ITALIC_AND_BOLD: 14>: (
                         [<class 'markdown2.Underline'>,  # before
                          <class 'markdown2.Strike'>,
                          <class 'markdown2.MiddleWordEm'>,
                          <class 'markdown2.CodeFriendly'>],
                         [<class 'markdown2.Breaks'>,  # after
                          <class 'markdown2.MiddleWordEm'>,
                          <class 'markdown2.CodeFriendly'>,
                          <class 'markdown2.TelegramSpoiler'>])}

As mentioned previously, the Stage class is now an IntEnum containing the different stages. This was done so that extras could check the current stage of execution, and whether they were being called before/after a particular stage. This is useful for extras that need to be called before AND after a stage.

Calling Extra.deregister now also removes it from Extra._exec_order, allowing users to potentially change the order of an extra on the fly.

Finally, in the changelog I've put these changes under the most recent header, which is 2.4.13, but I suspect this change may warrant a bump to 2.5.0

nicholasserra commented 8 months ago

@Crozzers This is all looking amazing. I think i'm ready to merge this in. Thank you for all the work on this!

I see you have some other open PRs. How would you like to proceed here? Merge this in now, and then fix up the others (if needed), or should I merge those others in and this PR can be rebased? Your call.

I think i'll do a release of 2.4.13 before this lands. Then we can move these changes into 2.5.0 as you recommended. I'll just edit the changelog manually when it comes time.

Crozzers commented 8 months ago

None of the other PRs are particularly high priority and #568 requires some further investigation so I'd say merge this one first and I can update the others if needed. They're small patches anyway so shouldn't need much doing to them.

Thanks for taking the time to review this PR, it ended up as a bit of a monster

nicholasserra commented 8 months ago

Ok then i'm gonna do a release of the current changes, then merge this, then manually bump us to 2.5.0 after