markedjs / marked

A markdown parser and compiler. Built for speed.
https://marked.js.org

Support custom extensions "interrupting" built-in tokens #3435

Open · calculuschild opened this issue 2 weeks ago

calculuschild commented 2 weeks ago

What pain point are you perceiving?

Not sure of the best way to describe this. Currently, custom extensions have the start property, which we use to interrupt the paragraph element. But there are other tokens that are interruptible according to the CommonMark/GFM spec. For example, GFM Tables must end when they encounter another block-level token.

The difficulty comes with enforcing that rule for custom extensions. Say I make a new block-level token via a custom extension:

{{block
....
}}

If this were placed immediately after a Table, the table would just consume it, because the table tokenizer does not interact with the start property the way the paragraph tokenizer does. You could roll your own Table tokenizer that does nothing except add a few more characters to the rules regex, but that seems like a lot of effort just to make your extension compatible with GFM rules.
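For reference, here is a minimal sketch of such an extension, assuming marked's extensions API and the {{block ... }} syntax above (the name, regex, and renderer output are made up for illustration):

```js
import { marked } from 'marked';

const curlyBlock = {
  name: 'curlyBlock',          // hypothetical name
  level: 'block',
  // start() lets this token interrupt a paragraph, but nothing else.
  start(src) {
    return src.match(/\{\{block/)?.index;
  },
  tokenizer(src) {
    const match = /^\{\{block\n([\s\S]*?)\n\}\}/.exec(src);
    if (match) {
      return { type: 'curlyBlock', raw: match[0], text: match[1] };
    }
  },
  renderer(token) {
    return `<div class="curly-block">${token.text}</div>\n`;
  }
};

marked.use({ extensions: [curlyBlock] });
```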

Describe the solution you'd like

I really don't know how this would be implemented, but the goal would be a way for an extension to signal which tokens it can interrupt. Or, maybe better, the other way around: allow a token to specify which types of other tokens can interrupt it.
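Purely as an illustration of what that could look like (nothing like this exists in marked today; the property name is made up):

```js
// Hypothetical only: a way for an extension to declare which built-in
// tokens it is allowed to interrupt.
const curlyBlock = {
  name: 'curlyBlock',
  level: 'block',
  interrupts: ['paragraph', 'table', 'blockquote'], // made-up property
  start(src) {
    return src.match(/\{\{block/)?.index;
  },
  tokenizer(src) {
    /* ... same as the sketch above ... */
  }
};
```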

One thing to consider is that each token is also a little different in terms of when it can be interrupted. Blockquotes can only be interrupted during the "lazy continuation" step. Paragraphs can be interrupted at any time. Tables can only be interrupted if the line does not start with |. Not every token can be interrupted by the same kinds of tokens.

I kind of hacked my way around this for Tables in my own extension, Marked-Extended-Tables, by letting the user supply a "termination" regex that gets appended to the tokenizer's rule and causes the table to stop lexing on that line.

https://github.com/calculuschild/marked-extended-tables/blob/9e56b24598e07de71e225d6c50a50d40c366965f/src/index.js#L23-L25
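Roughly, the idea is to splice a user-supplied pattern into the part of the table rule that decides whether the next line still belongs to the table. A simplified sketch (not the actual code at the link above; the real table rule in marked is much more involved):

```js
// Build a table rule whose body rows stop at a blank line or at a line
// matching the user-supplied termination pattern.
function buildTableBodyRule(terminationRegex) {
  const interrupt = terminationRegex ? `|${terminationRegex.source}` : '';
  return new RegExp(
    // header row, delimiter row, then body rows until a blank or
    // interrupting line
    `^ *\\|.+\\n *\\|? *[-:]+[-| :]*\\n(?:(?! *\\n${interrupt}).+(?:\\n|$))*`
  );
}

// e.g. stop the table when a line starts with "{{"
const tableRule = buildTableBodyRule(/ *\{\{/);
```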

Not sure if this is the easiest way to go about it, but the trickiest part is somehow applying that to the built-in tokens without just ending up rewriting every tokenizer anyway.

Mostly I'm just kind of stumped on any better way to do this.

UziTech commented 2 weeks ago

The way we interrupt a paragraph is by clipping src when passing it into the tokenizer: https://github.com/markedjs/marked/blob/2124b5de1e37416cf60d568f9822c11ef3b2fc89/src/Lexer.ts#L235

We could do something similar with other tokenizers.
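For context, this is roughly what that clipping looks like for paragraphs today, written as a standalone sketch (a paraphrase of the linked Lexer code, not a verbatim copy):

```js
// Before running the paragraph tokenizer, ask every registered startBlock
// function where the next custom block token begins and cut src off there,
// so the paragraph can't run past an interrupting extension.
function clipSrcForParagraph(src, startBlockFns, lexer) {
  let startIndex = Infinity;
  const tempSrc = src.slice(1); // offset by one so a token can't interrupt itself
  for (const getStartIndex of startBlockFns) {
    const tempStart = getStartIndex.call({ lexer }, tempSrc);
    if (typeof tempStart === 'number' && tempStart >= 0) {
      startIndex = Math.min(startIndex, tempStart);
    }
  }
  return startIndex < Infinity ? src.substring(0, startIndex + 1) : src;
}
```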

Although I'm not sure this is needed if we just say built-in tokens take precedence over custom tokens. In well-formatted markdown, every block token should be separated by a blank line. The only reason start is actually needed is for inline tokens.

UziTech commented 2 weeks ago

For example, the katex extension's block tokenizer does not have a start function, because we expect a blank line before it, so even a paragraph takes precedence.

https://github.com/UziTech/marked-katex-extension/blob/main/src/index.js#L63
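In other words, something shaped like this (a hedged sketch, not the actual extension source; the regex and renderer are simplified):

```js
// A block-level extension with no start() at all: it only matches when the
// lexer is already at the start of a new block, i.e. after a blank line.
const blockMath = {
  name: 'blockMath',             // illustrative name
  level: 'block',
  tokenizer(src) {
    const match = /^\$\$\n([\s\S]+?)\n\$\$(?:\n|$)/.exec(src);
    if (match) {
      return { type: 'blockMath', raw: match[0], text: match[1] };
    }
  },
  renderer(token) {
    return `<p class="math-block">${token.text}</p>\n`;
  }
};
```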

calculuschild commented 2 weeks ago

> The way we interrupt a paragraph is by clipping src when passing it into the tokenizer

I remember. I wrote that. 😜

> In well-formatted markdown, every block token should be separated by a blank line.

Pretty markdown might, but the specs still make it clear that it is valid to place certain block tokens directly against each other (demo example).
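For instance, per the GFM spec a table ends at the first blank line or at the beginning of another block-level structural element, so input like this is valid even with no blank line in between (illustrative only; the demo link above shows the real cases):

```js
import { marked } from 'marked';

// A table followed directly by a heading: per the spec, the heading begins
// another block and should end the table rather than be swallowed as a row.
const md = [
  '| a | b |',
  '| - | - |',
  '| 1 | 2 |',
  '# not a table row',
].join('\n');

console.log(marked.parse(md));
```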

> The only reason start is actually needed is for inline tokens.

Remember, we have separate handling for paragraphs and inline text. Paragraphs are clipped by block tokens (https://github.com/markedjs/marked/blob/2124b5de1e37416cf60d568f9822c11ef3b2fc89/src/Lexer.ts#L237), and inline text is clipped by inline tokens (https://github.com/markedjs/marked/blob/2124b5de1e37416cf60d568f9822c11ef3b2fc89/src/Lexer.ts#L436). They are both needed.
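To illustrate the inline side: an inline-level extension needs start so the lexer knows where to clip the inlineText token before it swallows the custom syntax. A sketch (the :tag: syntax and names are made up):

```js
const inlineTag = {
  name: 'inlineTag',             // made-up example
  level: 'inline',
  // Without this, the preceding inline text would consume the :tag: syntax.
  start(src) {
    return src.match(/:[a-z]+:/)?.index;
  },
  tokenizer(src) {
    const match = /^:([a-z]+):/.exec(src);
    if (match) {
      return { type: 'inlineTag', raw: match[0], text: match[1] };
    }
  },
  renderer(token) {
    return `<span class="tag">${token.text}</span>`;
  }
};
```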

> We could do something similar with other tokenizers.

If we did, I think it would only need to be tables and blockquotes to stay in line with the GFM spec. The other block tokens either have a clear ending symbol (fences) or are allowed to simply absorb other block tokens (lists). Maybe that's not too bad?

UziTech commented 2 weeks ago

> Remember, we have separate handling for paragraphs and inline text. Paragraphs are clipped by block tokens. They are both needed.

The block tokenizer's start function is not needed if you don't need to interrupt a paragraph. Paragraphs are automatically interrupted by blank lines.