micromark / common-markup-state-machine

CMSM: Common markup state machine

Extensions #4

Closed wooorm closed 3 years ago

wooorm commented 4 years ago

This state machine is finite. Markdown has extensions, which is mostly annoying but in some cases hugely useful (GFM, MDX).

We can either a) define most useful extensions and hide them behind flags, b) support hooks for extensions to overwrite states and the like, c) figure out a way to allow backtracking and attempting a list of possibilities, or d) something else?

They all have downsides.

ChristianMurphy commented 4 years ago

a) This option doesn't seem sustainable. The remark maintainers would become gatekeepers for new syntax types. Not all syntax extensions have the same level of completeness and maturity, and representing this would become challenging. For any feature that doesn't get merged, the author's only option would be to fork the entire parser.

b) It may not be an easy approach, but it seems like the most maintainable one. Finite State Machines (FSMs) are made up of states and transitions. As long as the hooks allow adding and removing both states and transitions, new syntax layers should be doable as plugins.
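To make option b) concrete, here is a minimal TypeScript sketch of a machine whose states and transitions can be added and removed through hooks. All names (`Machine`, `addState`, `addTransition`, `TransitionFn`) are hypothetical and are not micromark or CMSM APIs. Trying later-registered transitions first is just one assumed way to let a plugin take precedence over core behavior.

```typescript
// Hypothetical sketch: none of these names come from micromark or CMSM.
type StateName = string;
// A transition looks at one character and names the next state,
// or returns undefined when it does not apply.
type TransitionFn = (char: string) => StateName | undefined;

class Machine {
  private states: Record<StateName, TransitionFn[]> = {};

  private has(name: StateName): boolean {
    return Object.prototype.hasOwnProperty.call(this.states, name);
  }

  addState(name: StateName): void {
    if (!this.has(name)) this.states[name] = [];
  }

  removeState(name: StateName): void {
    delete this.states[name];
  }

  // Later-added transitions are tried first, so a plugin can override core.
  addTransition(from: StateName, transition: TransitionFn): void {
    if (!this.has(from)) throw new Error(`Unknown state: ${from}`);
    this.states[from].unshift(transition);
  }

  removeTransition(from: StateName, transition: TransitionFn): void {
    if (!this.has(from)) return;
    this.states[from] = this.states[from].filter((t) => t !== transition);
  }

  // Feeds the input one character at a time and returns every state visited.
  run(start: StateName, input: string): StateName[] {
    if (!this.has(start)) throw new Error(`Unknown state: ${start}`);
    const visited: StateName[] = [start];
    let current = start;
    for (let i = 0; i < input.length; i++) {
      let next: StateName | undefined;
      for (const transition of this.states[current]) {
        next = transition(input.charAt(i));
        if (next !== undefined) break;
      }
      if (next === undefined) {
        throw new Error(`No transition from ${current} at index ${i}`);
      }
      if (!this.has(next)) {
        throw new Error(`Transition to missing state: ${next}`);
      }
      visited.push(next);
      current = next;
    }
    return visited;
  }
}

// "Core" states.
const machine = new Machine();
machine.addState("text");
machine.addState("emphasis");
machine.addTransition("text", (c) => (c === "*" ? "emphasis" : "text"));
machine.addTransition("emphasis", (c) => (c === "*" ? "text" : "emphasis"));

// A hypothetical plugin adds a "math" state that `$` enters and leaves.
machine.addState("math");
machine.addTransition("text", (c) => (c === "$" ? "math" : undefined));
machine.addTransition("math", (c) => (c === "$" ? "text" : "math"));

const trace = machine.run("text", "a$b$");
```

Note that `run` fails loudly when a transition points at a state that has since been removed, which is exactly the situation the first consideration below asks about.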

The hooks approach would raise a couple of considerations:

  1. How should transitions to states that have been removed be handled?
    • One option would be a utility/linter that checks the connected-ness of the state graph
    • Other ideas?
  2. Would plugin dependencies be a concern that should be handled? (e.g. a syntax extension that depends on MDX states and transitions, which are themselves syntax extensions)
    • One option could be npm peer dependencies and notes in the documentation.
    • Other ideas?
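As a rough illustration of the utility/linter idea from the first consideration, the sketch below checks a state graph for transitions that point at missing states and for states that cannot be reached from the start state. The graph shape (`StateGraph`) and the function name (`lintStateGraph`) are assumptions made for this example, not anything CMSM defines.

```typescript
// Hypothetical graph shape: each state lists the states it can move to.
type StateGraph = Record<string, string[]>;

interface LintReport {
  dangling: string[]; // "from -> to" edges whose target state does not exist
  unreachable: string[]; // states that cannot be reached from the start state
}

function hasOwn(graph: StateGraph, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(graph, key);
}

function lintStateGraph(graph: StateGraph, start: string): LintReport {
  // Pass 1: every edge must point at a state that still exists.
  const dangling: string[] = [];
  for (const from of Object.keys(graph)) {
    for (const to of graph[from]) {
      if (!hasOwn(graph, to)) dangling.push(`${from} -> ${to}`);
    }
  }

  // Pass 2: breadth-first search from the start state.
  const seen: Record<string, boolean> = Object.create(null);
  const queue: string[] = [];
  if (hasOwn(graph, start)) {
    seen[start] = true;
    queue.push(start);
  }
  while (queue.length > 0) {
    const state = queue.shift() as string;
    for (const target of graph[state]) {
      if (hasOwn(graph, target) && !seen[target]) {
        seen[target] = true;
        queue.push(target);
      }
    }
  }

  const unreachable = Object.keys(graph).filter((state) => !seen[state]);
  return { dangling, unreachable };
}

// Example: a plugin removed "math" but left an edge to it, and nothing
// leads into "mdxExpression" any more.
const lintResult = lintStateGraph(
  {
    text: ["emphasis", "math"],
    emphasis: ["text"],
    mdxExpression: ["text"],
  },
  "text"
);
```

Here `lintResult.dangling` holds the edge `text -> math` and `lintResult.unreachable` holds `mdxExpression`. A check like this could run when plugins are registered rather than while parsing.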

c) Maybe. As discussed in https://github.com/micromark/micromark/issues/9, backtracking can easily lead to performance issues, which seems to go against the stated goal of micromark.

wooorm commented 4 years ago

> a) This option doesn't seem sustainable. The remark maintainers would become gatekeepers for new syntax types.

I would like to add that there are a couple of important syntax extensions: frontmatter, GFM, and MDX. Syntax extensions lead to different implementations, which leads to Markdown being less portable to other vendors, which is annoying! Markdown already has HTML as a place for extensions, and as unified we are also pushing MDX. Maybe standardising a couple of optional features and not allowing everything is a good thing for Markdown?

ChristianMurphy commented 4 years ago

Having some default extensions hidden behind flags could be fine, as long as there is a way to add syntax that is not bundled with micromark core.

wooorm commented 4 years ago

Could you expand on why you think that is important?

Other languages aren’t like this. In JS, CSS, or HTML it isn’t normal to do non-standard stuff (there are languages on top of them though, but those are implemented in new parsers)

ChristianMurphy commented 4 years ago

> Could you expand on why you think that is important?

Many remark plugins hook into the tokenizer to provide new syntax. I want these projects to be able to safely upgrade to the new micromark-based remark.

> Other languages aren’t like this. In JS, CSS, or HTML it isn’t normal to do non-standard stuff (there are languages on top of them though, but those are implemented in new parsers)

Depends on what you mean, there are tools for these languages that offer pluggable parsers, for example:

wooorm commented 4 years ago

Other languages:

Babel has syntax plugins but those pass an option to the parser (essentially flow: true).

PostCSS has different parsers, that are different projects that transform to the PostCSS AST.


For the format: extensions make the format not portable. I think this holds the Markdown format back, and I think that we are in a position to move Markdown forward.

For the current extensions API: it isn’t very nice, it feels hackish, the code for plugins looks a bit spaghetti/buggy too.

One interesting idea is Generic directives/plugins. TLDR:

Note that some things such as remark-breaks can be done on a CST.

Say we’d support frontmatter, GFM, MDX, and these generic extensions, are other things really needed?

So frontmatter would be specced/provided by us, shortcodes and attrs could be a generic extension, and lastly zmarkdown could be a fork (like how gfm is a fork of cmark).

ChristianMurphy commented 4 years ago

That could work, thanks for outlining your idea so clearly @wooorm! :bowing_man: @vhf and @djm this would most directly impact your projects, thoughts? :thought_balloon:

vhf commented 4 years ago

I see Markdown as something much easier to learn and to write, and less powerful, than HTML. Custom syntax elements keep those benefits, as opposed to mixing HTML into Markdown.

I like the Babel approach and, until now, the Remark approach of writing plugins that can hook into any part of the parsing/compiling process. My preference would be to keep it that way. If the consensus is to go in another direction I'll adapt though; I'm not the one doing the hard work on micromark. :)

Forking micromark would be an option for my projects. One of the costs (which you could see as a benefit if custom syntax is holding Markdown back) is that we won't be able to create a new tool by cherry-picking a few libs and plugins and composing them together.

IIRC we have Gatsby as a sponsor, and a few Gatsby contributors are also unified/remark/rehype contributors. Unfortunately I don't know who to ping. I think their perspective on this would also be of interest, as their project would be impacted (example) as well.

wooorm commented 4 years ago

Nobody has said this yet, but I think it’s worth mentioning that I don’t see any way in which current plugins that integrate with remark-parse could work as-is with micromark. Even if micromark has extensions, they’d need to be rewritten entirely. This does not affect transformer plugins (mdast remains the same, I think).


Thanks Victor!

> I like the Babel approach

Do you have an example of how Babel allows custom syntax? What I found is that Babel has syntax plugins but those pass an option to the parser (essentially flow: true).

Oh and a question: could the whole zestedesavoir content be converted from its older custom syntax, to a new standard? We could have “codemods” that take remark-extension-markdown and port it to micromark-generic-directive-markdown?

I’ll post the Gatsby comment below so it’s a separate link

wooorm commented 4 years ago

Going through all gatsby-remark plugins in the Gatsby monorepo gives us gatsby-remark-katex and gatsby-remark-custom-blocks that integrate with remark-parse.

gatsby-remark-custom-blocks could be (should be?) changed to the generic directives syntax:

:::name[inline-content]{key=val}
contents, which are sometimes further block elements
:::
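For readers unfamiliar with the generic directives shape, here is a small sketch of how the opening line of such a container could be split into its name, label, and attributes. It is a simplification written for this thread: the regular expression, the whitespace-separated `key=val` attribute handling, and the names `parseDirectiveOpening` and `DirectiveOpening` are all assumptions, and the actual proposal covers more cases (quoted values, nesting, and so on).

```typescript
// Hypothetical shape for the parsed opening line of a container directive.
interface DirectiveOpening {
  fence: number; // number of colons in the opening fence
  name: string;
  label: string | undefined; // the [inline-content] part, if present
  attributes: Record<string, string>; // the {key=val} part, if present
}

function parseDirectiveOpening(line: string): DirectiveOpening | undefined {
  // :::name            three or more colons followed by a name
  // [label]            optional, no nested brackets in this sketch
  // {key=val other=x}  optional, whitespace-separated pairs in this sketch
  const pattern = /^(:{3,})([A-Za-z][\w-]*)(?:\[([^\]]*)\])?(?:\{([^}]*)\})?\s*$/;
  const match = pattern.exec(line);
  if (!match) return undefined;

  const attributes: Record<string, string> = {};
  const rawAttributes = match[4] || "";
  for (const pair of rawAttributes.split(/\s+/)) {
    if (pair === "") continue;
    const equals = pair.indexOf("=");
    if (equals === -1) {
      attributes[pair] = ""; // bare attribute without a value
    } else {
      attributes[pair.slice(0, equals)] = pair.slice(equals + 1);
    }
  }

  return {
    fence: match[1].length,
    name: match[2],
    label: match[3],
    attributes,
  };
}

const openingLine = parseDirectiveOpening(":::name[inline-content]{key=val}");
const closingLine = parseDirectiveOpening(":::");
```

With the example above, `openingLine` has the name `name`, the label `inline-content`, and the attribute `key` set to `val`, while the bare closing fence does not parse as an opening and yields `undefined`.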

Math 🤔🤷‍♂️ It’s pretty common to support $foo$. Generic directives syntax would give :math[foo]. Could work as well?
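As a rough idea of what a codemod from `$foo$` to `:math[foo]` could look like, here is a naive sketch. The function name `dollarMathToDirective` is made up, and the matching rules are assumptions: content must not start or end with whitespace (so prices like `$5 and $10` are left alone), must stay on one line, and is skipped when it contains square brackets, since those would need escaping rules that this sketch does not attempt.

```typescript
// Naive, hypothetical codemod: rewrite inline `$...$` math as `:math[...]`.
function dollarMathToDirective(text: string): string {
  // `$`, then content that starts and ends with a non-space character and
  // contains no `$` or line break, then a closing `$`.
  const inlineMath = /\$(\S(?:[^$\n]*\S)?)\$/g;
  return text.replace(inlineMath, (whole: string, content: string) => {
    // Brackets inside the content would clash with the directive label.
    if (content.indexOf("[") !== -1 || content.indexOf("]") !== -1) {
      return whole;
    }
    return ":math[" + content + "]";
  });
}

const converted = dollarMathToDirective("Pythagoras: $a^2 + b^2 = c^2$.");
const untouched = dollarMathToDirective("It costs $5 and $10.");
```

Here `converted` becomes `Pythagoras: :math[a^2 + b^2 = c^2].` and `untouched` comes back unchanged. A real migration would have to work on a syntax tree rather than raw text so that code spans and code blocks are not rewritten.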

Gatsby folks, what do you think about the remark ecosystem dropping support for any extensions? And instead, supporting only a couple (frontmatter, GFM, MDX, and generic directives)?

/cc @johno @ChristopherBiscardi @sidharthachatterjee

vhf commented 4 years ago

> Do you have an example of how Babel allows custom syntax? What I found is that Babel has syntax plugins but those pass an option to the parser (essentially flow: true).

I don't, sorry. Disregard this comment: it's based on what I remember from contributing to Babel. I could be wrong about that, and it was Babel 6.x, 4 years ago 😱

> Oh and a question: could the whole zestedesavoir content be converted from its older custom syntax, to a new standard?

Unfortunately not, for two main reasons:

  1. 15k users have been relying on the current syntax for over 5 years (I personally wouldn't take on the task of convincing people to switch to a new syntax)
  2. There are currently 334k pieces of content written using the current syntax. For obvious reasons this content is stored twice in an RDBMS: once as Markdown, once as rendered HTML. Converting the Markdown and rerendering the HTML would probably be time-consuming in and of itself, even more so because I'm pretty sure it would be error-prone (old content, weird glitches, etc.).

I'd say we would either fork to reimplement the syntax, or stay on our current stack and possibly maintain whatever dependency becomes deprecated upstream if need be. What we currently have is pretty stable and I still see it as a viable option. :)

wooorm commented 3 years ago

These are now supported in micromark. I’ll add more on how they work in cmsm later.