Inconsistent with Commonmark Spec

ikatyang commented 6 years ago

Thanks for the awesome package! I was able to implement markdown support for prettier using remark-parse (prettier/prettier#2943).

While implementing the pretty printer, I found some cases that are parsed incorrectly according to the CommonMark spec 0.25:

(The commonmark option is linked to Commonmark Spec 0.25, so I guess it's based on 0.25?)

wooorm commented 6 years ago

Yes, that’s unfortunately correct. There are a few cases where remark differs from CommonMark.

remark-html tests for CommonMark compliance, but skips failing tests. You can run its tests to see the differences.

Over the time CommonMark has been developed, a lot has changed, so more problems have arisen and I haven’t been able to keep up. Initially, when I added CommonMark support, complete CommonMark compatibility wasn’t a goal (because CommonMark wasn’t all that common).

CommonMark is still in beta (it doesn’t have a major version yet). Semver states:

Major version zero (0.y.z) is for initial development. Anything may change at any time. The public API should not be considered stable.

I’d like to add 100% CommonMark compat, but that involves a rewrite of the parser, and that takes a lot of time. In the future I envision supporting just CommonMark with remark-parse, and moving the GitHub extensions (now that GitHub has moved to CommonMark as a base) to another project (remark-parse-github, remark-gfm?).

Anyway, I don’t have the bandwidth to do it myself currently, but I’d like to assist anyone who’s interested in attempting it!

geyang commented 6 years ago

@wooorm Also, there are quite a few CommonMark decisions that make it hard to treat it as an AST. It would totally make sense to start another standard called "GoodMark" that makes CommonMark more regular and less contextual, and have a standard interpolation with embedded HTML, JSX, LaTeX and other languages.

For example, CommonMark's handling of HTML tags is quite annoying. It doesn't treat markdown and HTML as a tree, but as segments of HTML strings that are later strung together and then parsed by the HTML parser. This means <pre> tags and <div> tags are treated differently, and text in between HTML tags is sometimes parsed as markdown but sometimes not.
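To make the <pre> versus <div> difference concrete, here is a minimal Python sketch of two of the seven HTML block kinds in the CommonMark spec. It is an illustration only, not remark's or any other parser's code: the function name `html_block_length` is made up for this example, the tag list for the <div>-style kind is abbreviated, and the other five kinds are ignored. As far as I read the spec, a <pre>-style block runs until a line containing the closing tag, while a <div>-style block stops at the first blank line.

```python
import re

# Simplified from the CommonMark HTML block rules. Only two of the seven
# start conditions are modelled, and the tag list below is abbreviated.
RAW_TEXT_START = re.compile(r"^ {0,3}<(pre|script|style)(\s|>|$)", re.IGNORECASE)
RAW_TEXT_END = re.compile(r"</(pre|script|style)>", re.IGNORECASE)
BLOCK_TAG_START = re.compile(
    r"^ {0,3}</?(div|p|table|ul|ol|section)(\s|/?>|$)", re.IGNORECASE
)


def html_block_length(lines):
    """Return how many leading lines belong to an HTML block starting at lines[0].

    Returns 0 when lines[0] does not start one of the two modelled kinds.
    (Hypothetical helper, written for this illustration.)
    """
    if not lines:
        return 0
    first = lines[0]
    if RAW_TEXT_START.match(first):
        # <pre>-style: raw until a line that contains the closing tag.
        for i, line in enumerate(lines):
            if RAW_TEXT_END.search(line):
                return i + 1
        return len(lines)
    if BLOCK_TAG_START.match(first):
        # <div>-style: raw until the line before the first blank line.
        for i, line in enumerate(lines):
            if line.strip() == "":
                return i
        return len(lines)
    return 0


pre_doc = ["<pre>", "", "*hello*", "", "</pre>"]
div_doc = ["<div>", "", "*hello*", "", "</div>"]
div_tight = ["<div>", "*hello*", "</div>"]

print(html_block_length(pre_doc))    # 5: everything up to </pre> stays raw
print(html_block_length(div_doc))    # 1: only <div> is raw, *hello* is markdown
print(html_block_length(div_tight))  # 3: no blank line, so *hello* stays raw
```

So whether `*hello*` becomes emphasis depends both on the tag and on the blank lines around it, which is the irregularity described above.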

mb21 commented 5 years ago

+1 for Commonmark compliance!

To counter @episodeyang's comment, since it's got a few upvotes:

there are quite a few CommonMark decisions that make it hard to treat it as an AST.

uh, no? copying from https://spec.commonmark.org/0.29/#about-this-document

this document describes how Markdown is to be parsed into an abstract syntax tree

It would totally make sense to start another standard called "GoodMark" that makes CommonMark more regular and less contextual

well, then it wouldn't have a lot in common with markdown anymore though. I agree that markdown is not the easiest language to parse... so yes, we could all just switch to RST or something, but that's not the point of this discussion.

have a standard interpolation with embedded HTML, JSX, LaTeX and other languages

There are definitely markdown parsers that have extensions that do that very well, for example https://pandoc.org/MANUAL.html#generic-raw-attribute

For example, CommonMark's handling of HTML tags is quite annoying.

I concede that raw HTML inside markdown sometimes parses to surprising and weird results. But that's always been the case with markdown, in all markdown parsers, and the point of commonmark is exactly that at least different parsers could agree on which weird way. It's quite tricky to come up with a solution that works for most of the markdown out in the wild.

It doesn't treat markdown and HTML as a tree, but as segments of HTML strings that are later strung together and then parsed by the HTML parser.

Commonmark conceptually does parse markdown to a tree (see the quote above about the AST), although implementations may choose not to materialize that tree. And yes, that tree does not include all the HTML elements as specific nodes (otherwise, markdown would have to be a superset of the entire HTML specification), but instead has a node type for raw HTML blocks.
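As a rough illustration of such a tree, here is a hand-written Python sketch for the input `<div>`, blank line, `*hello*`, blank line, `</div>`. The node type names (root, html, paragraph, emphasis, text) follow mdast, the tree format remark uses. The `Node` class and `node_types` helper are made up for this example, and the tree is typed out by hand rather than produced by a parser.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal tree node: a type, an optional raw value, children."""
    type: str
    value: str = ""
    children: List["Node"] = field(default_factory=list)


# Hand-written tree for:  <div>  (blank)  *hello*  (blank)  </div>
# The HTML is kept as opaque raw strings; only the markdown is structured.
tree = Node("root", children=[
    Node("html", value="<div>"),
    Node("paragraph", children=[
        Node("emphasis", children=[Node("text", value="hello")]),
    ]),
    Node("html", value="</div>"),
])


def node_types(node):
    """Return the node types of the tree in pre-order (document order)."""
    types = [node.type]
    for child in node.children:
        types.extend(node_types(child))
    return types


print(node_types(tree))
# ['root', 'html', 'paragraph', 'emphasis', 'text', 'html']
```

The two html nodes are siblings of the paragraph, not its parent: the tree knows nothing about the <div> element itself, only that there are two chunks of raw HTML.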

wooorm commented 4 years ago

Heya, just wanted to give an update about micromark, it’s sort-of a new motor that we’ll soon use in remark to parse markdown. It’s not yet 100% ready but will be relatively soon. The good news is, it fixes this issue! (P.S. see this twitter thread for some more info!)

geyang commented 4 years ago

I have switched from physics to machine learning. So hopefully next time we discuss this, I will be training a sequence model that reads the CommonMark spec, and automatically induces this parser :)

wooorm commented 3 years ago

Sorry for the wait! I just wanted to share that there’s now a PR that solves this issue: https://github.com/remarkjs/remark/pull/536.

wooorm commented 3 years ago

This is now released in remark@13.0.0

remarkjs / remark

Inconsistent with Commonmark Spec #306