mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
MIT License

skip yaml (front matter)? #209

Open ec1oud opened 5 months ago

ec1oud commented 5 months ago

Markdown can have yaml headers (yeah it's not in commonmark, maybe not a good idea either... but it happened). I think maybe md4c should skip parsing of "front matter" but also make it available for further parsing with a yaml parser. I.e. if the first line of the file is just three hyphens, that's a special case, not a thematic break, and everything from there until the next line with three hyphens can be assumed to be front matter. (I guess someone was thinking that nobody starts a document with a thematic break, so it can be treated as a special case.) But I'm not sure where the most formal specification is (to verify that it must begin on the first line, must be three hyphens, and so on).
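
The rule described above can be sketched in C roughly as follows. This is only an illustration, not md4c code: the `front_matter` struct and the functions `next_line`, `is_marker_line` and `find_front_matter` are hypothetical names. The sketch assumes that the opening marker must be the very first line, that a marker is exactly three hyphens (trailing spaces, tabs or a CR are tolerated), and that an unterminated block is not front matter. None of these assumptions comes from a formal specification.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: byte offsets into the input buffer. */
typedef struct {
    size_t yaml_off;  /* start of the YAML text, after the opening "---" line */
    size_t yaml_len;  /* length of the YAML text, excluding the closing "---" line */
    size_t body_off;  /* where the Markdown body begins */
} front_matter;

/* Offset of the start of the line following the one at `off` (or `size`). */
static size_t next_line(const char *s, size_t size, size_t off)
{
    while (off < size && s[off] != '\n')
        off++;
    return off < size ? off + 1 : size;
}

/* True if the line starting at `off` is "---" followed only by spaces,
 * tabs or a CR up to the newline or the end of the buffer. */
static int is_marker_line(const char *s, size_t size, size_t off)
{
    if (size - off < 3 || memcmp(s + off, "---", 3) != 0)
        return 0;
    off += 3;
    while (off < size && (s[off] == ' ' || s[off] == '\t' || s[off] == '\r'))
        off++;
    return off == size || s[off] == '\n';
}

/* Returns 1 and fills `out` if the buffer starts with a terminated front
 * matter block; returns 0 and leaves `out` untouched otherwise. */
int find_front_matter(const char *s, size_t size, front_matter *out)
{
    size_t yaml_off, off;

    if (!is_marker_line(s, size, 0))
        return 0;

    yaml_off = off = next_line(s, size, 0);
    while (off < size) {
        if (is_marker_line(s, size, off)) {
            out->yaml_off = yaml_off;
            out->yaml_len = off - yaml_off;
            out->body_off = next_line(s, size, off);
            return 1;
        }
        off = next_line(s, size, off);
    }
    return 0;  /* no closing marker: treat the document as plain Markdown */
}
```

A caller could then hand `s + fm.yaml_off` (with length `fm.yaml_len`) to a yaml parser, or stash it verbatim, and pass `s + fm.body_off` to the Markdown parser.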

https://docs.github.com/en/contributing/writing-for-github-docs/using-yaml-frontmatter
https://jekyllrb.com/docs/front-matter/
https://github.com/readthedocs/commonmark.py/issues/208
https://assemble.io/docs/YAML-front-matter.html

ATM I'm trying to figure out what Qt should do with this, since it's currently a mess if you give such a document to the TextDocument source property (https://doc-snapshots.qt.io/qt6-6.7/qml-qtquick-textdocument.html#source-prop). I figured that if md4c gave me that chunk, I could stash it away somehow so that QTextMarkdownWriter can write it back verbatim when the file is saved. But Qt doesn't have a yaml parser, so an application that cares about that might need to subclass QTextDocument, retrieve the saved front-matter string, and parse it with https://github.com/jbeder/yaml-cpp or something like that. QTextDocument won't even skip it yet, so it's going to be a bit less convenient for now.

mity commented 5 months ago

I'm not familiar with Jekyll, its front matter, or the other tools that support it.

But if I understand it correctly, it's some sort of templating system and the document can contain references to variables defined in the front-matter. Also note the templating system itself seems to be more or less independent of the document format after the front-matter: E.g. you might have a static HTML page below the front-matter instead of Markdown.

Consequently, it seems to me that such raw documents shouldn't really be passed to a Markdown, HTML or any other parser before the templating system has done its job, i.e. replaced the variable references with their respective values and removed the front-matter. Only then should the result be passed to the actual document parser, here MD4C.

ec1oud commented 5 months ago

I think yaml front matter is turning into the standard way to add metadata, not only for templating systems. https://obsidian.md/ does that, and I'm trying to prototype a similar app with Qt.

But ok I will see if I can get Qt to deal with it ahead of time, for now. (just the front matter, not template replacement)

Turning a line of yaml into an H2 is just wrong.

---
birth: !timestamp '2024-01-09 19:58:17.324717822 -0700'
---
blah blah

becomes (with md2html)

<hr>
<h2>birth: !timestamp '2024-01-09 19:58:17.324717822 -0700'</h2>
<p>blah blah</p>
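
For context, the output above follows from plain CommonMark rules: the first `---` has no paragraph before it, so it is a thematic break, while the second `---` directly follows a paragraph line and therefore acts as a setext heading underline, which yields a level-2 heading. The toy classifier below illustrates just that distinction. It's a simplified sketch with hypothetical names (`line_kind`, `is_dashes`, `classify`), it ignores indentation, trailing spaces and all other block rules, and it is not how md4c is implemented.

```c
#include <stddef.h>
#include <string.h>

typedef enum { LINE_TEXT, LINE_THEMATIC_BREAK, LINE_SETEXT_H2 } line_kind;

/* True if the line (without its newline) is three or more '-' and nothing
 * else. Simplification: real CommonMark also allows indentation, trailing
 * whitespace, and shorter setext underlines. */
static int is_dashes(const char *line)
{
    size_t n = strspn(line, "-");
    return n >= 3 && line[n] == '\0';
}

/* Classify one line, given whether the previous line was paragraph text.
 * A dash line right after paragraph text underlines it as a heading;
 * anywhere else it is a thematic break. */
line_kind classify(const char *line, int prev_is_paragraph)
{
    if (is_dashes(line))
        return prev_is_paragraph ? LINE_SETEXT_H2 : LINE_THEMATIC_BREAK;
    return LINE_TEXT;
}
```

Applied to the three-line front matter above, the lines classify as thematic break, text, and setext underline, which matches the `<hr>` and `<h2>` that md2html prints.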

mity commented 5 months ago

> I think yaml front matter is turning into the standard way to add metadata

I see. Then it might make sense, with the caveat that no validation of the stuff between the two --- markers would be performed. Incorporating a YAML parser into MD4C just for this seems like heavy overkill to me.

What should MD4C do with it?

Skip it silently, propagate it as a code block (possibly with MD_BLOCK_CODE_DETAIL::info and/or MD_BLOCK_CODE_DETAIL::lang set to something sensible), or do we need some special MD_BLOCKTYPE for it?

Also, as it's not in the vanilla CommonMark spec, it would likely be recognized only when enabled by some new parser option.

ec1oud commented 5 months ago

I'm not sure yet what the md4c block type should be. --- is not a normal code block fence, so probably we shouldn't confuse it with a code block. There could be yaml in a code block somewhere else in the document, which presumably would not have the same meaning.

Maybe other people will have comments on this in the meantime.

I'm playing with https://codereview.qt-project.org/c/qt/qtbase/+/529543 for now.