Closed by pi0 5 months ago
the md4c parser is pretty simple, it just receives 5 hooks:
enter_block: returns block type and details (e.g., heading level)
leave_block: block end
enter_span: returns span type and details (e.g., link href)
leave_span: span end
text: inner text
basically we can create these hooks in the host. calling js functions from the wasm module is not ideal, but yes, i think i can do it. i just don't know whether these hooks are enough for omark's goal?
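As a rough illustration of how the five hooks could be enough to rebuild a tree on the host side, here is a minimal sketch that keeps a stack of open nodes. The name `createTreeBuilder` and the hook argument shapes (`type`, `props`, `value`) are assumptions made for this example; they are not md4c's or md4w's actual API.

```javascript
// Hypothetical sketch: the hook signatures below are assumed for illustration.
function createTreeBuilder() {
  const root = { type: "root", children: [] };
  const stack = [root]; // the innermost open block/span is always last

  // enter_block and enter_span behave the same: open a node and descend.
  const enter = (type, props) => {
    const node = { type, children: [] };
    if (props && Object.keys(props).length > 0) node.props = props;
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  };

  // leave_block and leave_span close the innermost open node.
  const leave = () => {
    if (stack.length > 1) stack.pop();
  };

  return {
    hooks: {
      enter_block: enter,
      leave_block: leave,
      enter_span: enter,
      leave_span: leave,
      // text is appended to whichever node is currently open.
      text: (value) => {
        stack[stack.length - 1].children.push(value);
      },
    },
    result: () => root.children,
  };
}

// Replay the events a parser might emit for "# Jobs".
const builder = createTreeBuilder();
builder.hooks.enter_block("h1");
builder.hooks.text("Jobs");
builder.hooks.leave_block();
console.log(JSON.stringify(builder.result()));
```

Because blocks and spans are handled identically here, nesting comes for free from the stack; a real binding would still need to map the parser's numeric block/span types to names.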
I am thinking of the fastest method to resolve the traversed MD tree so omark can make a simplified interface on top of it.
We might try to benchmark two methods:
Please let me know if you would like me to try, or would rather compare them yourself 👍🏼
i prefer constructing a tree, how about md to a jsx-like tree?
# Jobs
Stay _foolish_, stay **hungry**!
[Apple](https://apple.com)
<a href="https://apple.com">Apple</a>
[
{type: 'h1', children: ['Jobs']},
{type: 'p', children: [
'Stay ',
{type: "em", children: ["foolish"]},
', stay ',
{type: "strong", children: ["hungry"]},
'!',
{type: 'a', props: {href: 'https://apple.com'}, children: ['Apple']},
{type: 'html', props: {html: '<a href="https://apple.com">Apple</a>'}, children: []}
]}
]
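To show how a consumer might walk a tree of this shape, here is a small sketch that renders it back to HTML. The helper names `escapeHtml`, `renderNode` and `render` are made up for this example, and the node shape (`type`, optional `props`, `children`) is taken from the proposal above rather than from any released md4w output.

```javascript
// Hypothetical helpers; only the node shape comes from the proposal above.
const escapeHtml = (s) =>
  s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");

function renderNode(node) {
  // Plain strings are text children and must be escaped.
  if (typeof node === "string") return escapeHtml(node);
  // 'html' nodes carry raw markup that is emitted unchanged.
  if (node.type === "html") return node.props.html;
  const attrs = Object.entries(node.props ?? {})
    .map(([key, value]) => ` ${key}="${escapeHtml(String(value))}"`)
    .join("");
  const inner = node.children.map(renderNode).join("");
  return `<${node.type}${attrs}>${inner}</${node.type}>`;
}

const render = (nodes) => nodes.map(renderNode).join("");

console.log(
  render([
    { type: "h1", children: ["Jobs"] },
    { type: "p", children: ["Stay ", { type: "em", children: ["foolish"] }] },
  ])
);
```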
Honestly, for omark, I am considering a flattened array of streamable data (to make markdown ASTs as simple as possible), plus some alternative ways of nesting.
If you prefer a nested tree like other parsers, that is no problem, we can always convert 👍🏼
what does the flattened array look like?
how about splitting by blocks? this should work as streamable data
--- chunk 1
{type: 'h1', children: ['Jobs']}
--- chunk 2
{type: 'p', children: [
'Stay ',
{type: "em", children: ["foolish"]},
', stay ',
{type: "strong", children: ["hungry"]},
'!',
{type: 'a', props: {href: 'https://apple.com'}, children: ['Apple']},
{type: 'html', props: {html: '<a href="https://apple.com">Apple</a>'}, children: []}
]}
or use array instead of object:
--- chunk 1
['h1', ['Jobs']]
--- chunk 2
['p', [
'Stay ',
["em", ["foolish"]],
', stay ',
["strong", ["hungry"]],
'!',
['a', {href: 'https://apple.com'}, ['Apple']],
['html', {html: '<a href="https://apple.com">Apple</a>'}, []]
]]
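The two shapes carry the same information, so converting between them is mechanical. Below is a minimal sketch of going from the object form to the array form; `toTuple` is a name invented for this example, and it assumes `props` is simply omitted from the tuple when a node has none, as in the chunks above.

```javascript
// Hypothetical converter from { type, props?, children } to [type, props?, children].
function toTuple(node) {
  // Text children stay as plain strings in both forms.
  if (typeof node === "string") return node;
  const children = node.children.map(toTuple);
  const hasProps = node.props && Object.keys(node.props).length > 0;
  return hasProps ? [node.type, node.props, children] : [node.type, children];
}

console.log(JSON.stringify(toTuple({ type: "h1", children: ["Jobs"] })));
console.log(
  JSON.stringify(
    toTuple({ type: "a", props: { href: "https://apple.com" }, children: ["Apple"] })
  )
);
```

One trade-off of the array form is that a consumer has to check whether the second element is a props object or the children array, since the tuple length varies.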
Yes, exactly, I am thinking about splitting by logical blocks, but it is tricky to represent (still thinking about how). Mainly I am considering a Proxy that lets each block be accessed either as a stringified value or traversed individually. (Why? Because many tools only need the high-level representation of the markdown AST, not the details.) Something like this:
[
"Jobs", // .{ type: 'h1', contents: <Proxy>[p:stay foolish..a:apple] }
"Stay foolish, stay hungry!", // .{ type: 'p', contents: <Proxy>[.stay, em: ...] }
"Apple" // .{ type: 'a', contents: <Proxy>[apple] }
]
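One possible way to get that dual behaviour is a Proxy that stringifies to the block's plain text while still exposing `type` and `contents` for traversal. This is only a sketch of the idea, not omark's design: `blockView` and `plainText` are hypothetical names, and it assumes the nested `{ type, children }` node shape discussed earlier in the thread.

```javascript
// Concatenate all text found under a node (hypothetical helper).
const plainText = (node) =>
  typeof node === "string" ? node : node.children.map(plainText).join("");

// Wrap a node so it reads as a string but can still be traversed.
function blockView(node) {
  return new Proxy(node, {
    get(target, key) {
      // String coercion (template literals, String(), +) yields the flat text.
      if (key === Symbol.toPrimitive || key === "toString") {
        return () => plainText(target);
      }
      // `contents` wraps child nodes lazily, only when someone asks for them.
      if (key === "contents") {
        return target.children.map((child) =>
          typeof child === "string" ? child : blockView(child)
        );
      }
      return Reflect.get(target, key);
    },
  });
}

const paragraph = blockView({
  type: "p",
  children: ["Stay ", { type: "em", children: ["foolish"] }, "!"],
});
console.log(`${paragraph}`); // high-level use: just the text
console.log(paragraph.type, paragraph.contents[1].type); // detailed traversal
```

Tools that only need the text never pay for wrapping the children, while tools that need details can descend through `contents` one level at a time.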
I would love to brainstorm on this possibility together once we get there! I think as a first step we need the parsed AST, and I have high hopes of relying on md4w as promised before, since it is native and minimal! If you are good with the first proposal (https://github.com/ije/md4w/issues/3#issuecomment-1946257737), I think we can go from there.
sounds cool! i will try to implement a mdToJson function for a start.
I just made a quick wrapper in omark that produces (almost) the same object as your proposal, so we can work in parallel.
The object is meant for internal purposes only and I can happily adjust to whatever you finally provide, but I would also love to have your 👍🏼 on https://github.com/unjs/omark/pull/15 if you have a few minutes to check, so we are safe to go.
thanks
@pi0 https://github.com/ije/md4w/pull/4 the first test has passed (not finished, it can't handle nested blocks/spans yet)
Hi. I quickly made this tracker issue while writing https://github.com/unjs/automd/issues/32 to see if you would be interested in also exposing a simple parse util (it could either stream or return the whole AST). This could be used as the parser core in unjs/omark ❤️