Export compiler function

chrisjsewell commented 2 years ago

Initial checklist

[X] I read the support docs
[X] I read the contributing guide
[X] I agree to follow the code of conduct
[X] I searched issues and couldn’t find anything (or linked relevant results below)

Problem

Heya, I would like to directly import/use the compiler, on a pre-created list of events. I think this is currently not possible?

Obviously this is the key function provided by this package, then fromMarkdown is just a wrapper around it and the upstream postprocess/parse/postprocess functions (all importable)

Solution

Allow for e.g.

import {compiler} from 'mdast-util-from-markdown/lib/index'

compiler(options)(events)

I guess this just requires the addition of export function compiler..., and a small modification of package.json, ilke in micromark itself:

{
  "exports": {
    ".": {
      "development": "./dev/index.js",
      "default": "./index.js"
    },
    "./lib/index": {
      "development": "./dev/lib/index.js",
      "default": "./lib/index.js"
    },
    "./lib/index.js": {
      "development": "./dev/lib/index.js",
      "default": "./lib/index.js"
    }
  }
}

Alternatives

Don't think so

wooorm commented 2 years ago

What’s the reason you have events? compile and events are all rather “internal” and not “pretty”

chrisjsewell commented 2 years ago

To implement https://github.com/executablebooks/myst-spec, and replace our current markdown-it implementation: https://github.com/executablebooks/markdown-it-docutils, I need to be able to perform nested/incremental parsing:

// mdast nested parsing proof-of-principle
import {parse} from 'micromark/lib/parse'
import {postprocess} from 'micromark/lib/postprocess'
import {preprocess} from 'micromark/lib/preprocess'
import {compiler} from 'mdast-util-from-markdown/lib/index'

// Take the following example. The problem here is that:
// (a) we first want to do a top-level parse of the source file, not processing the directive
// (b) we then want to do a nested parse of the directive content,
//     but within the "context" of the top-level parse.
const content = `
Paragraph

\`\`\`{note}
[x]
\`\`\`

[x]: https://www.google.com
`

// This is where we would load MyST specific plugins and configuration
const options = {}
// This adapted parser allows us to pre-set the parsing context,
// for the starting position of the text (in the source file),
// and any previously parsed definition identifiers (for the definition lookup). 
function parseMarkdown(content, options, initialPosition, defined) {
    const parser = parse(options)
    parser.defined.push(...(defined || []))
    const events = postprocess(
        parser.document(initialPosition).write(preprocess()(content, 'utf8', true))
    )
    return {mdast: compiler(options)(events), defined: parser.defined}
}

// (a) first we perform the top-level parse
const {mdast, defined} = parseMarkdown(content, options)

// we then get the initial AST, and also any identifiers for definitions
console.log(mdast)
console.log(defined)

// ... some extra steps here would identify the directive,
// and give us its content and the content starting position
const nestedContent = `[x]`
const initialPosition = {line: 4, column: 1, offset: 0}

// If we did not provide the definition identifiers here then,
// by the CommonMark spec, the reference would simply be parsed as text. 
const {mdast: mdastNested} = parseMarkdown(nestedContent, options, initialPosition, defined)

Trust me, I know the "unprettiness" of Markdown parsing 😅, I'm also the author of https://github.com/executablebooks/markdown-it-py

Events and compilers are already documented as part of your core parsing architecture: https://github.com/micromark/micromark#architecture, so I would not necessarily say they are completely "internal" 😬

chrisjsewell commented 2 years ago

FYI, if we can get all this working, then we are hoping to utilise it as the core parsing architecture in products such as https://curvenote.com/, https://irydium.dev/ and https://github.com/agoose77/jupyterlab-markup 😄

wooorm commented 2 years ago


// Take the following example. The problem here is that:
// (a) we first want to do a top-level parse of the source file, not processing the directive
// (b) we then want to do a nested parse of the directive content,
//     but within the "context" of the top-level parse.

Can you expand on this? Markdown already allows for (a). What is the “context” you mean in (b)?

chrisjsewell commented 2 years ago

The context is:

Initialising the parse with the correct initial position, so that all the node positions point to their correct places in the source file. You could do this retroactively, in a post-processing step, but it's nicer to do in one parse
Initialising the parser with known definition/footnote identifiers. This is the key point really, because CommonMark only parses definition references of known definitions (otherwise treating them as plain text), you have to have this context of "found" definitions. It would be great if CommonMark, would just parse all [x] syntax as definition references, irrespective of what definitions are present, then allow the renderer to handle missing definitions, but such is life 😒.

wooorm commented 2 years ago

Why not integrate with micromark in an extension? Extensions parse their thing and they can annotate that some stuff inside them should be parsed next

https://github.com/micromark/micromark/blob/fc5e2d8b83eb9c01c9bfd2f4b1ea4e42e6a7e224/packages/micromark-util-types/index.js#L20

chrisjsewell commented 2 years ago

Why not integrate with micromark in an extension?

Possibly, but it then means that "everything" has to be parsed in a single parse, and makes things a lot less "modular" and incremental

the idea with these directives, is that you perform an initial parse, which just identifies the directives

```{note}
Internal *markdown*

more


which gets you to an intermediate AST

```xml
<directive name="note">
    Internal *markdown*
<directive name="note">
    more

Then you perform a subsequent parse, which processes the directives and gets you to your final AST:

<admonition type="note">
  <paragraph>
    <text>
        Internal
    <emphasis>
        <text>
           markdown
<admonition type="note">
  <paragraph>
    <text>
        more

This makes it a lot easier than having to do everything at the micromark "level"

wooorm commented 2 years ago

the thing is that with tracking position (one thing) but importantly all the definition identifier stuff, you’re replicating a lot of the work.

Also note that the positional info is not going to be 100% if you have mdast for fenced code, and then parse its result, because an funky “indent”/exdent is allowed:

https://spec.commonmark.org/dingus/?text=%20%20%20%60%60%60%7Bnote%7D%0A%20%20Internal%0A%20*markdown*%0Amore%0A%60%60%60

This makes it a lot easier than having to do everything at the micromark "level"

Uhhh, this post is about juggling micromark internals to not have to make a micromark extension? How is that easier? 🤔 I don‘t get it.

It sounds simpler to

copy/paste https://github.com/micromark/micromark/blob/main/packages/micromark-core-commonmark/dev/lib/code-fenced.js
look at https://github.com/micromark/micromark-extension-directive/blob/7e2f9384e6ccb35e1182e2471851918be37c3fd4/dev/lib/directive-container.js#L149-L152 to see how it handles internal containers

Then you perform a subsequent parse, which processes the directives and gets you to your final AST:

micromark already does that? It has it built in. Why do you need separate stages?

wooorm commented 2 years ago

How are you using “incremental”?

chrisjsewell commented 2 years ago

micromark already does that? It has it built in. Why do you need separate stages?

Hmmm, I feel I'm not explaining directives properly to you; processing directive content is not just about parsing, its about node generation. Directives need to be able to generate MDAST nodes, and these nodes do not necessarily relate directly to syntax in the source text.

Take the figure directive:

This:

```{figure} https://via.placeholder.com/150
This is the figure caption!

Something! A legend!?


needs to go to this:

```yaml
    title: Simple figure
    id: container
    mdast:
      type: root
      children:
        - type: directive
          kind: figure
          args: https://via.placeholder.com/150
          value: |-
            This is the figure caption!
            Something! A legend!?
          children:
            - type: container
              kind: figure
              children:
                - type: image
                  url: https://via.placeholder.com/150
                - type: caption
                  children:
                    - type: paragraph
                      children:
                        - type: text
                          value: This is the figure caption!
                - type: legend
                  children:
                    - type: paragraph
                      children:
                        - type: text
                          value: Something! A legend!?

How would you even go about getting a micromark extension to achieve this?

It is a lot easier to work at the MDAST node level than the micromark event level, when processing directives. But you do need to have a way to perform nested parsing.

This is exactly how docutils/sphinx directives work; you are generating nodes, and only performing nested parsing when necessary: https://github.com/live-clones/docutils/blob/6548b56d9ea9a3e101cd62cfcd727b6e9e8b7ab6/docutils/docutils/parsers/rst/directives/images.py#L146

chrisjsewell commented 2 years ago

FYI, I also know of https://github.com/micromark/micromark-extension-directive, but these directives are quite different, in that their content is "interpreted" text, i.e. it might not be Markdown.

Take for example csv-table: https://docutils.sourceforge.io/docs/ref/rst/directives.html#csv-table-1

```{csv-table}
:header: "Treat", "Quantity", "Description"
:widths: 15, 10, 30

"Albatross", 2.99, "On a stick!"
"Crunchy Frog", 1.49, "If we took the bones out, it wouldn't be crunchy, now would it?"
"Gannet Ripple", 1.99, "On a stick!"



Here, the content will be converted into table nodes, which is not something that can be done in a micromark extension.

wooorm commented 2 years ago

Thanks for expanding. I now understand the use case better, particularly why it’s a choice at the AST level, after the initial parse, to parse subdocuments.

I do find your earlier statements about wanting to reuse identifiers of “outer” definitions in these “inner” a bit weird. If they are really so separate and optional, it seems beneficial to have them “sandboxed” from the outer content, and in other words it seems to be at odds with your goal to reuse identifiers.

How would you even go about getting a micromark extension to achieve this?

I don’t see why not? micromark can parse that syntax. Though micromark is a level under mdast. So micromark would parse the syntax. A utility would turn the events into that tree.

It is a lot easier to work at the MDAST node level than the micromark event level, when processing directives. But you do need to have a way to perform nested parsing.

I am not suggesting to do the “Processing directives” part in micromark. As I understand it we both believe that that can happen in mdast. I am suggesting to “perform nested parsing” in micromark. Because markdown does “nested” already: micromark has this builtin.

Have you seen https://github.com/micromark/micromark#extending-markdown?
Where are you, on a scale from X to Y, between “we have a ton of content in the wild using this so we can’t change” and “we can still come up with new and improved ways”?

This issue is about compile, but you also mentioned:

Add options.startPoint support to micromark (and: how to even handle indents?)
How to pass “existing” identifiers? (and: how even to do that for extensions (footnotes))

How important are these to you? Are there other subissues you percieve?

chrisjsewell commented 2 years ago

Thanks for the response, I'll probably go in to more detail in due course; I'm still playing around with things, and nothing is necessarily set in stone, although:

Where are you, on a scale from X to Y, between “we have a ton of content in the wild using this so we can’t change” and “we can still come up with new and improved ways”?

Well there is already a fair amount of people using https://github.com/executablebooks/MyST-Parser and https://github.com/executablebooks/jupyter-book. So trying to change everything is not trivial, although I don't want this to completely block changes in design

Also, one of the key things, is imitating to some extent how sphinx/docutils already works (which is a pretty powerful document processor), but bringing that power to Markdown, plus making it a much more language agnostic specification (e.g. allowing us to use JS and unified, etc)

So fitting into how that works, and being able to re-use some of the general design is desirable

it seems beneficial to have them “sandboxed” from the outer content

It is a matter of modularization vs re-use; a common use-case may be wanting to re-use definitions specified at the top-level. For example:

[a]

```{note}
[a]

[a]


If parsing is sandboxed, this now forces users to re-define the definitions in every sandbox:

````markdown
[a]

```{note}
[a]
[a]: https://example.com

[a]
[a]: https://example.com


As discussed in https://github.com/executablebooks/myst-spec/issues/5#issuecomment-1072107254, I think "scoped" definitions may be the way to go

> I am suggesting to “perform nested parsing” in micromark. 

It is in some respects a matter of abstraction.
Directives need to be extensible, and users should be able to create their own directives, in as simple way as possible, and not be tied to one technology/implementation, e.g. like this pseudo-code:

```python
class MyDirective(Directive):
    def process(self):
        parent_node = self.create_node(type='x')
        child_node = self.create_node(type='y')
        parent_node.children.append(child_node)
        parsed_nodes = self.nested_parse(self.content)
        parent_node.children.extend(parsed_nodes)
        return [parent_node]

This type of abstraction can be written without knowledge of the underlying parsing technology, and different parsers can even be injected (docutils, micromark, markdown-it, ....)

Writing micromark extensions is obviously very specific and, honestly, way too complex for casual users.

I do see the power in using micromark directly though, to essentially create a Concrete Syntax Tree, with "built-in" positional referencing. Column level positional information is IMO the big selling point over markdown-it But how do you create an abstraction to essentially hide its use 🤔 You could, up-front, specify that the content needs to be parsed, as a special case, e.g.

class MyAdmonition(Directive):
      parse_content = True
      def process(self):
            parsed_nodes = self.parsed_content
            ...

But it gets tricky

How important are these to you? Are there other sub-issues you percieve? Add options.startPoint support to micromark (and: how to even handle indents?) How to pass “existing” identifiers? (and: how even to do that for extensions (footnotes))

Providing the starting point is less important, since this could probably be achieved with post-processing of the node tree

Identifiers are more important because that cannot be post-processed (the way Markdown works). For footnotes, it would be similar to definitions, but pre-setting parser.gfmFootnotes: https://github.com/micromark/micromark-extension-gfm-footnote/blob/adb67998d19b6d616064e1801bef95fe093647ba/dev/lib/syntax.js#L55

wooorm commented 2 years ago

If parsing is sandboxed, this now forces users to re-define the definitions in every sandbox:

Assuming that you have normal people writing markdown, and given that most normal people do not know about references/definitions, or don’t use them, you’ll likely be fine if these are sandboxed!

One more example to think about for you:

[a]: 1

```{note}
[a]: 2

[a]



The result of 2 here would likely make the most sense. That only occurs in a sandbox. Because CM otherwise dictates that the first definition “wins”.

> Directives need to be extensible, and users should be able to create their own directives, in as simple way as possible, and not be tied to one technology/implementation [...]

I would recommend using the other directives then: the ones from `remark-directive`. They are supported in more parsers. They are (hopefully) on track of being supported in more places. They have the big benefit that they can be parsed without understanding which extensions are enabled.
With your directives, there has to be some definition that `note` contains markdown and `table` contains csv data.
With those directives, any parser know that directives contain more markdown (and that code inside directive is data).

> Providing the starting point is less important, since this could probably be achieved with post-processing of the node tree

Regardless of whether it’s hard or easy, would you need it?

chrisjsewell commented 2 years ago

FYI, I'm getting somewhere 😅 https://github.com/executablebooks/unified-myst Feel free to give any pointers, if you see I'm doing anything drastically wrong

The result of 2 here would likely make the most sense. That only occurs in a sandbox. Because CM otherwise dictates that the first definition “wins”.

Cheers, yep certainly something to consider

I would recommend using the other directives then: the ones from remark-directive.

Ugghh, no; they are just a very different concept to docutils directives, they really do not support the aims of MyST, and more importantly, would be impossible to reconcile with https://github.com/executablebooks/MyST-Parser, i.e. as a plugin to Sphinx

They are (hopefully) on track of being supported in more places.

Well, they have been talked about for 8 years, and I haven't seen much of them yet 😬: https://talk.commonmark.org/t/generic-directives-plugins-syntax/444

They have the big benefit that they can be parsed without understanding which extensions are enabled.

This is also their biggest limitation, because basically they are just wrappers around blocks of Markdown, and the content cannot be anything else

Regardless of whether it’s hard or easy, would you need it?

Ideally yes

chrisjsewell commented 2 years ago

I'd note, I'm not completely closing the door on remark-directive, but it will definitely not fulfil the full MyST requirements. There may be some hybrid solution, but obviously I'm trying to balance that with trying to not introduce too much new syntax 😅

wooorm commented 2 years ago

I'd note, I'm not completely closing the door on remark-directive, but it will definitely not fulfil the full MyST requirements. There may be some hybrid solution, but obviously I'm trying to balance that with trying to not introduce too much new syntax 😅

If you have (legacy) reasons for requiring a custom arbitrary MDX extension, then I understand that you have to do what you have to do. If you have a choice to choose a syntax for arbitrary markdown extensions, then please take something that exists and has some support in different projects already (directives or MDX).

With micromark/remark/unified I want to push for a world where markdown is more interoperable. “Blessing” directives and trying to get other folks to use it is part of that.

Well, they have been talked about for 8 years, and I haven't seen much of them yet

They are talked about as long as CM exists. I don’t see why that matters. Most parsers support them or an issue is open. More people using and implementing them helps!

This is also their biggest limitation, because basically they are just wrappers around blocks of Markdown, and the content cannot be anything else

🤷‍♂️ I call it a feature, not a bug. Markdown has fenced code for data. Both fenced code and generic directives can be used. It’s even possible to, in the code that handles some “funky-table” extension, look for a generic directive with some name and look for the code inside it!

wooorm commented 2 years ago

^-- was the meta-level conversation.

At a micro-level, what do you need? Do you still need compile?

chrisjsewell commented 2 years ago

At a micro-level, what do you need? Do you still need compile?

yes please 🙏

^-- was the meta-level conversation. With micromark/remark/unified I want to push for a world where markdown is more interoperable. “Blessing” directives and trying to get other folks to use it is part of that.

Thanks for all your insight! Oh, I absolutely agree with the push for interoperability, and would love to contribute to that 👍

The hard thought, to my mind, is always how to balance usability, interoperability (which usually requires a tight specification) and flexibility/extensibility.

As I mentioning already, MyST is essentially adapting docutils/restructuredtext/sphinx, which is pretty nice in the amount of extensibility it offers and, because of that has a good ecosystem: https://pypi.org/search/?q=&o=&c=Topic+%3A%3A+Documentation+%3A%3A+Sphinx But, when it comes to interoperability, it's absolutely horrible, and there is reason that Markdown is vastly more popular than RST

Both fenced code and generic directives can be used.

For sure, this is possibility the route to go, although yeh trying not to not introduce too much syntax

wooorm commented 2 years ago

Reading your initial issue again, I do not see a reason to use compiler other than to use private APIs (parser.defined and initialPosition). Is it correct that you only need compiler to hook into private APIs?

If so, I do not feel much for exposing private hidden APIs and would want to move our discussion instead to, if possible, expose those APIs.

chrisjsewell commented 2 years ago

Is it correct that you only need compiler to hook into private APIs?

yep I guess so (although I still feel the compiler should not be private)

Parsing initialPosition 👍

For parser.defined, it would also require at a minimum parser.gfmFootnotes, plus I would need to both supply them and retrieve them.

For this API, I would suggest it might be beneficial to make a slight change 😬:

Rather than setting this parsing state directly on the ParserContext, it would be better to have a dict where they can be added, e.g.

// instead of
parser.defined = []
parser.gfmFootnotes = []
//have
parser.env = {
   defined: [],
   gfmFootnotes: []
}

This is, in fact, basically what Markdown-It does: https://github.com/markdown-it/markdown-it/blob/6da01033aa6ea2892e16a44672431fad3aff37b2/lib/index.js#L531-L533, and allows you to (optionally) parse the env, which is modified in place. The env can be used by any plugin, to store parsing state.

Take this example with markdown-it-py:

from markdown_it import MarkdownIt
from mdit_py_plugins.footnote import footnote_plugin

md = MarkdownIt().use(footnote_plugin)
env = {}
md.parse("""
[a]: https://www.example.com
[^b]: This is a footnote
""", env)
print(env)
md.render("[a] [^b]", env)

gives:

{'references': {'A': {'title': '', 'href': '[https://www.example.com]()', 'map': [1, 2]}}, 'footnotes': {'refs': {':b': -1}}}
<p><a href="[https://www.example.com]()">a</a> <sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></p>

So for here, that would look something like:

import {fromMarkdown} from 'mdast-util-from-markdown'
const options = {}
const env = {}
fromMarkdown("content", options, env)

wooorm commented 2 years ago

Thanks for your proposal!

For parser.defined, it would also require at a minimum parser.gfmFootnotes, plus I would need to both supply them and retrieve them.

I do not believe it makes sense to support parser.gfmFootnotes in micromark, as micromark does not support GFM. GFM is an extension. Micromark and extensions have options. So we can probably use options?

plus I would need to both supply them and retrieve them.

You’re getting an AST. So I believe you can already retrieve them? 🤔 What else do you need?

The env can be used by any plugin, to store parsing state.

I’d rather not implement some mutable state of undocumented APIs to solve other undocumented APIs, so I’d rather not implement env / env.references from markdown-it, I’d rather introduce documented APIs (options).

Another problem arises with what you mentioned in this issue and how it’s solved with parser.defined/parser.gfmFootnotes. Those are lists of strings of actually defined things. You want to pass identifiers for things that are not defined. This will crash micromark:

https://github.com/micromark/micromark/blob/fc5e2d8b83eb9c01c9bfd2f4b1ea4e42e6a7e224/packages/micromark/dev/lib/compile.js#L709

…a micromark option that doesn’t work, that’s no good 🤔

For footnotes, IDs are enough. But for links, it would need to be IDs mapping to URLs and titles

chrisjsewell commented 2 years ago

so I’d rather not implement env / env.references from markdown-it, I’d rather introduce documented APIs (options).

I would just note that markdown-it also has options. options and env are two entirely different things: the first is immutable and set before parsing, the second is mutable and mutated during parsing

chrisjsewell commented 2 years ago

More conceptually, I feel that the documentation of parsing/processing in the unified Markdown space, is really obscurated, and it is hard to follow what packages do what (micromark, remark, mdast-util, ...), and in turn how to inject plugins into the process.

Firstly, I feel a simple diagram like this would help to clear things up:

Personally, I find diagrams far easier to consume than lots of text, and also an additional diagram I would have found useful, is for how the tokenizer works, e.g.

Secondly, I feel it might be helpful to split the packages up a bit differently, so that they handle separate concerns. This is the current packages:

I would suggest that this would be a better split:

Relating back to the original issue, a key difference between mdast-util-from-markdown and mdtoken-to-mdast, is that it would not be "concerned" with where the tokens come from, as long as they adhere to a certain format, i.e. it would indeed just expose the compiler (or something similar)

Having mdast-util-from-markdown enforce that the only way you can tokenize the source is to use micromark (in a certain way), seems like a not ideal separation of concerns 😬

wooorm commented 2 years ago

Please open other discussions in unified, remark, or mdast on how you believe documentation can be improved, and also definitely open discussions in micromark on which diagrams you’d find useful where, that’s very much appreciated. I think the diagram you show is, albeit a bit more verbosely, shown in the micromark readme: https://github.com/micromark/micromark#architecture.
It is intentional that HTML and mdast are the only two formats. I don’t see a reason for manually creating tokens and events outside of micromark (events literally contain micromark internals), and hence I don’t see a reason for why compile needs to be exposed.

Your diagram includes a CST. We don’t have a CST yet. We have an AST. micromark was in created in part to make CSTs possible, but the tokens are not that.

The division in your diagram is quite interesting, where the CST is in the middle, and that can be turned by different tools into other things, as we have such a division in unified already: the AST.

chrisjsewell commented 2 years ago

The division in your diagram is quite interesting, where the CST is in the middle

Setting aside whether the tokens are CST or not, this is actually what happens though no? You always have to go from source text -> tokens -> ...

and that can be turned by different tools into other things, as we have such a division in unified already: the AST.

But then this abstraction is essentially broken by having an HTML compiler in micromark. If you are saying that you should always go via the AST, then shouldn't micromark also comply with this?

chrisjsewell commented 2 years ago

events literally contain micromark internals

But then it feels strange that you basically have to use them, to create a micromark plugin. It seems like you are saying that creating a micromark plugin should not be possible, because it kind of only uses private APIs

But anyhow, thanks again for your time! I think I've discussed all I can about my use case at this point

chrisjsewell commented 2 years ago

actually one more thing 😅

micromark was in created in part to make CSTs possible, but the tokens are not that.

Out of interest, why would you say tokens are not a CST?

wooorm commented 2 years ago

Setting aside whether the tokens are CST or not, this is actually what happens though no? You always have to go from source text -> tokens -> ...

I am commenting on the duality of a) the boxes in the diagram, whose steps can indeed be considered correct, b) the circles drawn and the request that lies within them, which we kinda support in unified, when considering syntax trees as the middle.

But then this abstraction is essentially broken by having an HTML compiler in micromark. If you are saying that you should always go via the AST, then shouldn't micromark also comply with this?

It is by design that micromark is useable as-is. a) things are easier to maintain and particularly test if they can produce an output b) there is group of humans served with a tool that takes markdown and results in HTML (an alternative to marked), unified is a bit complex, serving more extensive goals

But then it feels strange that you basically have to use them, to create a micromark plugin.

You will indeed need to use the internal micromark APIs to create micromark extensions (note that we use the term extensions at the micromark level, compared to plugins at the unified level). Conceptually, I think it’s important to note:

Extensions stay inside micromark, and while parsing markdown is horrible and thus the internals exposed to extensions, generally there are documented (public) APIs to handle them. For example, micromark-extension-gfm-autolink-literal here is very clean: https://github.com/micromark/micromark-extension-gfm-autolink-literal/blob/main/dev/lib/syntax.js. A couple documented APIs, nothing weird. Would it be nice to improve these things in the future? Definitely. But if we now expose them, we can’t change them then.
micromark exists 1. to create HTML, 2. to create mdast, (potential future 3. to create an mdcst), these targets are limited to unified maintainers and thus don’t need to be covered by SemVer, all targets pull tokens/events from micromark into them, they don’t push tokens/events from somewhere to someplace

You are asking to expose internals outside of micromark and to push tokens/events into projects that aren’t meant for that. Your problem can be solved in several other clean ways, which I’d rather do.

Finally, I also do what I can to prevent folks from making extensions. I’m not a fan of extending the syntax of markdown.

Out of interest, why would you say tokens are not a CST?

They’re not a tree structure (I have called them “concrete tokens” though), although the event system get close to that
These tokens contain a bunch of private and hidden stuff, an AST would be a public and documented format. E.g., to get the text is “included” in a token, the tokenizer has to be used

Also:

I’d personally want a CST to be plain vanilla JSON instead of having to use function calls to access or change things
I would prefer to have a design that is compatible to mdast, as in, some format that’s both AST and CST, though this might not be possible

Feel free to open issues/PRs about:

docs
options.initialPosition to micromark
options.definitions (name to be discussed) in micromark
options.definitionIdentifiers (name to be discussed) in micromark-extension-gfm-footnote
Other more targeted issues

Also feel free to keep on discussing here! Closing, as I don’t think it’s wise to expose compile based on these current arguments.

chrisjsewell commented 2 years ago

Also feel free to keep on discussing here!

Yeh no worries

unicornware commented 7 months ago

@wooorm

would you be open to adding options.from so that it can be passed to document?

from can be passed to createTokenizer when working with micromark, but because the compiler function is not exported, i cannot make any use of the option without reimplementing the compiler myself.

export function fromMarkdown(value, encoding, options) {
  if (typeof encoding !== 'string') {
    options = encoding
    encoding = undefined
  }

  return compiler(options)(
    postprocess(
      parse(options)
        // .document()
        .document(options.from)
        .write(preprocess()(value, encoding, true))
    )
  )
}

wooorm commented 7 months ago

Hi Lex! Uhm, maybe, maybe not? Sounds like you want to increment positional info. I could see that not work the way you want. Can you elaborate more on your use case?

The reason I think it will not work, is that there are probably multiple gaps.

/**
 * Some *markdown
 * more* markdown.
 */

There’s a gap before more too. A similar problem occurs in MDX, where the embedded JS expressions can have markdown prefixes:

> <Math value={1 +
> 2} />

A better solution might be around https://github.com/vfile/vfile-location, similar to https://github.com/vfile/vfile-location/issues/14, and the “stops” in mdxjs-rs: https://github.com/wooorm/markdown-rs/blob/60db8e5896be05d23137a6bdb806e63519171f9e/src/util/mdx_collect.rs#L24.

unicornware commented 4 months ago

@wooorm

i'm not sure i understand your example :sweat_smile:

i'm working on an ast for docblocks that supports markdown in comments, so mdast node positions need to be relative to my comment nodes.

i ended up using transforms to apply my positioning logic, but feel it to be quite messy. based on some soft "tests", options.from would be more ideal

wooorm commented 4 months ago

There are several gaps. from only gives info for the start of the first line. There are multiple lines. If you want what you want, you’d need multiple froms. That doesn‘t exist. I don’t think this does what you want.

from is this place:

/**
 * |Some *markdown
 * more* markdown.
 */

Here your positional info is out of date again:

/**
 * Some *markdown
| * more* markdown.
 */

I recommend taking more time with my previous comment. Trying to grasp what it says. I think I describes the problem well, for your case, but also for MDX, and then shows how it is solved for MDX, which is what I believe you need to do too.

unicornware commented 4 months ago

@wooorm

oh i see, but i actually do want the initial from so the root node doesn't start at 1:1. i already have the logic to account for comment delimiters, if thats what you meant by gaps/multiple froms.

wooorm commented 4 months ago

My point is that you want that and more. Having just that is not enough for you.

wooorm commented 4 months ago

Please try to patch-package this issue, or edit it in your node_modules locally, and check if that works for you? I don't think it will.

unicornware commented 4 months ago

@wooorm

i think that is where our disconnect is. i know options.from isn't enough by itself, but it would be useful for markdown "chunks" spanning one line (i.e. a one line description) because no shifting is needed. for chunks spanning more than one line, options.from is useful so i can start my calculations from the given start point instead of 1:1.

i came to this conclusion because my soft "tests" included editing node_modules locally, lol.

wooorm commented 4 months ago

It could theoretically be useful for a hypothetical human. I’m not interested in adding things that might be useful to someone in the future, as I find that often, that future user practically wants something else.

Meanwhile, I believe you are helped with https://github.com/vfile/vfile-location/issues/14 and stops from mdx_collect.

unicornware commented 4 months ago

@wooorm

is that your suggested approach for pure markdown snippets as well?

additionally, from what i see, that issue is about max line length, which isn't what i'm looking for.

wooorm commented 4 months ago

That depends, is this use case a problem you have? From what you said before, I grasp that you don‘t have that problem or need that solution.

That issue is a feature request for a feature. It was brought up for a particular lint rule. That lint rule deals with line length. There are other lint rules. There is also your case, which is helped by that issue. Please though, read not just the link, but also the rest of what I mentioned:

A better solution might be around vfile/vfile-location, similar to https://github.com/vfile/vfile-location/issues/14, and the “stops” in mdxjs-rs: wooorm/markdown-rs@60db8e5/src/util/mdx_collect.rs#L24.

syntax-tree / mdast-util-from-markdown