syntax-tree / mdast

Markdown Abstract Syntax Tree format
https://unifiedjs.com

Add concrete syntax details #36

Closed CupOfTea696 closed 3 years ago

CupOfTea696 commented 3 years ago

It would be nice to have concrete syntax information on certain nodes, for example, which bullet type was used for a list item.

Problem

When using a markdown parser to modify markdown and write it back to a file, it would be nice to re-use the same style as the original markdown content. Currently, there is no way to get this information, either to use inside a compiler or to set the compiler options.

Expected behaviour

Syntax details would be included on tree nodes. Below is an example for emphasis.

Interface

interface Emphasis <: Parent {
  type: "emphasis"
  character: string?
  children: [TransparentContent]
}

Markdown:

*alpha* _bravo_

Yields:

{
  type: 'paragraph',
  children: [
    {
      type: 'emphasis',
      character: '*',
      children: [{type: 'text', value: 'alpha'}]
    },
    {type: 'text', value: ' '},
    {
      type: 'emphasis',
      character: '_',
      children: [{type: 'text', value: 'bravo'}]
    }
  ]
}

When recompiling the above tree back to Markdown, it would render back to *alpha* _bravo_ rather than *alpha* *bravo*, unless the compiler is explicitly set to use a certain character for emphasis.
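To make the expected behaviour concrete, here is a minimal sketch of a compiler that honours the proposed `character` field. Everything in it is hypothetical: the `serialize` function and the node type names are made up for illustration, `character` is not part of mdast, and escaping is ignored entirely.

```typescript
// Trimmed node shapes for this sketch only. `character` is the field
// proposed in this issue; it is NOT part of the mdast specification.
type TextNode = { type: 'text'; value: string };
type EmphasisNode = {
  type: 'emphasis';
  character?: '*' | '_';
  children: InlineNode[];
};
type InlineNode = TextNode | EmphasisNode;
type ParagraphNode = { type: 'paragraph'; children: InlineNode[] };

// Hypothetical serializer: uses the marker recorded on the node when it is
// present, and otherwise falls back to a configured default.
function serialize(
  node: ParagraphNode | InlineNode,
  fallback: '*' | '_' = '*'
): string {
  if (node.type === 'text') return node.value;
  const inner = node.children
    .map((child) => serialize(child, fallback))
    .join('');
  if (node.type === 'emphasis') {
    const marker = node.character ?? fallback;
    return marker + inner + marker;
  }
  return inner; // paragraph: just the concatenated children
}

const tree: ParagraphNode = {
  type: 'paragraph',
  children: [
    { type: 'emphasis', character: '*', children: [{ type: 'text', value: 'alpha' }] },
    { type: 'text', value: ' ' },
    { type: 'emphasis', character: '_', children: [{ type: 'text', value: 'bravo' }] }
  ]
};

const markdown = serialize(tree);
```

With the tree above, both markers survive the round trip. If the `character` fields were missing, both emphasis nodes would be written with the fallback marker.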

Alternatives

This could be implemented without any compiler modifications by having a utility that detects the used syntax and sets the compiler's options accordingly.
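A rough sketch of such a utility, assuming every node carries positional info with an `offset` and that the original source string is available. `detectEmphasisMarker` is a hypothetical name, the node shape is trimmed down (line and column are omitted), and the `emphasis` key on the resulting options object is an assumption about what the compiler accepts.

```typescript
// Trimmed node shape for this sketch; real positions also carry line/column.
type Span = { start: { offset?: number }; end: { offset?: number } };
type TreeNode = { type: string; position?: Span; children?: TreeNode[] };

// Hypothetical utility: tallies which marker opens each emphasis node in the
// source and returns the most common one ('*' wins ties).
function detectEmphasisMarker(
  tree: TreeNode,
  source: string
): '*' | '_' | undefined {
  const counts = { '*': 0, '_': 0 };
  const visit = (node: TreeNode): void => {
    const offset = node.position?.start.offset;
    if (node.type === 'emphasis' && offset !== undefined) {
      const char = source.charAt(offset);
      if (char === '*' || char === '_') counts[char]++;
    }
    node.children?.forEach((child) => visit(child));
  };
  visit(tree);
  if (counts['*'] === 0 && counts['_'] === 0) return undefined;
  return counts['_'] > counts['*'] ? '_' : '*';
}

const source = '_alpha_ *bravo* _charlie_';
// Text children and separators are left out to keep the example short.
const tree: TreeNode = {
  type: 'paragraph',
  children: [
    { type: 'emphasis', position: { start: { offset: 0 }, end: { offset: 7 } } },
    { type: 'emphasis', position: { start: { offset: 8 }, end: { offset: 15 } } },
    { type: 'emphasis', position: { start: { offset: 16 }, end: { offset: 25 } } }
  ]
};

const marker = detectEmphasisMarker(tree, source);
// Assumed option name; pass this to whichever compiler is in use.
const options: { emphasis?: '*' | '_' } = marker ? { emphasis: marker } : {};
```

This only settles on one style for the whole document. Mixed usage, as in the example, is flattened to the majority marker.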

ChristianMurphy commented 3 years ago

I'm not sure this makes sense in mdast specifically.

This document defines a format for representing Markdown as an abstract syntax tree

https://github.com/syntax-tree/mdast#introduction

Abstract syntax trees, by design, encode structure, not syntax (which is what a concrete syntax tree would do).

micromark and the Common Markup State Machine (CMSM) could enable constructing a concrete syntax tree, and this is noted in the CMSM readme:

complete, as it defines different types of tokens and how they are grouped, which allows the format to be represented as a concrete syntax tree

https://github.com/micromark/common-markup-state-machine/blob/0befbfa556fdba5559d35f8f365c2d50be301a1f/readme.md#1-background

This would likely need a new standard (mdcst?) to capture concrete syntax needs, since transforms interested in structure (AST) and formatters interested in specific syntax (CST) will have different wants and needs.

ChristianMurphy commented 3 years ago

Also see previous discussion at https://github.com/syntax-tree/mdast-util-to-markdown/issues/3

wooorm commented 3 years ago

@CupOfTea696 If this is needed, you can also use the positional info to access that info by looking characters up in the corresponding vfile!
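For illustration, a small sketch of that lookup for a list item's bullet. It assumes the node has `position` with offsets, that a list item's position starts at its marker, and that `source` holds the file contents (for example, the string value of the vfile). `sourceOf` is a hypothetical helper, not an existing API.

```typescript
type SourcePoint = { line: number; column: number; offset?: number };
type Positioned = {
  type: string;
  position?: { start: SourcePoint; end: SourcePoint };
};

// Hypothetical helper: returns the raw source text a node was parsed from,
// or undefined when the node has no usable offsets.
function sourceOf(node: Positioned, source: string): string | undefined {
  const start = node.position?.start.offset;
  const end = node.position?.end.offset;
  if (start === undefined || end === undefined) return undefined;
  return source.slice(start, end);
}

const source = '+ one\n+ two';
// Hand-written stand-in for the second list item a parser would produce.
const secondItem: Positioned = {
  type: 'listItem',
  position: {
    start: { line: 2, column: 1, offset: 6 },
    end: { line: 2, column: 6, offset: 11 }
  }
};

const raw = sourceOf(secondItem, source);
// Assumption: the first character of a list item's source is its bullet.
const bullet = raw?.charAt(0);
```

Nodes generated by plugins, rather than parsed from a file, typically have no position, which is why the helper returns `undefined` instead of guessing.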

wooorm commented 3 years ago

Some of this was also discussed here https://github.com/remarkjs/remark/issues/32 and then https://github.com/remarkjs/remark/issues/132#issuecomment-229507086, when remark had just been made (and was still called mdast).

Honestly, I feel that PostCSS and ESTree, which do patch this stuff on nodes, made a mistake: it makes the syntax tree hard to handle.

wooorm commented 3 years ago

Some more past issues on all this: https://github.com/search?o=desc&q=CST+user%3Amicromark+user%3Aremarkjs+user%3Aunifiedjs+user%3Asyntax-tree&s=created&type=Issues

I’m closing this because I don’t think such fields should be added to mdast nodes (by default: of course, it’s just JSON so you can do that yourself if you want). If/when there is a CST version of mdast, it will be a different project, and I’ll make sure to note it here!