syntax-tree / mdast-util-from-markdown

mdast utility to parse markdown
MIT License

Preserve original content #31

Closed fgarcia closed 1 year ago

fgarcia commented 1 year ago

Affected packages and versions

1.3.0

Link to runnable example

No response

Steps to reproduce

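import {fromMarkdown} from 'mdast-util-from-markdown'
import {toMarkdown} from 'mdast-util-to-markdown'

let before = 'one\n  two'
let mdast = fromMarkdown(before)
let after = toMarkdown(mdast)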

Expected behavior

Normally I would expect most ASTs to preserve the original content before any explicit manipulation.

In the code above I was counting on the text before and after the conversion (Text -> AST -> Text) being the same. However, the result trims the indentation of the second line. I know that those spaces are ignored when Markdown is converted to HTML, but I would not expect the parser to manipulate the original content in advance.

I expected before === after

Actual behavior

Currently before !== after

The value in after drops the spaces that follow the line break: "one\ntwo"

Affected runtime and version

node@18.15

Affected package manager and version

No response

Affected OS and version

No response

Build and bundle tools

No response

wooorm commented 1 year ago

Hi!

> Normally I would expect most ASTs to preserve the original content before any explicit manipulation.

I don’t know of any AST tool that behaves as you describe. ASTs are by definition lossy. They’re abstract.

> Syntax trees come in two flavors:
>
>   • concrete syntax trees: structures that represent every detail (such as white-space in white-space insensitive languages)
>   • abstract syntax trees: structures that only represent details relating to the syntactic structure of code (such as ignoring whether a double or single quote was used in languages that support both, such as JavaScript).

https://github.com/syntax-tree/unist#syntax-tree


So this is impossible. You can find more by searching the organization: https://github.com/search?q=org%3Asyntax-tree+cst&type=issues. Here’s a search that looks through our other organizations too: https://github.com/search?q=CST+user%3Awooorm+org%3Amdx-js+org%3Amicromark+org%3Aremarkjs+org%3Arehypejs+org%3Aretextjs+org%3Avfile+org%3Asyntax-tree+org%3Aunifiedjs&type=issues.
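
To make the lossiness concrete, here is a minimal toy sketch. It does not use mdast-util-from-markdown; `toyParse` and `toySerialize` are hypothetical stand-ins that only mimic one behaviour of Markdown paragraph parsing (leading spaces and tabs on a paragraph line are dropped). It shows that two different sources can produce the same tree, so no serializer could know which one to reproduce.

```javascript
// Toy stand-in for a Markdown parser (an assumption for illustration, not the
// real library): splits on blank lines into paragraphs and drops leading
// spaces and tabs on each line.
function toyParse(source) {
  return source
    .split(/\n{2,}/)
    .filter((block) => block.trim() !== '')
    .map((block) => ({
      type: 'paragraph',
      value: block
        .split('\n')
        .map((line) => line.replace(/^[ \t]+/, ''))
        .join('\n')
    }))
}

// Toy serializer: it only sees the tree, never the original string.
function toySerialize(tree) {
  return tree.map((node) => node.value).join('\n\n')
}

const indented = toyParse('one\n  two')
const flush = toyParse('one\ntwo')

// Both sources yield the same abstract tree.
const sameTree = JSON.stringify(indented) === JSON.stringify(flush)

// Serializing can therefore only produce one of the two spellings.
const roundTripped = toySerialize(indented)

console.log(sameTree, JSON.stringify(roundTripped))
```

Because the indentation never reaches the tree, a round trip cannot restore it. A tool that must keep it has to work from the original string.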

wooorm commented 1 year ago

You might also be running into an XY problem. See our support docs for more info: https://github.com/syntax-tree/.github/blob/main/support.md#asking-quality-questions. Perhaps you can share more about your actual problem: why do you need superfluous whitespace to exist?

fgarcia commented 1 year ago

I wanted to manipulate a Markdown file and modify keywords only in the section/header lines. That is very easy to do by exploring the AST, but when converting back I started to notice that other parts of the document were affected too. Mostly I wanted to modify Markdown and write it back as Markdown, not convert it to HTML.

In the past I did some small JS codemods that manipulated the syntax tree, and I was lucky to never get unexpected side effects. Or, even worse, maybe I just never noticed :worried:

wooorm commented 1 year ago

You shouldn’t see side effects that actually do something: the whitespace does nothing. If you see things that do affect something, let me know.

> JS codemods

Codemods typically work differently. And you can do that with our tools too. They often don’t serialize an AST but change a string. You first need to figure out where things are: our AST gives you positional info for that. Then you can pass that info, and what you want to replace, to something like: https://github.com/Rich-Harris/magic-string.
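
A minimal sketch of that string-based approach, under stated assumptions: `applyEdits` is a hypothetical helper standing in for a library such as magic-string, and the `heading` object is written by hand to mimic the positional info (`position.start.offset` and `position.end.offset`) that a node in the parsed tree carries. In real use the node would come from walking the tree returned by `fromMarkdown(source)`.

```javascript
// Hypothetical helper: apply replacements to a string by character offset.
// Each edit is {start, end, text}; offsets refer to the original source.
function applyEdits(source, edits) {
  // Work from the last edit to the first so earlier offsets stay valid.
  const sorted = [...edits].sort((a, b) => b.start - a.start)
  let result = source
  for (const edit of sorted) {
    result = result.slice(0, edit.start) + edit.text + result.slice(edit.end)
  }
  return result
}

const source = '# Todo list\n\none\n  two\n'

// Hand-written stand-in for a heading node (an assumption for illustration).
const heading = {
  type: 'heading',
  position: {start: {offset: 0}, end: {offset: 11}}
}

// Change a keyword only inside the heading's source range.
const {start, end} = heading.position
const headingSource = source.slice(start.offset, end.offset)
const edited = applyEdits(source, [
  {
    start: start.offset,
    end: end.offset,
    text: headingSource.replace('Todo', 'Done')
  }
])

console.log(JSON.stringify(edited))
```

Only the heading range is rewritten, so everything outside it, including the indentation in `one\n  two`, is carried over byte for byte.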

remcohaszing commented 1 year ago

You may be interested in using remark-cli / remark-language-server / remark for VSCode in combination with unified-consistency.