syntax-tree / mdast-util-to-markdown

mdast utility to serialize markdown
http://unifiedjs.com
MIT License
100 stars 20 forks source link

Removable unsafe entries #31

Closed stevemk14ebr closed 3 years ago

stevemk14ebr commented 3 years ago

The options for unsafe does not allow the removal of unsafe entries. I'd like to customize this array but I cannot since .concat is used so passing an empty array does not remove any rules:

https://github.com/syntax-tree/mdast-util-to-markdown/blob/f3df7410049ed426ef8734ec762a38aa2feee73f/lib/configure.js#L18

I want to remove these rules in particular: https://github.com/syntax-tree/mdast-util-to-markdown/blob/f3df7410049ed426ef8734ec762a38aa2feee73f/lib/unsafe.js#L92-L94

wooorm commented 3 years ago

Why do you want to remove those three?

stevemk14ebr commented 3 years ago

Jira nonsense again:

function horizontalRule(node) {
    return { type: 'text', value: '----' }
}

emits:

\----

I'd have to duplicate the whole stringify crate if this isn't customizable :/

wooorm commented 3 years ago

Why use a text node and not return a string? https://github.com/syntax-tree/mdast-util-to-markdown/blob/f3df7410049ed426ef8734ec762a38aa2feee73f/lib/handle/thematic-break.js#L7

stevemk14ebr commented 3 years ago

thanks for the pointer. I was doing these changes in a custom remarkPlugin before which was before the stringify stage so things were escaped. I'm now also adding custom handlers in the stringify to handle the horizontalRule type where i return string exactly.

stringify handler:

[RemarkStringify, {
            handlers: {
                horizontalRule: (node, parent, context) => {
                    return node.value;
                },
            }
        }]

remark-jira

function horizontalRule(node) {
    return { type: 'horizontalRule', value: '----' }
}
wooorm commented 3 years ago

I think your particular use case is better solved by https://github.com/remarkjs/remark/issues/720#issuecomment-849714201 (the other issue). I might be open to revisit this in the future, but there are quite good reasons for these escape patterns to exist: to make sure markdown can roundtrip

stevemk14ebr commented 3 years ago

agreed

about-code commented 3 years ago

I sometimes face a similar situation like @stevemk14ebr. That's my use case:

In short glossarify-md is a tool which you feed Markdown in to get some enhanced Markdown out. As such it might be embedded into some Markdown toolchain. The tool itself is fine with CommonMark and GFM syntax to do its magic. Yet I would like the tool to be as tolerant as possible towards all the other Markdown syntax extensions that exist out there and which users might use in their input files.

Currently, there are two things which can happen when parsing non-standard Markdown:

  1. some syntactic elements get parsed incorrectly into a CommonMark node type due to an overlap/collision of the syntax extension with CommonMark (see Frontmatters with trailing --- dashes, for example, which overlap with "Headings")
  2. some elements get parsed into an mdAst text node due to not being recognized as syntax at all (see comment above)

In terms of 1 there's hardly anything - even the best parser can do apart from offering a great plug-in API. However, often the second happens. The mere fact that a text node is being created - rather than throwing an error - is quite nice. For a markdown-to-markdown tool only the escaping when stringifying the text node can be an unfortunate mutation of the user's inputs and break the user's MD toolchain.

Said this, I consider it absolutely the right choice for mdast-util-to-string to operate on the premise that mdast-util-to-markdown has created a semantically sound AST where a text node means regular text. Therefore by default it should definitely escape text content and maintain semantic interoperability, that is, make sure any subsequent parser of the stringified output reads it as regular text again. The issue in the face of an MD pipeline and syntax extensions is that mdast-util-to-markdown's interpretation as regular text isn't semantically correct in that context even though it is the only possible for it.

As a consequence the premise of mdast-util-to-string doesn't hold true anymore as well. A text node might not hold a value which represents regular text but unknown markdown syntax which should not be escaped. There could be a another parser in the pipeline capable of understanding the syntax.

The reason why I do not push for reopening this currently is, because there are no user-reported issues for glossarify-md ATM and it is possible for them to enjoy the remark-plugin ecosystem. Still, the space of MD syntax extensions is so broad that I fear even the remark ecosystem doesn't provide a plug-in for each. That's why I would be glad if I could point users to a remark setting by which they could disable escaping. Maybe I would even make glossarify-md to disable it by default.

ChristianMurphy commented 3 years ago

The issue in the face of an MD pipeline and syntax extensions is that mdast-util-to-markdown's interpretation as regular text isn't semantically correct in that context even though it is the only possible for it.

How so? Syntax extensions will have their own node types and stringifiers, and don't influence the interpretation of text.

about-code commented 3 years ago

Syntax extensions will have their own node types and stringifiers, and don't influence the interpretation of text

That's only the case if there's a corresponding remark plug-in with support for the syntax. The plug-in would create the node type and provide the stringifiers, wouldn't it? In case of glossarify-md it means (in some cases) that users have to install and configure a plug-in just to get out unmodified what they've passed in. Sometimes this weren't necessary if escaping could just be disabled and unknown syntax would be simply considered text (as it is, obviously if there's no plug-in which creates the node types) but then written as text again - yet unescaped text, though.

about-code commented 3 years ago

Here's an example of what I had to do specifically to support vuepress' [[toc]] instruction in glossarify-md to workaround escaping: https://github.com/about-code/glossarify-md/blob/c99d5c67fc022311109d84b6857f498d30982cd6/lib/ast/with/toc-instruction.js

This is actually similar to writing a particular plug-in but also a case where I couldn't find one that fits exactly, so had to write it (partly) myself. This isn't an approach that scales to the multitude of markdown flavours out there, I am afraid.

ChristianMurphy commented 3 years ago

That's only the case if there's a corresponding remark plug-in with support for the syntax. The plug-in would create the node type and provide the stringifiers, wouldn't it?

Yes, syntax extensions generally need both a parser and stringifier.

In case of glossarify-md it means (in some cases) that users have to install and configure a plug-in just to get out unmodified what they've passed in

Right, remark is an AST, not a CST. Structure (as parsed) is preserved, exact formatting is not (more context in https://github.com/syntax-tree/mdast-util-to-markdown/issues/3)

about-code commented 3 years ago

Thanks for the quick reponse and pointer on #3. Even though it is not a CST (wasn't aware there's a difference like AST/CST) is there anything that prevents making escaping an opt-out feature?

ChristianMurphy commented 3 years ago

Disabling escapes may (or may not) preserve custom markdown syntax, but it will break valid commonmark/gfm markdown in places where escapes should happen. It trades one problem for another, without solving the core issue you seem to be getting at, which is preserving exact format.

about-code commented 3 years ago

Thank you very much for revisiting the decision. Can go along with it.

mark-tate commented 2 years ago

This is a great package and I have enjoyed using it. However, I have had to work around this limitation.

My particular use-case, was that we are using this code as a 'codemod' for the original Markdown/MDX, this is part of migrating our docs from one format of MDX to another. We do not want to add escape characters at this point.

We want to preserve the format of the original code, without any additional escaping, so we can compare the old and new and ensure only the codemods have had an impact on the original docs.

I would agree that breaking valid commonmark/gfm markdown is undesirable but we can move the application of escaping unsafe characters elsewhere. Most of our authors are non-technical, so asking them to understand when to escape characters in documents, is not going to happen, instead, we can automate this step, as part of the final build/render service.

I worked around this using patch-package to prevent unsafe being passed as an array to configure. This is slightly cumbersome, the implementation of 'unsafe' uses an array and spread to merge them together, so it's possible to add additional characters but not replace or remove them. In an ideal situation, we would be able to override this option more cleanly without using patch package.

Has anyone else, faced a similar issue, how have you worked around things ?

wooorm commented 2 years ago

Most of our authors are non-technical, so asking them to understand when to escape characters in documents, is not going to happen, instead, we can automate this step, as part of the final build/render service.

I would strongly recommend that you do not force humans to write JSX if they can’t be taught to read error messages and escape characters in certain cases.

This is slightly cumbersome

Not just “cumbersome”, your approach is broken. Please read through the above conversation.

In an ideal situation, we would be able to override this

No, breaking markdown and MDX constructs is not ideal.


If your problem is migrating MDX 1 to MDX 2, see https://mdxjs.com/migrating/v2/ and https://github.com/hashicorp/web-platform-packages/pull/8. You can also link authors to the MDX website, e.g., this article: https://mdxjs.com/docs/troubleshooting-mdx/. Perhaps it’s also better to ask specific questions in the MDX org.

ocavue commented 1 year ago

I'm using rehype-remark to turn HTML into Markdown. I found that it produce some unnecessary escape.

Input HTML:

<div>
    <h1>Phone Number</h1>
    <p>+1-1234567890</p>
    <h1>Task list</h1>
    <ul>
        <li>
            <input type="checkbox" checked="true">Task list item
        </li>
    </ul>
</div>

Output Markdown:

# Phone Number

\+1-1234567890

# Task list

* \[x] Task list item
The source of my code ```ts import {unified} from 'unified' import rehypeParse from 'rehype-parse' import rehypeRemark from 'rehype-remark' import remarkStringify from 'remark-stringify' main() async function main() { const html = `

Phone Number

+1-1234567890

Task list

  • Task list item
` const file = await unified() .use(rehypeParse) .use(rehypeRemark, {checked: '[x] ', unchecked: '[ ] ', }) .use(remarkStringify, {listItemIndent: "one"}) .process(html) console.log(String(file)) } ```

Notice that the two \ in the output are unnecessary (e.g. removing the \ before + won't make it become a markdown unordered list item). I know that CommonMark allow unnecessary escape, however, these \ will confuse whoever don't understand Markdown/escape. After all, one of the bigger advantages of Markdown its great plain text readability, so that people other than programmer can read and write with this language.

I would love to have an option to disable or override some rules in unsafe.js, so that I can ensure the readability (by sacrificing the correctness of generated markdown in rare cases).

wooorm commented 1 year ago

This conversation is going around in circles: this has been answered already: “I would love to have an option to disable” <-- it doesn’t work. It breaks everything.

these \ will confuse whoever don't understand Markdown/escape

And that’s the best solution: people need to learn how markdown works. How to escape things is essential to know in HTML, in JavaScript, in JSX, in any language. And also in markdown.

so that I can ensure the readability (by sacrificing the correctness of generated markdown in rare cases)

If you want to ensure readability, you can instead PR to this project itself by improving when it does and doesn’t escape. Particular the first case could be improved. Because only a +, at a break, followed by [\ \t\r\n] can turn into a list item:

This is much more complex for * and -, which can both start thematic breaks, and the latter can also start setext heading underlines.


I don’t see the escape of [x] going away. If there is an [x]: y added anywhere, it will otherwise turn into a link. unified prevents that from happening.

ocavue commented 1 year ago

it doesn’t work. It breaks everything.

I know. That's some risk I want to take.

people need to learn how markdown works.

I don't think this will work in all case. Imagine that I'm developing an IM App like Slack. We want to use Markdown (at least some of the most common used markdown syntax) in the comment editor. When a user copy some html and paste it into the editor, we want to transform that html into markdown. At that time, these "weird / characters" shows up and users get confuse. Yes their markdown is valid but they still get question. I really don't want to teach a random Slack user what's escape and why it's necessary.

Particular the first case could be improved.

I noticed that part and I'd love to submit a PR to improve it.

Update: I've submitted a PR for it https://github.com/syntax-tree/mdast-util-to-markdown/pull/57

I don’t see the escape of [x] going away.

This is kind of the trade-off I want to make. [x] as checkbox is more common than reference-style links. If I can get ride of / then I'd like to sacrifice the reference-style.


This conversation is going around in circles

I totally understand if you don't want to add a danger API that "breaks everything". Anyway, I just found out how to avoid some escape with the existed API: override the text handle and remove some unsafe rules in the context. If someone who really want to override some unsafe rule, the following code could be an example.

The source of my code ```ts import {type Handle, type Handlers, type Unsafe} from 'mdast-util-to-markdown' import {safe} from 'mdast-util-to-markdown/lib/util/safe' import {text} from 'mdast-util-to-markdown/lib/handle/text' import rehypeParse from 'rehype-parse' import rehypeRemark from 'rehype-remark' import remarkStringify from 'remark-stringify' import {unified} from 'unified' // By default, mdast-util-to-markdown will escape some characters in some // conditions. See `unsafe.js` for all rules: // https://github.com/syntax-tree/mdast-util-to-markdown/blob/1.3.0/lib/unsafe.js#L21 // // In our case, we don't want to apply some of them. function unsafeFilter(rule: Unsafe): boolean { // We don't want to escape '+' at the beginning of a line. This is escaped by // default because '+' can be used to start a list item. However, most time we // use '-' or '*' to start a list item, and '+' is command as the start of a // phone number. if (rule.character === '+') { return false } // We don't want to escape '[' as it's wildly used in checkbox and backlink. if (rule.character === '[') { return false } return true } const textHandle: Handle = (node, parent, context, safeOptions) => { return text( node, parent, {...context, unsafe: context.unsafe.filter(unsafeFilter)}, safeOptions, ) } function htmlToMarkdown(html: string): string { const htmlSyntaxTree = unified().use(rehypeParse).parse(html) const markdownSyntaxTree = unified() .use(rehypeRemark, {checked: '[x] ', unchecked: '[ ] '}) .runSync(htmlSyntaxTree) const handlers: Handlers = {text: textHandle} const markdown = unified() .use(remarkStringify, { handlers, listItemIndent: 'one', }) .stringify(markdownSyntaxTree) return markdown } function main() { const html = `

Phone Number

+1-1234567890

Task list

  • Task list item
` const markdown = htmlToMarkdown(html) console.log(markdown) } main() ```

Output Markdown:

# Phone Number

+1-1234567890

# Task list

* [x] Task list item
wooorm commented 1 year ago

We want to use Markdown (at least some of the most common used markdown syntax) in the comment editor

And that’s the thing, this project is about standards-based markdown, not about some markdown-like features.


On the <input type="checkbox" checked="true">, there is already handling of that: https://github.com/syntax-tree/hast-util-to-mdast/blob/dd1fe81147b4daaedee116d2ef490c3a6d517eef/lib/handlers/li.js#L31-L45. I think in this case it might be because of https://github.com/syntax-tree/hast-util-to-mdast/blob/dd1fe81147b4daaedee116d2ef490c3a6d517eef/lib/handlers/li.js#L28, which wants a <p> element there. I think it would work if it was refactored to:

  if (p(head)) {
    head = head.children[0]
  }

  if (
    input(checkbox) &&
    …