mdx-js / mdx

Markdown for the component era
https://mdxjs.com
MIT License

Markdown link syntax is not supported #2113

Closed guoyunhe closed 2 years ago

guoyunhe commented 2 years ago

Initial checklist

Affected packages and versions

@mdx-js/rollup 2.1.2

Link to runnable example

No response

Steps to reproduce

  1. go to https://mdxjs.com/playground/
  2. type the following Markdown in the editor
My blog is <https://guoyunhe.me/>
My email is <i@guoyunhe.me>

Expected behavior

Should be rendered as links.

It is very easy to distinguish Markdown autolinks from JSX components: a URL must contain a / and an email address must contain an @, whereas a JSX tag name must not contain / or @.

Actual behavior

Compile error.

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

wooorm commented 2 years ago

This is intentional as there are cases where autolinks conflict with JSX tags. It is documented here: https://mdxjs.com/docs/what-is-mdx/#markdown.

tats-u commented 1 year ago

@wooorm XML shouldn't accept :// or @ in a tag name, right?

We don't have to consider other schemes without // because they are useless as auto links.

wooorm commented 1 year ago

Read the comment above your comment. Use links: [text](url), raw URLs are bad anyway. If you want raw URLs, use remark-gfm.

tats-u commented 1 year ago

@wooorm I've read it. : itself is valid in XML tag names, but / and @ are not.

https://www.w3.org/TR/xml/

Names and Tokens

[4]  NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]  Name ::= NameStartChar (NameChar)*

Start-tag

[40] STag ::= '<' Name (S Attribute)* S? '>'   [WFC: Unique Att Spec]

  1. Try parsing as JSX
  2. If fails, try parsing as autolinks

This should be perfect for URLs with :// and emails without space. Of course you can treat links with : without / as errors or XML.
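The XML productions quoted above can be checked mechanically. The following is a minimal sketch, restricted to the ASCII part of `NameStartChar` and `NameChar` (the non-ASCII ranges are left out for brevity, and `isAsciiXmlName` is a hypothetical helper name). It only speaks to XML names, not to what the JSX grammar accepts.

```typescript
// ASCII-only subset of the XML 1.0 productions [4], [4a] and [5].
// Assumption: the non-ASCII ranges of NameStartChar/NameChar are omitted.
const ASCII_XML_NAME: RegExp = /^[:A-Z_a-z][:A-Z_a-z\-.0-9]*$/;

// Hypothetical helper: true when `text` is an (ASCII) XML Name.
function isAsciiXmlName(text: string): boolean {
  return ASCII_XML_NAME.test(text);
}

// The two autolink bodies from the original report, plus a namespaced tag name.
const candidates: string[] = ["https://guoyunhe.me/", "i@guoyunhe.me", "svg:rect"];

// "/" and "@" are not NameChars, ":" is.
const results: boolean[] = candidates.map(isAsciiXmlName); // [false, false, true]
```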

Use links: [text](url), raw URLs are bad anyway.

In technical documents, we sometimes want users to remember or copy and paste URLs.

This autolink feature is important in Japanese (and Chinese). remark-gfm can't cover this perfectly.

VS Codeのソースコードは、<https://github.com/microsoft/vscode>で入手できます。 ("The VS Code source code is available at <https://github.com/microsoft/vscode>.")

Unlike in Western languages, we often don't insert spaces between URLs and the surrounding native words: https://github.com/prettier/prettier/issues/6385. Some people do, but it's not very common (especially in Japanese).

wooorm commented 1 year ago

  1. This project has nothing to do with XML. XML is not JSX. JSX relates to JS. MDX relates to JSX and markdown (CommonMark).
  2. / is fine in JSX: it closes a tag.
  3. Try parsing as JSX; If fails, try parsing as autolinks

    No. Ambiguous grammars like that are slow, unsafe, and hard to explain to users.

  4. The mailto and xmpp protocols are written without //. Anyone can come up with any protocol, and in JSX, anyone can use any namespace. <svg:rect/> is an autolink according to markdown. <whatever:whatever> is an autolink according to markdown. And they are also valid JSX.

tats-u commented 1 year ago

@wooorm

This project has nothing to do with XML. XML is not JSX. JSX relates to JS.

SVG is XML. The X in JSX stands for XML(-like). XHTML, which is HTML written as XML, doesn't allow <br> without a closing </br>, just like JSX.

/ is fine in JSX: it closes a tag.

That's just a single /, not a double one.

<svg:rect/>

: is not adjacent to / so it should be compatible with http(s)://.

wooorm commented 1 year ago

JavaScript isn’t Java. SVG in HTML isn’t XML. SVG in JSX isn’t XML.


https://spec.commonmark.org/dingus/?text=a%20%3Csvg%3Arect%2F%3E%20b

^-- it links.

https://spec.commonmark.org/dingus/?text=a%20%3Cwhatever%3Awhatever%3E%20b

^-- it links.
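Why both examples link can be seen from the CommonMark definition of a URI autolink: a scheme of 2 to 32 characters (an ASCII letter followed by letters, digits, `+`, `.` or `-`), a colon, and then any characters other than ASCII control characters, space, `<` and `>`. A minimal sketch of that rule as a regular expression follows; `isCommonMarkUriAutolink` is a hypothetical helper name, and email autolinks are not covered.

```typescript
// CommonMark URI autolink: "<" scheme ":" rest ">"
//   scheme = ASCII letter, then 1 to 31 letters, digits, "+", "." or "-"
//   rest   = anything except ASCII control characters, space, "<" and ">"
const COMMONMARK_URI_AUTOLINK: RegExp =
  /^<[A-Za-z][A-Za-z0-9+.\-]{1,31}:[^\x00-\x20\x7F<>]*>$/;

// Hypothetical helper: true when `text` is a complete URI autolink.
function isCommonMarkUriAutolink(text: string): boolean {
  return COMMONMARK_URI_AUTOLINK.test(text);
}

const svgRect: boolean = isCommonMarkUriAutolink("<svg:rect/>"); // true: "svg" is a valid scheme
const whatever: boolean = isCommonMarkUriAutolink("<whatever:whatever>"); // true
const plainTag: boolean = isCommonMarkUriAutolink("<div>"); // false: no scheme and colon
```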

tats-u commented 1 year ago

@wooorm

SVG in HTML isn’t XML. SVG in JSX isn’t XML.

Their syntax is very similar to XML. See the next flowchart.

Do you understand what I want to say?

flowchart TD
    left(["&lt;"]) --> ns["NameSpace or Tag"]
    ns --> colon[:]
    colon --> ns
    ns --> at["@"]
    at --> domain["E-mail domain"]
    domain --> em([E-mail])
    ns --> ems["E-mail specific symbols\ne.g. +"]
    ems --> emn["Email account"]
    emn --> at
    ns --> sp["␣ or /"]
    colon --> sl["//"]
    sl --> url([URL])
    sp --> jsx([JSX])

This flowchart perfectly recognizes <svg:links/> as JSX.
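For illustration, the flowchart can be transcribed fairly literally into a small classifier. This is only a sketch of the chart as drawn: the character classes for "NameSpace or Tag" and "E-mail specific symbols" are assumptions, `classifyByFlowchart` is a hypothetical name, and any input the chart has no edge for is reported as `"unknown"`.

```typescript
type Classification = "url" | "email" | "jsx" | "unknown";

// Hypothetical transcription of the flowchart; not the MDX or CommonMark grammar.
function classifyByFlowchart(text: string): Classification {
  if (text.charAt(0) !== "<") return "unknown";
  let index = 1;
  while (index < text.length) {
    const ch: string = text.charAt(index);
    if (/[A-Za-z0-9_.\-]/.test(ch)) {
      index += 1; // still inside "NameSpace or Tag" (assumed character class)
    } else if (ch === ":") {
      if (text.slice(index + 1, index + 3) === "//") return "url";
      index += 1; // namespace separator: back to "NameSpace or Tag"
    } else if (ch === "@") {
      return "email";
    } else if (/[!#$%&'*+=?^|~]/.test(ch)) {
      // "E-mail specific symbols": only an "@" can follow the account name.
      return text.indexOf("@", index) !== -1 ? "email" : "unknown";
    } else if (ch === " " || ch === "/") {
      return "jsx";
    } else {
      return "unknown"; // no edge in the flowchart, for example a bare ">"
    }
  }
  return "unknown";
}

const namespaced: Classification = classifyByFlowchart("<svg:links/>"); // "jsx"
const url: Classification = classifyByFlowchart("<https://github.com/microsoft/vscode>"); // "url"
const email: Classification = classifyByFlowchart("<i@guoyunhe.me>"); // "email"
```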

wooorm commented 1 year ago

You haven't opened .svg images in a text editor, have you? It's just XML.

That is not what you were saying. That is (sometimes, depending on doctype) XML. Because that is not SVG in HTML. Please read the HTML spec on how foreign content works in HTML. Thanks for changing that.

lol

Don’t be insulting. Thanks for changing that.

Their syntax is very similar to XML.

That is different.

Do you understand what I want to say?

No, you are saying different things all the time.

This flowchart perfectly recognizes <svg:links/> as JSX.

You made a flow chart of something that doesn’t match the JSX grammar and that doesn’t match the markdown grammar.

Here’s the grammar for autolinks: https://github.com/wooorm/markdown-rs/blob/24d78c3420980f3e56fc420f8efc7e601b144ee7/src/construct/autolink.rs#L9-L20. Here’s the grammar for a JSX tag: https://github.com/wooorm/markdown-rs/blob/24d78c3420980f3e56fc420f8efc7e601b144ee7/src/construct/partial_mdx_jsx.rs#L12-L59

They conflict.

Introducing some new grammar that doesn’t match JSX and doesn’t match markdown doesn’t sound like a great idea.

Your flowchart is missing a lot of detail.

tats-u commented 1 year ago

Here’s the grammar for autolinks: https://github.com/wooorm/markdown-rs/blob/24d78c3420980f3e56fc420f8efc7e601b144ee7/src/construct/autolink.rs#L9-L20. Here’s the grammar for a JSX tag: https://github.com/wooorm/markdown-rs/blob/24d78c3420980f3e56fc420f8efc7e601b144ee7/src/construct/partial_mdx_jsx.rs#L12-L59

They conflict.

Have you thought of merging these grammars? I've just tried a little. Why not give priority to analysis as JSX, and if it is not accepted, why not just analyze it as an address? Is it so hard to do so? And does this JS MDX parser use that Rust code as wasm? If so, this issue can be postponed, because we would have to prepare a new Rust crate for the integrated parser. "PR only" is much better than the current "won't fix".

Your flowchart is missing a lot of detail.

Don't complain about the brief outline I wrote in a short time. Suggest me some "corner cases".

Introducing some new grammar that doesn’t match JSX and doesn’t match markdown

I don't say this is an introduction. Just a (partial) revival of the feature you've thrown away.

wooorm commented 1 year ago

Have you thought of merging these grammars? I've just tried a little.

This is the problem. This is open source. I think it doesn’t work. You come up and say that it does. That’s on you to come up with something that does.

Why not give priority to analysis as JSX, and if it is not accepted

JS(X) has syntax errors. It doesn’t say “didn’t match”, let’s try other things. It says “crash!”.

And does this JS MDX parser use that Rust code as wasm?

Maybe look through how the projects work? I.e., see https://mdxjs.com/packages/mdx/#architecture for docs, or read the code. There are Rust and JavaScript versions.

Don't complain about the brief outline I wrote in a short time.

Spend time on your proposals. It’s on you to convince people.

Suggest me some "corner cases".

No, you can come up with something that works.

I don't say this is an introduction. Just a (partial) revival of the feature you've thrown away.

It is: you propose a new grammar. Something that isn’t JSX and something that isn’t markdown autolinks.

tats-u commented 1 year ago

It looks quite difficult to implement realistically from what you've said. OK. That's it.

It says “crash!”.

This is critical. So we have to create a JS(X)-based original parser instead of combining existing JSX and Markdown parsers, but... It's never what we want to maintain. Thanks.

tats-u commented 1 year ago

Is the processing order Micromark → JSX parser? If so, the change would be from "Always disable automatic links" to "Disable automatic links that have even the slightest chance of being JSX (including <svg:links/>)".

It looks easier than I thought it would earlier. But it does not seem to be the first issue to be handled in this repository. Micromark-related repos are more appropriate. Bye.

wooorm commented 1 year ago

Micromark parses ESM/expressions/JSX itself.

tats-u commented 1 year ago

Always disable automatic links

https://github.com/micromark/micromark-extension-mdx-md

Disable automatic links that have even the slightest chance of being JSX

Copy and modify https://github.com/micromark/micromark/blob/main/packages/micromark-core-commonmark/dev/lib/autolink.js or add an option to it (Need to look more closely at the code to determine if it is possible but looks difficult)

Micromark parses ESM/expressions/JSX itself.

You're right according to https://github.com/micromark/micromark-extension-mdxjs

wooorm commented 1 year ago

You're right according to

Yes, I wrote all this code 😅

Need to look more closely at the code to determine if it is possible but looks difficult

Difficult is one thing.

It’s unclear what you propose.

That’s the problem.

You can also make a flow chart, or BNF grammar with what I provided above.

To make your ideas much more concrete, to show that it actually works, and how it would work.

tats-u commented 1 year ago

Here is a draft of a BNF. I don't know whether it can work with the LR-like or PEG parser used by Micromark (I haven't read the details of Micromark's parser yet).

mustNeverBeJsx ::= "<" maybeUrlOrEmail ">"
maybeUrlOrEmail ::= httpOrHttps | email
httpOrHttps ::= "http" /s?/ "://" urlContent
urlContent ::= (at least conforms to the CommonMark autolink's specification)
email ::= commonAccount "@" domain
commonAccount ::= /[!#$%&'*+\-/=?^_`|}~a-zA-Z0-9.]{1,64}/
domain ::= (at least conforms to the CommonMark autolink's specification / max length is 255 / ascii-only)

Scheme names consist of a sequence of characters. The lower case letters "a"--"z", digits, and the characters plus ("+"), period ("."), and hyphen ("-") are allowed.

https://www.rfc-editor.org/rfc/rfc1738#section-2.1

RFC 5322 allows quoted-string addresses (not accepted by JSX) like "foo \"@\ bar"@example.com, but they are not common.

I don't know if { in emails is interoperable with JSX. {... is not a valid email account name, because consecutive periods outside a quoted string are invalid in emails.

This only has to cover most cases; it doesn't have to cover corner cases (especially for emails).

It’s unclear what you propose. That’s the problem.

Currently the MDX parser disables all of the autolink parser, but I wish we could temporarily replace it with a new restricted version interoperable with JSX only when used by MDX.

I'd submit a PR to the Micromark repos if I had time.


The GFM autolink feature is sometimes incompatible with Japanese. I don't like inserting space between Japanese and URLs. I would keep it as a last resort. This is why I wish MDX had this feature.

GFM: 私のポートフォリオは、https://日本語domain.example/で公開しています

This must be

私のポートフォリオは、https://日本語domain.example/で公開しています。 ("My portfolio is published at https://日本語domain.example/.")

However, there are some URLs like https://example.com/articles/日本語title.
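A rough sketch of why whitespace-delimited literal autolinks struggle with such text. It assumes a simplified rule (the link runs from `http://` or `https://` up to the next whitespace or `<`), which approximates but does not reproduce the GFM autolink extension; `findLiteralUrl` is a hypothetical helper.

```typescript
// Simplified literal-URL scanner: stops only at whitespace or "<".
// Assumption: trailing-punctuation trimming and domain validation are left out.
function findLiteralUrl(text: string): string | null {
  const match: RegExpExecArray | null = /https?:\/\/[^\s<]*/.exec(text);
  return match === null ? null : match[0];
}

const japanese: string =
  "VS Codeのソースコードは、https://github.com/microsoft/vscodeで入手できます。";
const english: string = "The source is at https://github.com/microsoft/vscode today.";

// With no space after the URL, the Japanese text that follows is swallowed.
const swallowed: string | null = findLiteralUrl(japanese);
// "https://github.com/microsoft/vscodeで入手できます。"
const clean: string | null = findLiteralUrl(english);
// "https://github.com/microsoft/vscode"
```

Because Japanese text can also legitimately appear inside a URL path, a scanner like this has no reliable place to stop without an explicit delimiter such as `<` and `>`.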

Even in Western languages, have you ever wished you could put punctuation right after a URL? (not so common though)

How to create a slide using Google Spreadsheets:

  1. Go to https://slide.new.

wooorm commented 1 year ago

I don't know whether it can work with the LR-like or PEG parser used by Micromark (I haven't read the details of Micromark's parser yet).

It is impossible to parse markdown with a PEG parser, or any similar thing, because markdown is not a regular language. micromark (and markdown-rs) are instead implemented as state machines. The BNF grammars are for reference. More on this here: https://github.com/wooorm/markdown-rs/blob/56cd834cf88a58d3429b4b75489f161d57b28eaa/src/construct/mod.rs#L93C2-L112.

Here is a draft of a BNF.

Thanks. I’ve made it a bit more like how the other grammars are written and used the other parts that CM does (https://github.com/wooorm/markdown-rs/blob/56cd834cf88a58d3429b4b75489f161d57b28eaa/src/construct/autolink.rs).

autolink_not_jsx ::= '<' (email | url_not_jsx) '>'
email ::= ; see `email` in https://github.com/wooorm/markdown-rs/blob/56cd834cf88a58d3429b4b75489f161d57b28eaa/src/construct/autolink.rs
url_not_jsx ::= ('h' | 'H') 2('t' | 'T') ('p' | 'P') ['s' | 'S'] ':' 2'/' *url_byte
url_byte ::= ; see `url_byte` in https://github.com/wooorm/markdown-rs/blob/56cd834cf88a58d3429b4b75489f161d57b28eaa/src/construct/autolink.rs
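To make the restricted grammar concrete, here is a minimal sketch of it as a regular expression. It assumes that the linked `email` and `url_byte` productions match the CommonMark spec's definitions (the email pattern below is the one given in the CommonMark spec); `isAutolinkNotJsx` is a hypothetical helper name, and this is not how micromark or markdown-rs are implemented.

```typescript
// email, per the CommonMark spec's email autolink definition (assumed to match `email`).
const EMAIL: string =
  "[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?" +
  "(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*";

// url_not_jsx: case-insensitive "http" or "https", then "://", then url_byte*.
// url_byte (assumed): anything except ASCII control characters, space, "<" and ">".
const URL_NOT_JSX: string = "[hH][tT]{2}[pP][sS]?://[^\\x00-\\x20\\x7F<>]*";

// autolink_not_jsx ::= '<' (email | url_not_jsx) '>'
const AUTOLINK_NOT_JSX: RegExp = new RegExp("^<(?:" + EMAIL + "|" + URL_NOT_JSX + ")>$");

function isAutolinkNotJsx(text: string): boolean {
  return AUTOLINK_NOT_JSX.test(text);
}

const httpsLink: boolean = isAutolinkNotJsx("<https://guoyunhe.me/>"); // true
const emailLink: boolean = isAutolinkNotJsx("<i@guoyunhe.me>"); // true
const namespacedTag: boolean = isAutolinkNotJsx("<svg:rect/>"); // false: left to JSX
const mailtoLink: boolean = isAutolinkNotJsx("<mailto:i@guoyunhe.me>"); // false: other schemes are excluded
```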

Q: why did you not include " or { in common_account? CM does in ascii_atext

RFC 5322

We need to refer to CM. CM doesn’t allow everything that RFC 5322 allows. No need to worry about that.

I don't know if { in emails is interoperable with JSX. {... is not a valid email account name, because consecutive periods outside a quoted string are invalid in emails.

JSX currently needs a spread. But that’s an intentional narrowing down. It’s more that: currently, that’s the only value allowed inside the braces.

I wish we could temporarily replace it with a new restricted version interoperable with JSX only when used by MDX.

What do you mean with “temporarily replace”?

I am very much not sure whether it’s a good idea to implement a different algorithm that isn’t CM or JSX. It is hard to explain to users. It doesn’t cover all their needs (other protocols). I don’t want to maintain a list of protocols that are supported here with two slashes (such as ftp). Protocols without slashes exist (such as mailto, tel), and it is difficult to explain to users why those don’t work.

The GFM autolink feature is sometimes incompatible with Japanese. I don't like inserting space between Japanese and URLs. I would keep it as a last resort. This is why I wish MDX had this feature.

There are viable alternatives already, you can use:

  1. actual links with accessible, descriptive text:
    xxx[yyy](https://example.com)zzz
  2. actual links with the URL visible:
    xxx[https://example.com](https://example.com)zzz
  3. actual links with a subset of the URL visible, in code, because it’s not words, it’s code (my preference):
    xxx[`example.com`](https://example.com)zzz
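Existing content that already uses autolinks could be converted to the second or third form mechanically. A minimal sketch, assuming a naive regular expression over the source text: it does not skip code spans or fenced code, it does not escape characters such as `)` inside URLs, and `expandAutolinks` is a hypothetical helper that is not part of MDX or remark.

```typescript
type LinkStyle = "url" | "host-in-code";

// Rewrite <http(s)://...> autolinks into explicit markdown links.
// Assumption: plain prose input; code blocks and unusual URLs are not handled.
function expandAutolinks(markdown: string, style: LinkStyle): string {
  return markdown.replace(
    /<(https?:\/\/([^\s\/<>]+)[^\s<>]*)>/gi,
    (_whole: string, url: string, host: string): string =>
      style === "url"
        ? "[" + url + "](" + url + ")"
        : "[`" + host + "`](" + url + ")"
  );
}

const source: string = "xxx<https://example.com>zzz";
const withUrl: string = expandAutolinks(source, "url");
// "xxx[https://example.com](https://example.com)zzz"
const withHost: string = expandAutolinks(source, "host-in-code");
// "xxx[`example.com`](https://example.com)zzz"
```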

Perhaps you might be interested in reading more about providing useful link text: https://wcag.com/blog/writing-meaningful-link-text/

ever wish you could put punctuation right after the URL? (not so common though)

I do frequently use that. Your post shows it working. That has to do with how GH works though. You can use remark-gfm with Latin-script or Cyrillic languages to get that behavior in MDX.

Without it, you can use actual links to get that behavior:

Xxx[yyy](url).