syntax-tree / mdast-util-to-hast

utility to transform mdast to hast
https://unifiedjs.com
MIT License
102 stars 42 forks source link

Support IDNs / Punycode #44

Closed ilyaigpetrov closed 4 years ago

ilyaigpetrov commented 4 years ago

Subject of the feature

Support Internationalized Domain Names (IDNs) in urls of links.

Problem

If I use IDNs in my markdown, e.g. [example](https://я.рф/я) then I get я.рф part percent-encoded (instead of encoding it with Punycode), e.g. I get https://%D1%8F.%D1%80%D1%84/%D1%8F. Such links become just broken.

Expected behavior

https://я.рф/я must be encoded as https://xn--41a.xn--p1ai/%D1%8F.

Alternatives

Instead of using mdurl/encode I suggest using new URL(node.url).href. I warn you that I'm not sure what is the purpose of mdurl/encode and why it was created and used here.

wooorm commented 4 years ago

Hi there! Pretty interesting, but: the current behavior is compliant to the commonmark reference impeimentation: https://spec.commonmark.org/dingus/?text=%5Bexample%5D(https%3A%2F%2Fя.рф%2Fя). Most markdown parsers behave worse or the same as this project. The downside I see of implementing punycode is that it is big: micromark is ±13kb, punycode is 130kb. However. this is something that could be added to commonmark if you get with you. So, I’ll refer you to their Discourse.

ilyaigpetrov commented 4 years ago

What's wrong with using new URL(idnUrl).href for encoding, not an external Punycode library?

ilyaigpetrov commented 4 years ago

Maybe I can write a plugin for remark that handles it properly? Can we add an option to disable percent-encoding in this syntax tree transformer?

wooorm commented 4 years ago

That is not the main point: the main point is that other markdown parsers do it like we do it. But: new URL(idnUrl).href crashes on certain URLs and markdown never crashes.

Btw: I believe you can do this all yourself in a plugin before mdast-util-to-hast. See Creating a plugin and Tree traversal

ilyaigpetrov commented 4 years ago

I've created a feature request to commonmark here: https://talk.commonmark.org/t/feature-support-of-idns-internationalized-domain-names-punycode/3672 -- Crissov says it is implied that punycode must work.

wooorm commented 4 years ago

Crissov says it is implied that punycode must work.

That’s not how I read that comment. Crissov links to the same Babelmark I linked to.

ilyaigpetrov commented 4 years ago

No, @wooorm: 1) "Markdig and markdown-it convert the link destination as desired" (punycode is desired). 2) Then he links to babelmark as a prove of his words (markdown-it and markdig there use punycode: https://xn--41a.xn--p1ai/%D1%8F). 3) "I’m not sure this needs to be specified explicitly" (the desired behavior, i.e. punycode, needs to be specified explicitly).

Anyway I've asked him directly on the forum --, please, subscribe to the topic so I don't have to repost it here.

wooorm commented 4 years ago

I read it differently:

  1. "Markdig and markdown-it convert the link destination as desired" (punycode is desired).

“There are some parsers that do what you (@ilyaigpetrov) desire. (but most don’t)”

  1. Then he links to babelmark as a prove of his words (markdown-it and markdig there use punycode: https://xn--41a.xn--p1ai/%D1%8F).

The same I showed: some parsers do what you want, most parsers do it differently, unified/remark/micromark does what the commonmark.js reference parser does and GFM does.


Why is this so important? Whether punycode is used or not, it works (taking a live example):

[Unicode](https://räksmörgås.josefsson.org),  [Punycode](https://xn--rksmrgs-5wao1o.josefsson.org).

Unicode, Punycode.

Example w/ unified: https://runkit.com/wooorm/5f91431c66c630001a6f7d71