mathiasbynens / punycode.js

A robust Punycode converter that fully complies with RFC 3492 and RFC 5891.
https://mths.be/punycode
MIT License
1.59k stars 158 forks

ZWJ support #114

Open NetOpWibby opened 3 years ago

NetOpWibby commented 3 years ago

It would be nice if this library supported ZWJ emoji. Currently, it displays such emoji as several instead of just one.

pretended commented 3 years ago

ZWJ emoji are still treated as single emoji, so supporting them would be really great.

mathiasbynens commented 3 years ago

Can you please clarify this bug report? The Punycode encoding is unrelated to “displaying” or rendering characters. What makes you say ZWJ emoji are unsupported? Please provide an example.

pretended commented 3 years ago

[image]

New emoji are displayed as a combination of emoji, as the image shows. It would be great if the new (combined) emoji could be displayed as a single emoji rather than as a combination (a ZWJ sequence) of emoji.

Not really a bug report, but an enhancement.

However, maybe Punycode is unrelated to displaying ZWJ emoji as a single glyph.

NetOpWibby commented 3 years ago

With this library, xn--qq8hq8f is supposed to return 👨‍🦰 (man with red hair). Instead, it outputs 👨🦰 (man, red hair).
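To make the difference between the two strings concrete, here is a small sketch that lists their code points. The helper name `toCodePoints` is made up for illustration and is not part of punycode.js. It only shows that the single-glyph emoji is a three-code-point sequence joined by U+200D (ZWJ), and that dropping the ZWJ leaves two separate emoji.

```javascript
// The single-glyph emoji from the comment above, written with escapes.
const manRedHair = '\u{1F468}\u200D\u{1F9B0}';

// Hypothetical helper: list a string's code points as U+XXXX labels.
const toCodePoints = (s) =>
  Array.from(s, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));

console.log(toCodePoints(manRedHair)); // [ 'U+1F468', 'U+200D', 'U+1F9B0' ]

// Removing the ZWJ leaves two separate emoji: man, red hair.
const withoutZwj = manRedHair.replace(/\u200D/g, '');
console.log(toCodePoints(withoutZwj)); // [ 'U+1F468', 'U+1F9B0' ]
```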

mathiasbynens commented 3 years ago

“With this library, xn--qq8hq8f is supposed to return 👨‍🦰 (man with red hair).”

What makes you say that?

The inverse: 👨‍🦰.com encodes to xn--1ugz855p6kd.com per https://mothereff.in/punycode#%F0%9F%91%A8%E2%80%8D%F0%9F%A6%B0.com, which seems to roundtrip correctly.

NetOpWibby commented 3 years ago

xn--1ugz855p6kd is invalid Punycode; it follows the IDNA2008 standard. It should be xn--qq8hq8f, following the IDNA2003 standard.

Emoji typed on a phone produce the IDNA2003 form.

NetOpWibby commented 3 years ago

Via idna-uts46-hx:

Unfortunately, the situation of internationalized domain names is rather complicated by the existence of multiple incompatible standards (IDNA2003 and IDNA2008, predominantly). While UTS#46 tries to bridge the incompatibility, there are four characters which cannot be so bridged: ß (the German sharp s), ς (Greek final sigma), and the ZWJ and ZWNJ characters. These are handled differently depending on the mode; in transitional mode, these strings are mapped to different ones, preserving compatibility with IDNA2003; in nontransitional mode, these strings are mapped to themselves, in accordance with IDNA2008.

(transitional mode is) compatible with all known browser implementations at this point.

GuillaumeBlanchet commented 2 years ago

IDNA2003 is deprecated nonetheless. cURL uses IDNA2008 like many other things on your computer. Unfortunately, many browsers are not up to date...

NetOpWibby commented 2 years ago

Deprecation or not, there are clear reasons why it's still being used.

The JS Punycode converter library is a great tool for handling Unicode domain names, but it only implements the Punycode encoding of domain labels, not the full IDNA algorithm. In simple cases, lowercasing the input before encoding may seem sufficient, but the full IDNA mapping of strings is far more complex.
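As a rough illustration of why the two labels in this thread differ, the sketch below encodes the same emoji twice: once as-is, and once after removing the joiner characters, which is what transitional processing does to ZWJ and ZWNJ. This is a minimal sketch under stated assumptions, not punycode.js itself and not a UTS #46 implementation: `punycodeEncode` is a compact encoder written from RFC 3492 with overflow checks left out, and `removeJoiners` is a hypothetical helper that covers only the joiner part of the mapping.

```javascript
// RFC 3492 bootstring parameters for Punycode.
const BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;
const INITIAL_BIAS = 72, INITIAL_N = 128;

// Bias adaptation function from RFC 3492, section 6.1.
function adapt(delta, numPoints, firstTime) {
  delta = firstTime ? Math.floor(delta / DAMP) : Math.floor(delta / 2);
  delta += Math.floor(delta / numPoints);
  let k = 0;
  while (delta > Math.floor(((BASE - TMIN) * TMAX) / 2)) {
    delta = Math.floor(delta / (BASE - TMIN));
    k += BASE;
  }
  return k + Math.floor(((BASE - TMIN + 1) * delta) / (delta + SKEW));
}

// Digits 0..25 map to 'a'..'z', digits 26..35 map to '0'..'9'.
const digitToChar = (d) => String.fromCharCode(d < 26 ? 97 + d : 22 + d);

// Encode one label (without the "xn--" prefix), following RFC 3492, section 6.3.
// Overflow checks are omitted to keep the sketch short.
function punycodeEncode(label) {
  const codePoints = Array.from(label, (ch) => ch.codePointAt(0));
  let output = codePoints
    .filter((cp) => cp < INITIAL_N)
    .map((cp) => String.fromCharCode(cp))
    .join('');
  const basicLength = output.length;
  let handled = basicLength;
  if (basicLength > 0) output += '-';

  let n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
  while (handled < codePoints.length) {
    // Smallest code point not yet handled.
    const m = Math.min(...codePoints.filter((cp) => cp >= n));
    delta += (m - n) * (handled + 1);
    n = m;
    for (const cp of codePoints) {
      if (cp < n) delta++;
      if (cp === n) {
        // Emit delta as a generalized variable-length integer.
        let q = delta;
        for (let k = BASE; ; k += BASE) {
          const t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
          if (q < t) break;
          output += digitToChar(t + ((q - t) % (BASE - t)));
          q = Math.floor((q - t) / (BASE - t));
        }
        output += digitToChar(q);
        bias = adapt(delta, handled + 1, handled === basicLength);
        delta = 0;
        handled++;
      }
    }
    delta++;
    n++;
  }
  return output;
}

// Hypothetical helper: drop ZWNJ (U+200C) and ZWJ (U+200D), as transitional
// processing does. Full UTS #46 processing also maps other characters
// (for example ß and ς), normalizes, and validates the result.
const removeJoiners = (label) => label.replace(/[\u200C\u200D]/g, '');

const emoji = '\u{1F468}\u200D\u{1F9B0}'; // man, ZWJ, red hair

console.log('xn--' + punycodeEncode(emoji));                // xn--1ugz855p6kd
console.log('xn--' + punycodeEncode(removeJoiners(emoji))); // xn--qq8hq8f
```

Both outputs are valid Punycode for their respective inputs; they differ only in whether the ZWJ was still present when the label was encoded. Decoding xn--qq8hq8f therefore cannot restore the joiner, because it was removed before encoding.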