twitter / twitter-text

Twitter Text Libraries. This code is used at Twitter to tokenize and parse text to meet the expectations for what can be used on the platform.
https://developer.twitter.com/en/docs/counting-characters
Apache License 2.0
3.07k stars 510 forks

Allow emoji domain names #420

Open mhlz opened 1 year ago

mhlz commented 1 year ago

Problem

Currently the Ruby version of this library does not recognize links to domains that include emoji, even though browsers support those domains. Text that includes "https://🌈🌈🌈.st" is not recognized as containing a valid URL. The problem comes from idn-ruby and libidn2, which do not accept emoji characters as valid in domain names, even though such domains are registrable and work fine in browsers (after being translated into punycode).

Solution

I replaced idn-ruby with another Ruby gem that implements the punycode conversion directly in Ruby, without native dependencies, and then added some of the validation that libidn2 performed so that the conformance test suite passes again.
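
For readers unfamiliar with the conversion step, here is a minimal sketch of punycode encoding (RFC 3492) in plain Ruby. It is not the code of the gem used in this PR. The names `punycode_encode` and `to_ascii_host` are hypothetical and exist only for this illustration. The sketch also skips the normalization and validation steps that a full IDNA implementation performs.

```ruby
# Punycode parameters from RFC 3492, section 5.
BASE = 36
TMIN = 1
TMAX = 26
SKEW = 38
DAMP = 700
INITIAL_BIAS = 72
INITIAL_N = 128

# Map a digit value 0..35 to its character: 0..25 -> a..z, 26..35 -> 0..9.
def punycode_digit(d)
  (d < 26 ? d + 97 : d + 22).chr
end

# Bias adaptation function from RFC 3492, section 6.1.
def adapt(delta, num_points, first_time)
  delta = first_time ? delta / DAMP : delta / 2
  delta += delta / num_points
  k = 0
  while delta > ((BASE - TMIN) * TMAX) / 2
    delta /= BASE - TMIN
    k += BASE
  end
  k + ((BASE - TMIN + 1) * delta) / (delta + SKEW)
end

# Hypothetical helper: encode one label (without the "xn--" prefix).
def punycode_encode(label)
  code_points = label.unpack("U*")
  basic = code_points.select { |cp| cp < 128 }
  output = basic.map(&:chr).join
  basic_count = basic.length
  handled = basic_count
  output << "-" if basic_count > 0
  n = INITIAL_N
  delta = 0
  bias = INITIAL_BIAS
  while handled < code_points.length
    # Smallest code point not yet handled.
    m = code_points.select { |cp| cp >= n }.min
    delta += (m - n) * (handled + 1)
    n = m
    code_points.each do |cp|
      delta += 1 if cp < n
      next unless cp == n
      # Emit delta as a generalized variable-length integer.
      q = delta
      k = BASE
      loop do
        t = if k <= bias then TMIN
            elsif k >= bias + TMAX then TMAX
            else k - bias
            end
        break if q < t
        output << punycode_digit(t + (q - t) % (BASE - t))
        q = (q - t) / (BASE - t)
        k += BASE
      end
      output << punycode_digit(q)
      bias = adapt(delta, handled + 1, handled == basic_count)
      delta = 0
      handled += 1
    end
    delta += 1
    n += 1
  end
  output
end

# Hypothetical helper: convert each non-ASCII label of a host name.
# Assumption: no Unicode normalization or IDNA validity checks are applied.
def to_ascii_host(host)
  host.split(".").map { |label|
    label.ascii_only? ? label : "xn--" + punycode_encode(label)
  }.join(".")
end

puts to_ascii_host("bücher.example")
puts to_ascii_host("🌈🌈🌈.st")
```

Because the encoder works on arbitrary code points, emoji labels go through the same path as any other non-ASCII label. Deciding which code points are acceptable is a separate validation step, which is why the validation previously done by libidn2 had to be added separately.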

Result

Text that includes "https://🌈🌈🌈.st" will now be correctly identified as containing a link. Note that there are currently other checks in this library that prevent a bare "🌈🌈🌈.st" (without the scheme) from being parsed as a link. While I would like to make that work as well, I felt that it would be too big of a change.

CLAassistant commented 1 year ago

CLA assistant check
All committers have signed the CLA.