twitter / twitter-text

Twitter Text Libraries. This code is used at Twitter to tokenize and parse text to meet the expectations for what can be used on the platform.
https://developer.twitter.com/en/docs/counting-characters
Apache License 2.0
3.07k stars, 510 forks

Allow unicode domain name and path #423

Open LiuQhahah opened 3 months ago

LiuQhahah commented 3 months ago

Problem

The current Java version of this library fails to recognize URLs that contain Unicode characters, even though browsers support such URLs and they can be registered and used. For instance, "http://www.詹姆斯.com/詹姆斯" is not identified as a valid URL. The issue arises because the regular expressions the Java code uses for URL validation do not accept Unicode characters as valid components of domain names and paths.

Solution

To address this issue, I have extended the regular expressions used for URL validation in the Java code. Specifically, I have added the Unicode property classes \p{L} (letters) and \p{M} (combining marks) to the regular expressions that validate the domain name and path of the URL. With this modification, the library can identify and validate URLs containing Unicode characters.

Result

With these changes, the library can now correctly identify URLs that include Unicode characters in their domain name or path as valid URLs. For example, a URL like "http://www.詹姆斯.com/詹姆斯" will now be correctly identified as a valid URL. This enhancement broadens the range of URLs that the library can recognize and validate, aligning it more closely with the behavior of modern web browsers.
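As a rough illustration of the idea (not the library's actual regular expressions, which are considerably more involved), the sketch below compares an ASCII-only URL pattern with one whose character classes also accept \p{L} and \p{M}. The class name `UnicodeUrlSketch`, the simplified patterns, and the helper methods are all hypothetical and exist only for this example.

```java
import java.util.regex.Pattern;

public class UnicodeUrlSketch {
    // Hypothetical ASCII-only character classes, standing in for the old behavior.
    private static final String ASCII_LABEL =
        "[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?";
    private static final String ASCII_PATH =
        "[a-zA-Z0-9._~%!$&'()*+,;=:@/-]*";

    // Hypothetical Unicode-aware classes: \p{L} matches any letter,
    // \p{M} matches any combining mark.
    private static final String UNICODE_LABEL =
        "[\\p{L}\\p{M}0-9](?:[\\p{L}\\p{M}0-9-]*[\\p{L}\\p{M}0-9])?";
    private static final String UNICODE_PATH =
        "[\\p{L}\\p{M}0-9._~%!$&'()*+,;=:@/-]*";

    // Builds a deliberately simplified pattern:
    // scheme, two or more dot-separated labels, optional path.
    static Pattern urlPattern(String label, String path) {
        return Pattern.compile(
            "https?://" + label + "(?:\\." + label + ")+(?:/" + path + ")?");
    }

    static final Pattern ASCII_URL = urlPattern(ASCII_LABEL, ASCII_PATH);
    static final Pattern UNICODE_URL = urlPattern(UNICODE_LABEL, UNICODE_PATH);

    static boolean isAsciiUrl(String s) {
        return ASCII_URL.matcher(s).matches();
    }

    static boolean isUnicodeUrl(String s) {
        return UNICODE_URL.matcher(s).matches();
    }

    public static void main(String[] args) {
        String url = "http://www.詹姆斯.com/詹姆斯";
        System.out.println(isAsciiUrl(url));   // false
        System.out.println(isUnicodeUrl(url)); // true
    }
}
```

Plain ASCII URLs such as "http://www.example.com/path" match both patterns, so in this sketch the Unicode-aware classes only widen what is accepted.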

CLAassistant commented 3 months ago

CLA assistant check
All committers have signed the CLA.