Allow unicode domain name and path

Problem

The current Java version of this library fails to recognize URLs containing Unicode characters, even though such URLs can be registered, are supported by browsers, and work in practice. For instance, "http://www.詹姆斯.com/詹姆斯" is not identified as a valid URL. The issue arises because the regular expressions used for URL validation in the Java code do not accept Unicode characters as valid components of domain names and paths.

Solution

To address this issue, I have extended the regular expressions used for URL validation in the Java code. Specifically, I have added the Unicode property classes \p{L} (letters) and \p{M} (combining marks) to the regular expressions that validate the domain name and path of the URL. With this change, the library accepts Unicode characters in both parts of the URL.

Result

With these changes, the library identifies URLs that include Unicode characters in their domain name or path as valid. For example, "http://www.詹姆斯.com/詹姆斯" is now recognized as a valid URL. This broadens the range of URLs the library can recognize and validate, and aligns it more closely with the behavior of modern web browsers.
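
To illustrate the idea behind the change, here is a minimal sketch of a URL check that admits \p{L} and \p{M} in the domain name and path. It is not the library's actual code: the class name `UnicodeUrlSketch`, the method `isValidUrl`, and the simplified patterns are all hypothetical, and the real regular expressions in the library are considerably more detailed.

```java
import java.util.regex.Pattern;

// Hypothetical, simplified sketch; not the regular expressions used by the library.
final class UnicodeUrlSketch {
    // Characters allowed inside a domain label: Unicode letters (\p{L}),
    // combining marks (\p{M}), and digits (\p{N}).
    private static final String LABEL_CHARS = "\\p{L}\\p{M}\\p{N}";

    // A label starts and ends with an allowed character; hyphens may appear in between.
    private static final String LABEL =
        "[" + LABEL_CHARS + "](?:[" + LABEL_CHARS + "-]*[" + LABEL_CHARS + "])?";

    // At least two labels separated by dots, e.g. "www.詹姆斯.com".
    private static final String DOMAIN = LABEL + "(?:\\." + LABEL + ")+";

    // Zero or more path segments made of Unicode letters, marks, digits,
    // and a few common unreserved characters.
    private static final String PATH = "(?:/[" + LABEL_CHARS + "._~%-]*)*";

    private static final Pattern URL =
        Pattern.compile("https?://" + DOMAIN + PATH);

    // Returns true if the whole string matches the simplified URL pattern.
    static boolean isValidUrl(String candidate) {
        return URL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidUrl("http://www.詹姆斯.com/詹姆斯"));
        System.out.println(isValidUrl("http://www.example.com/path"));
    }
}
```

In this sketch, removing \p{L} and \p{M} from `LABEL_CHARS` and substituting an ASCII-only range such as `a-zA-Z` would reproduce the original limitation: the example URL above would no longer match.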

twitter / twitter-text

Allow unicode domain name and path #423