linkedin / URL-Detector

A Java library to detect and normalize URLs in text

Japanese Characters cause the entire string to be detected as a URL #39

Open joeyfedor opened 1 year ago

joeyfedor commented 1 year ago

If you run the detector on the text below, it treats the whole text as a single URL.

我进入你的主页很卡顿,也许是你的关注人数或者其他数据太多了,其他人主页没有这么卡顿。来自amethyst客户端

The characters 。 and , are single characters, and this library does not treat them as spaces, so the text is never split at them.
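One possible workaround, sketched here as an assumption rather than anything the library provides, is to replace CJK and full-width punctuation with ASCII spaces before handing the text to the detector. The class and method names below (`CjkPunctuationSpacer`, `isCjkPunctuation`, `spacePunctuation`) are hypothetical, the code uses only the Java standard library, and it has not been verified against URL-Detector itself.

```java
// Hypothetical pre-processing helper; not part of linkedin/URL-Detector.
final class CjkPunctuationSpacer {
    private CjkPunctuationSpacer() {}

    /**
     * Returns true for non-letter, non-digit code points in the
     * "CJK Symbols and Punctuation" and "Halfwidth and Fullwidth Forms"
     * blocks, e.g. the ideographic full stop (U+3002) and the
     * full-width comma (U+FF0C).
     */
    static boolean isCjkPunctuation(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        boolean inCjkBlock =
                block == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || block == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
        // Keep full-width letters/digits and half-width katakana untouched.
        return inCjkBlock && !Character.isLetterOrDigit(codePoint);
    }

    /** Replaces each matching code point with a single ASCII space. */
    static String spacePunctuation(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (isCjkPunctuation(cp)) {
                out.append(' ');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

Calling `CjkPunctuationSpacer.spacePunctuation(text)` and passing the result to the detector would give it whitespace at sentence and clause boundaries. Both blocks lie in the Basic Multilingual Plane, so each replaced character maps to exactly one space and character offsets in the result still line up with the original string. Note that this also replaces full-width characters such as the full-width full stop, which may not be what you want if they appear inside a URL.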

mattn commented 1 year ago

linkedin/URL-Detector is not well suited to detecting URLs in content that may contain multi-byte strings. As the following test case shows, it routinely matches Chinese/Japanese text.

https://github.com/linkedin/URL-Detector/blob/368c4e4481714a7f9c271515131cbf0282759006/url-detector/src/test/java/com/linkedin/urls/detection/TestUriDetection.java#L214