Invalid matching with Cyrillic TLDs

mvdan / xurls

Extract urls from text

BSD 3-Clause "New" or "Revised" License


Invalid matching with Cyrillic TLDs #32

Closed: bynov closed this issue 5 years ago

bynov commented 5 years ago

Hi! It seems there is a problem with Cyrillic TLDs. Here is an example:

echo "test.xyz" | xurls -r
test.xyz
echo "test.xyz test" | xurls -r
test.xyz
echo "test.бел" | xurls -r
test.бел
echo "test.бел test" | xurls -r 
<empty response>

If any characters, even whitespace, follow the Cyrillic domain, it no longer matches. I tried to track this down and suspect the cause is somewhere in this string, but I'm not sure:

webURL := hostName + port + `(/|/` + pathCont + `?|\b|(?m)$)`

Specifically, in the \b part. I tried using |\b|\B instead, but some tests failed.

Thanks!
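
A likely reason the |\b|\B attempt broke tests: at any position, either \b or \B holds, so the alternation always succeeds and the boundary check stops doing anything. The sketch below uses a hypothetical, heavily simplified pattern (not the real xurls expression) and assumes the failing tests include a case such as rejecting foo.comgarbage.

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified stand-ins for the real xurls pattern (hypothetical, for illustration only).
var (
	// Requires a word boundary after the TLD.
	withBoundary = regexp.MustCompile(`foo\.com\b`)
	// \b|\B holds at every position, so this behaves as if no check were present.
	withEither = regexp.MustCompile(`foo\.com(\b|\B)`)
)

func main() {
	for _, s := range []string{"foo.com bar", "foo.comgarbage"} {
		fmt.Printf("%q: \\b=%v, \\b|\\B=%v\n",
			s, withBoundary.MatchString(s), withEither.MatchString(s))
	}
}
```

With \b alone, only "foo.com bar" matches; with \b|\B, both inputs match, so the protection against trailing garbage is gone.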

mvdan commented 5 years ago

Thanks for reporting this bug. It seems to happen with all non-ASCII TLDs - I was able to reproduce it with Chinese and Korean ones. We never had a test specifically for the case where a TLD is directly followed by a space, so I think this never worked.

bynov commented 5 years ago

Thanks for the reply, @mvdan. So, could we fix it somehow? Maybe you have some ideas regarding this case. I want to help with that, but I'm weak at regexps, especially ones this heavy :) Any ideas would be very helpful.

mvdan commented 5 years ago

The issue is that \b (the word boundary) isn't Unicode-aware, so it treats Chinese and other non-ASCII scripts as non-word characters. I don't think there is a good fix here. The \b was used so that foo.comgarbage didn't match foo.com; we might have to lose that feature, or restrict it to ASCII TLDs only.
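
To illustrate, here is a rough sketch of the behaviour and of the "restrict it to ASCII TLDs only" option. Both patterns are hypothetical and heavily simplified; they are not the real xurls expression, and this is not necessarily how the project ended up handling it. Go's regexp package defines \b over ASCII word characters only, which is what the first pattern runs into.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical, heavily simplified TLD patterns; the real xurls expression is much larger.
var (
	// \b applied after every TLD, as in the reported behaviour.
	boundaryAll = regexp.MustCompile(`\w+\.(?:xyz|бел)\b`)
	// \b applied after the ASCII TLD only; the non-ASCII TLD gets no boundary check.
	boundaryASCII = regexp.MustCompile(`\w+\.(?:xyz\b|бел)`)
)

func main() {
	inputs := []string{
		"test.xyz test",
		"test.бел test",
		"test.xyzgarbage",
		"test.белgarbage",
	}
	for _, s := range inputs {
		fmt.Printf("%q: all=%q asciiOnly=%q\n",
			s, boundaryAll.FindString(s), boundaryASCII.FindString(s))
	}
}
```

In this sketch, boundaryAll finds nothing in "test.бел test" because neither "л" nor the space is an ASCII word character, so no boundary exists between them. boundaryASCII finds "test.бел" there and still rejects "test.xyzgarbage", but it also accepts "test.белgarbage", which is the trade-off mentioned above.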