Closed: bynov closed this issue 5 years ago
Thanks for reporting this bug. It seems to happen with all non-ASCII TLDs - I was able to reproduce it with Chinese and Korean. We never had a test specifically for this case, where a TLD is directly followed by a space, so I think this never worked.
Thanks for the reply, @mvdan. So, could we fix it somehow? Maybe you have some ideas regarding this case. I want to help with that, but I'm weak in regexps, especially heavy regexps like this one :) Any ideas would be very helpful.
The issue is that \b (the word boundary) isn't Unicode-aware, so it treats Chinese and other non-ASCII scripts as non-word characters. I don't think there is a good fix here. The \b was used so that foo.comgarbage wouldn't match as foo.com; we might have to lose that feature, or restrict it to ASCII TLDs only.
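To make the \b behaviour concrete, here is a minimal sketch using Go's standard regexp package, where \b is documented as an ASCII word boundary. The two patterns and the matches helper are hypothetical and heavily simplified for illustration; they are not the library's actual regexp.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical, simplified patterns: a single TLD followed by \b.
// These only illustrate the boundary check, not the real URL regexp.
var (
	asciiTLD    = regexp.MustCompile(`\.com\b`)
	cyrillicTLD = regexp.MustCompile(`\.рф\b`)
)

// matches reports whether re finds a match anywhere in s.
func matches(re *regexp.Regexp, s string) bool {
	return re.MatchString(s)
}

func main() {
	// 'm' is an ASCII word character and ' ' is not, so \b holds.
	fmt.Println(matches(asciiTLD, "foo.com ")) // true

	// 'm' and 'g' are both word characters, so \b fails.
	// This is the foo.comgarbage case that \b was meant to reject.
	fmt.Println(matches(asciiTLD, "foo.comgarbage")) // false

	// 'ф' is not an ASCII word character and neither is ' ',
	// so \b sees no boundary and the Cyrillic TLD is rejected.
	fmt.Println(matches(cyrillicTLD, "пример.рф ")) // false
}
```

The third case is the reported bug in miniature: since \b only knows ASCII word characters, there is no boundary between a Cyrillic letter and a space.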
Hi! It seems like there is a problem with Cyrillic TLDs. Here is an example:
If any character, even whitespace, follows a Cyrillic domain, it no longer matches. I tried to solve the issue and found that the cause may be in the regexp string, in the \b part, but I'm not sure. I tried to use |\b|\B instead, but some tests failed. Thanks!