Open frankdilo opened 1 year ago
This also applies to other examples like rohlík.cz or neovlivní.cz.
@frankdilo, @Olaf-: Unfortunately, those URLs are not valid according to the RFC.
RFC3986
host = IP-literal / IPv4address / reg-name
where
reg-name = *( unreserved / pct-encoded / sub-delims )
and from that
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
and from that and RFC2234
ALPHA = %x41-5A / %x61-7A ; A-Z / a-z
As you can see, a domain name can't contain non-ASCII characters (letters with accents, hooks, ...).
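To make the grammar above concrete, here is a minimal sketch of a check against the `reg-name` rule. The function name `is_rfc3986_reg_name` is made up for illustration and is not part of URLExtract. It also shows that the IDNA (punycode) form of the same host does pass, using Python's built-in `idna` codec.

```python
import re

# reg-name = *( unreserved / pct-encoded / sub-delims )   (RFC 3986)
_REG_NAME = re.compile(
    r"(?:[A-Za-z0-9\-._~]"      # unreserved
    r"|%[0-9A-Fa-f]{2}"         # pct-encoded
    r"|[!$&'()*+,;=])*"         # sub-delims
)


def is_rfc3986_reg_name(host: str) -> bool:
    """Return True if host consists only of characters allowed in reg-name."""
    return _REG_NAME.fullmatch(host) is not None


host = "rohlík.cz"
print(is_rfc3986_reg_name(host))        # the accented letter is outside ALPHA

# The built-in "idna" codec converts the host to its ASCII (punycode) form.
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host, is_rfc3986_reg_name(ascii_host))
```

So strictly by the RFC, only the `xn--` form is a valid host, even though people type and paste the Unicode form.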
I am open to discussion, but I would suggest a workaround: convert all characters to ASCII, then use URLExtract to find the URLs and their positions, and use those positions to extract the URLs from the original text.
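A rough sketch of that workaround, under some assumptions: every non-ASCII character is replaced by a single ASCII placeholder so that offsets stay the same, and the URL finder is passed in as a callable returning `(url, (start, end))` tuples. The names `mask_non_ascii`, `find_urls_in_original` and `ascii_finder` are hypothetical, and `ascii_finder` is only a tiny regex stand-in so the snippet runs without URLExtract installed.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]
Finder = Callable[[str], List[Tuple[str, Span]]]


def mask_non_ascii(text: str, placeholder: str = "x") -> str:
    """Replace each non-ASCII code point with one ASCII placeholder.

    The result has the same length as the input, so offsets stay valid.
    """
    return "".join(ch if ord(ch) < 128 else placeholder for ch in text)


def find_urls_in_original(text: str, finder: Finder) -> List[str]:
    """Run finder on the masked text, then slice the original text."""
    masked = mask_non_ascii(text)
    return [text[start:end] for _url, (start, end) in finder(masked)]


# Hypothetical stand-in for an ASCII-only URL finder (knows only .com and .cz).
_ASCII_URL = re.compile(
    r"(?:https?://)?[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.(?:com|cz)\b"
)


def ascii_finder(masked: str) -> List[Tuple[str, Span]]:
    return [(m.group(0), m.span()) for m in _ASCII_URL.finditer(masked)]


text = "Order at rohlík.cz or visit https://сайт.com today"
print(find_urls_in_original(text, ascii_finder))
```

With URLExtract itself, the finder could probably be something like `lambda s: URLExtract().find_urls(s, get_indices=True)`, assuming the installed version supports `get_indices`. Note that this masking only helps when the TLD is ASCII: a TLD such as `.рф` would be masked to `xx`, which is not a known TLD.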
Also applies to fully Cyrillic domains like сайт.рф (even if you prepend it with https://). Would be great to see it fixed.
E.g., twitter-text in Ruby handles this properly: https://github.com/twitter/twitter-text/blob/master/rb/lib/twitter-text/regex.rb#L257
URLExtract does not match this URL as it should: сайт.com