lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating TLDs.
MIT License

Support non-unicode hostname #153

Open frankdilo opened 1 year ago

frankdilo commented 1 year ago

URLExtract does not match this URL as it should: сайт.com

Olaf- commented 11 months ago

This also applies to other examples like rohlík.cz or neovlivní.cz.

lipoja commented 9 months ago

@frankdilo, @Olaf-: Unfortunately, those URLs are not valid according to RFC 3986.

RFC 3986:

host = IP-literal / IPv4address / reg-name

where

reg-name = *( unreserved / pct-encoded / sub-delims )

and from that

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

and from that and RFC 2234:

ALPHA = %x41-5A / %x61-7A ; A-Z / a-z

As you can see, a domain name can't contain non-ASCII characters (letters with accents, hooks, ...).
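To make the grammar argument concrete, here is a minimal sketch of the reg-name rule quoted above as a regular expression. The helper name `is_reg_name` is hypothetical and is not part of URLExtract. The `sub-delims` and `pct-encoded` character sets are taken from RFC 3986, since the quote above does not spell them out.

```python
import re

# reg-name = *( unreserved / pct-encoded / sub-delims )   (RFC 3986)
# unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
# pct-encoded = "%" HEXDIG HEXDIG
# sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
REG_NAME = re.compile(
    r"(?:[A-Za-z0-9\-._~]"      # unreserved (ALPHA is ASCII-only)
    r"|%[0-9A-Fa-f]{2}"         # pct-encoded
    r"|[!$&'()*+,;=])*"         # sub-delims
)


def is_reg_name(host: str) -> bool:
    """Return True if `host` matches the RFC 3986 reg-name rule.

    Hypothetical helper for illustration; not a URLExtract API.
    """
    return REG_NAME.fullmatch(host) is not None


for candidate in ("example.com", "сайт.com", "neovlivní.cz"):
    print(candidate, is_reg_name(candidate))
```

Under this rule, the hostnames reported in this issue are rejected because Cyrillic and accented letters fall outside `ALPHA`, while a percent-encoded form of the same bytes would be accepted.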

I am open to discussion, but I would suggest a workaround: convert all characters to ASCII, then use URLExtract to find the URLs and their positions, and finally extract the URLs from the original text using those positions.
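The workaround could be sketched roughly as follows. This is only one possible reading of it: each non-ASCII letter, digit, or combining mark is replaced with a single ASCII placeholder so that character offsets stay identical, and the spans found in the masked text are then used to slice the original. The names `mask_non_ascii`, `find_urls_via_ascii`, `urlextract_spans`, and `demo_spans` are hypothetical. `urlextract_spans` assumes that `URLExtract.find_urls(..., get_indices=True)` returns `(url, (start, end))` pairs. `demo_spans` is a toy regex finder that stands in for URLExtract so the sketch runs without the library.

```python
import re
import unicodedata
from typing import Callable, List, Tuple

Span = Tuple[int, int]


def mask_non_ascii(text: str, placeholder: str = "x") -> str:
    """Replace each non-ASCII letter, digit, or combining mark with one
    ASCII placeholder. One character in, one character out, so every
    offset in the masked text is also valid in the original text."""
    def needs_mask(ch: str) -> bool:
        if ord(ch) < 128:
            return False
        return ch.isalnum() or unicodedata.category(ch).startswith("M")

    return "".join(placeholder if needs_mask(ch) else ch for ch in text)


def find_urls_via_ascii(
    text: str, find_spans: Callable[[str], List[Span]]
) -> List[str]:
    """Find spans in the masked text, then slice the ORIGINAL text."""
    masked = mask_non_ascii(text)
    return [text[start:end] for start, end in find_spans(masked)]


def urlextract_spans(masked: str) -> List[Span]:
    """Span finder backed by URLExtract (third-party, imported lazily).
    Assumes get_indices=True yields (url, (start, end)) pairs."""
    from urlextract import URLExtract

    extractor = URLExtract()
    return [span for _url, span in extractor.find_urls(masked, get_indices=True)]


def demo_spans(masked: str) -> List[Span]:
    """Toy stand-in finder for demonstration only; knows just .com and .cz."""
    pattern = r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.(?:com|cz)\b"
    return [m.span() for m in re.finditer(pattern, masked)]


print(find_urls_via_ascii("Visit сайт.com or neovlivní.cz today", demo_spans))
```

One limitation of this sketch: a non-ASCII TLD such as the one in сайт.рф is masked to a placeholder string that is probably not in any TLD list, so a TLD-based extractor would still miss it.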

hwo411 commented 7 months ago

Also applies to fully Cyrillic domains like сайт.рф (even if you prepend it with https://). Would be great to see it fixed.

E.g., twitter-text in Ruby handles this properly: https://github.com/twitter/twitter-text/blob/master/rb/lib/twitter-text/regex.rb#L257