left walk does not stop on various unicode chars

lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text by locating TLDs.

MIT License


left walk does not stop on various unicode chars #121

Closed amoldavsky closed 5 months ago

amoldavsky commented 2 years ago

>>> from urlextract import URLExtract
>>> extractor = URLExtract()
>>> extractor.find_urls("You can also visit my website…IMINIT.MYAMBIT.COM")
['website…IMINIT.MYAMBIT.COM']
>>> extractor.find_urls("some%sIMINIT.MYAMBIT.COM" % chr(8231))
['some‧IMINIT.MYAMBIT.COM']

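Neither "…" nor chr(8231) ("‧") is a valid URL character, so the walk to the left of the TLD should stop at them instead of pulling the preceding text into the URL.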
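To confirm which characters are involved, a quick check with the standard library can help. This is a minimal sketch: `describe` is a hypothetical helper name, and it assumes the "…" in the first example is U+2026.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point, Unicode name, is-ASCII flag) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.isascii())


# Assumption: the "…" in the report is U+2026; chr(8231) is taken from the report.
for ch in ("\u2026", chr(8231)):
    print(describe(ch))
```

Both characters fall outside the ASCII range, which is why they should not be accepted as part of a URL.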

lipoja commented 5 months ago

Only ASCII characters are allowed to the left of the TLD. This case should be fixed in the next release.
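The ASCII-only rule can be illustrated with a small standalone sketch. This is not URLExtract's actual code: `ALLOWED_LEFT` and `walk_left` are hypothetical names, and the allowed set is an assumed, conservative list of ASCII URL characters.

```python
import string

# Assumption: ASCII letters, digits, and common URL punctuation are the only
# characters accepted to the left of the TLD. Illustrative, not the library's list.
ALLOWED_LEFT = frozenset(
    string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
)


def walk_left(text: str, tld_pos: int) -> int:
    """Scan left from tld_pos and return the index where the URL candidate starts."""
    start = tld_pos
    # Stop at the first character that is not in the allowed ASCII set
    # (whitespace, "…", "‧", or any other non-ASCII character).
    while start > 0 and text[start - 1] in ALLOWED_LEFT:
        start -= 1
    return start


text = "You can also visit my website\u2026IMINIT.MYAMBIT.COM"
tld_pos = text.rfind(".COM")
print(text[walk_left(text, tld_pos):])
```

With a check like this, the walk stops at "…" or "‧" and the candidate begins at IMINIT rather than at the preceding word.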