lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating TLDs.
MIT License

Bug with flag `allow_mixed_case_hostname=False` #151

Closed GokulNC closed 5 months ago

GokulNC commented 1 year ago

Thanks for this great library!

Example issue:

>>> import urlextract
>>> url_extractor = urlextract.URLExtract()
>>> url_extractor.allow_mixed_case_hostname = False
>>> url_extractor.find_urls("main_data_site.group.popular_data_desc")
['main_data_site.group.popular_data_desc']
>>> url_extractor.allow_mixed_case_hostname = True
>>> url_extractor.find_urls("main_data_site.group.popular_data_desc")
[]

Why are there such false positives with `allow_mixed_case_hostname=False`? @lipoja

GokulNC commented 1 year ago

Found an even worse false positive: @lipoja

>>> url_extractor.allow_mixed_case_hostname = False
>>> url_extractor.find_urls("144.2 MB")
['144.2']

lipoja commented 1 year ago

@GokulNC Thank you for reporting it. If you have time for a PR, that would be very welcome. Otherwise, please keep reporting broken things and wait a little for a fix from my side. I do not have much free time right now. My guess is that I can take a look in about a month :(

GokulNC commented 1 year ago

No problem, thanks :)

lipoja commented 5 months ago

Should be fixed in the next release.
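
For anyone stuck on a release without the fix, a caller-side filter can drop candidates like the two reported above. This is only a rough sketch and not part of URLExtract: `looks_like_hostname` and `filter_candidates` are hypothetical helper names, and the rules (no underscores in host labels, no purely numeric last label) are a heuristic assumption. Note that it also rejects bare IPv4 addresses.

```python
import re

# One host label: letters, digits, hyphens; no leading/trailing hyphen; no underscore.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def looks_like_hostname(candidate: str) -> bool:
    """Heuristic check (hypothetical helper, not a URLExtract API)."""
    # Drop an optional scheme such as "https://".
    host = re.sub(r"^[A-Za-z][A-Za-z0-9+.-]*://", "", candidate)
    # Drop path, query, and fragment.
    host = re.split(r"[/?#]", host, maxsplit=1)[0]
    # Drop userinfo ("user@") and port (":8080").
    host = host.rsplit("@", 1)[-1].split(":", 1)[0]

    labels = host.split(".")
    if len(labels) < 2:
        return False
    if not all(_LABEL.match(label) for label in labels):
        return False
    # A purely numeric last label ("144.2") is not a TLD.
    # Side effect: bare IPv4 addresses are rejected too.
    return not labels[-1].isdigit()


def filter_candidates(candidates):
    """Keep only the candidates whose host part passes the heuristic."""
    return [c for c in candidates if looks_like_hostname(c)]
```

It could be applied to the extractor output, e.g. `filter_candidates(url_extractor.find_urls(text))`.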