lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating TLDs.
MIT License
241 stars 61 forks

Handle upper-case false positives #143

Closed GokulNC closed 1 year ago

GokulNC commented 1 year ago

Thanks for this awesome library!

For example, for the string "S.No. 3", the substring "S.No." is getting matched as a URL. Is it possible to enforce the fact that domain names will never be in upper-case? (thereby avoiding false positives like above)

lipoja commented 1 year ago

@GokulNC Thank you, I am glad that you are using it.

We can discuss this problem. However, according to RFC3986:

    host = IP-literal / IPv4address / reg-name

where

    reg-name = *( unreserved / pct-encoded / sub-delims )

and from that

    unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

and from that and RFC2234

    ALPHA = %x41-5A / %x61-7A ; A-Z / a-z

I assume that domain names may contain upper-case letters.

Also RFC3986 says: The host subcomponent is case-insensitive. Therefore I assume that somebody can write EXAMPLE.COM and he/she might be expecting to match (extract) this domain.
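As a small illustration of that case-insensitivity, here is a sketch using only Python's standard library (not URLExtract). The URLs are made-up examples. `urlsplit(...).hostname` returns the host in lower case, while the path keeps its original case:

```python
from urllib.parse import urlsplit

upper = urlsplit("https://EXAMPLE.COM/Some/Path")
lower = urlsplit("https://example.com/Some/Path")

# The standard library lower-cases the host; the path keeps its case.
print(upper.hostname)                    # host of the upper-case URL
print(upper.hostname == lower.hostname)  # both spellings name the same host
print(upper.path)                        # path is left untouched
```

So a reader who writes EXAMPLE.COM is naming the same host as example.com, which is why a generic extractor may reasonably match both.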

Do you agree? Maybe I would go in your case with some pre-processing of text? That could help, right?

My opinion is that we cannot add such a specific implementation to this library. I am trying to keep it as generic as possible, and I do it by the book (the RFCs). I would expect users to know the text they are processing, so some special tweaks of the text might be needed.
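One possible form of such pre-processing, sketched with the standard library only: insert the missing space when a period sits directly between a lower-case and an upper-case letter. The helper name `add_sentence_spaces` and the sample sentence are hypothetical, and the rule itself is an assumption about the input text, not something URLExtract does.

```python
import re

# Assumption: a period squeezed between a lower-case letter and an
# upper-case letter is a sentence boundary with a missing space.
_MISSING_SPACE = re.compile(r"(?<=[a-z])\.(?=[A-Z])")


def add_sentence_spaces(text: str) -> str:
    """Insert a space after periods that look like sentence boundaries."""
    return _MISSING_SPACE.sub(". ", text)


sample = "We left the office.Then we wrote to example.com."
print(add_sentence_spaces(sample))
```

This leaves all-lower-case and all-upper-case hosts such as example.com and EXAMPLE.COM alone, but it would also split a genuinely mixed-case host like Example.Com, and it does not help with abbreviations such as "S.No.", so it only fits texts where those cases do not matter.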

GokulNC commented 1 year ago

Thanks for your response!

I came across this clarification regarding the above: RFC4343. Please check it out and let me know if you still think the same.

lipoja commented 1 year ago

@GokulNC Thank you for this RFC. I went through the document and I still stand by my opinion. RFC4343 is about DNS. If I am not wrong, a DNS server should treat domain names as case-insensitive. That means that in a DNS request the domain can appear in lower case, upper case, or a combination, and DNS should still return results.

I might not be correct; maybe I missed something. If that is the case, could you quote from the RFC here so we can discuss it?

Thank you.

GokulNC commented 1 year ago

Yes, you are right: domain names are treated case-insensitively, by lower-casing everything on the DNS server side.

One suggestion for this library if possible:
We can add a parameter called match_only_lowercase_domains, which can default to False as you suggested.
I believe a flag like this would give more flexibility to the users to avoid false positives like above.
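Until such a flag exists, the same effect can be approximated by filtering the extracted candidates afterwards. The following is a minimal sketch, assuming the candidates are plain strings such as those returned by find_urls. The helpers `host_of` and `only_lowercase_hosts` are hypothetical names, not part of URLExtract, and the host parsing is deliberately simple (it does not handle IPv6 literals).

```python
import re


def host_of(candidate: str) -> str:
    """Return the host part of a URL-like string (hypothetical helper)."""
    # Drop an optional scheme such as "https://".
    rest = re.sub(r"^[A-Za-z][A-Za-z0-9+.-]*://", "", candidate)
    # Cut at the first path, query, or fragment delimiter.
    authority = re.split(r"[/?#]", rest, maxsplit=1)[0]
    # Drop userinfo ("user@") and a trailing port (":8080").
    host = authority.rsplit("@", 1)[-1]
    return host.split(":", 1)[0]


def only_lowercase_hosts(candidates):
    """Keep candidates whose host contains no upper-case letters."""
    return [c for c in candidates if host_of(c) == host_of(c).lower()]


found = ["example.com", "S.No", "outside.In", "https://example.org/Path"]
print(only_lowercase_hosts(found))
```

Only the host is checked, so an upper-case letter in the path does not cause a candidate to be dropped.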

Thanks!

lipoja commented 1 year ago

I will keep this suggestion in mind. However, I want to help you. What about using urlextract.ignore_list?

from urlextract import URLExtract

urlextract = URLExtract()
# Hosts listed in ignore_list are left out of the results.
urlextract.ignore_list = {"s.no"}
print(urlextract.find_urls("random text example.com S.No. 3"))

outputs:

['example.com']

GokulNC commented 1 year ago

Thanks @lipoja! I was not aware of ignore_list.

But this will not work in general, since it is not possible to construct a list of all false positives involving such upper-case strings.

For example, consider this random example:

>>> urlextract.find_urls("I am sitting outside.In the middle of nowhere.My mind is lost in thoughts!")
['outside.In', 'nowhere.My']

lipoja commented 1 year ago

@GokulNC Alright then, it will be available in the next release. You can use urlextract.allow_mixed_case_hostname = False; it should do the trick.
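The option name suggests that hostnames mixing upper- and lower-case letters are rejected, while all-lower-case and all-upper-case hostnames are still matched. The following is a sketch of that rule under this assumption. The helper `has_mixed_case` is hypothetical and is not taken from the library's code.

```python
def has_mixed_case(host: str) -> bool:
    """Return True if the host contains both lower- and upper-case letters."""
    letters = [c for c in host if c.isalpha()]
    return any(c.islower() for c in letters) and any(c.isupper() for c in letters)


# Hosts taken from the examples in this thread.
for host in ["example.com", "EXAMPLE.COM", "S.No", "outside.In"]:
    print(host, has_mixed_case(host))
```

Under this reading, "S.No", "outside.In" and "nowhere.My" would be filtered out, while EXAMPLE.COM would still be extracted, in line with the RFC3986 argument earlier in the thread.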

lipoja commented 1 year ago

Released v1.8.0

GokulNC commented 1 year ago

That's awesome! Thanks a lot @lipoja :)