Open dbranger opened 1 year ago
Hi @dbranger,
Thank you for bringing this up!
>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> tldextract.extract('http://test.containers.internal/')
ExtractResult(subdomain='test.containers', domain='internal', suffix='')
>>> tldextract.extract('http://test.redhat.com.localdomain/')
ExtractResult(subdomain='test.redhat.com', domain='localdomain', suffix='')
>>> tldextract.extract('http://test.centos.pool.ntp.org.xxxlocal/')
ExtractResult(subdomain='test.centos.pool.ntp.org', domain='xxxlocal', suffix='')
>>> tldextract.extract('http://1.something.com.local')
ExtractResult(subdomain='1.something.com', domain='local', suffix='')
Switching to a library such as tldextract does not solve the issue, because it also relies on the Public Suffix List (PSL) to separate a URL's subdomain, domain, and public suffix. As the output above shows, a TLD that is not in the PSL yields an empty suffix. The workaround you mention, adding the TLD to the DAT files used by the Python script, works. So far, changes to the code and tests have not produced a common pattern that handles these cases.
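For discussion, here is a minimal sketch of the fallback requested in this issue: when no known suffix matches, treat the last label of the host as the TLD. It is not the URL Toolbox code. The names `KNOWN_SUFFIXES` and `split_host` are hypothetical, and the small hardcoded set only stands in for the suffixes loaded from the DAT files. It also makes no attempt to handle IP addresses or wildcard and exception rules from the PSL.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the suffixes loaded from the DAT files.
KNOWN_SUFFIXES = {"com", "org", "co.uk"}


def split_host(url, known_suffixes=KNOWN_SUFFIXES):
    """Return (subdomain, domain, suffix) for the host part of a URL.

    If no known suffix matches and the host has more than one label,
    fall back to treating the last label as the suffix.
    """
    host = (urlsplit(url).hostname or "").rstrip(".")
    labels = host.split(".") if host else []
    if not labels:
        return ("", "", "")

    # Look for the longest known suffix at the end of the host.
    suffix_len = 0
    for i in range(len(labels)):
        if ".".join(labels[i:]) in known_suffixes:
            suffix_len = len(labels) - i
            break

    # Fallback (assumption, not current behavior): last label is the TLD.
    if suffix_len == 0 and len(labels) > 1:
        suffix_len = 1

    cut = len(labels) - suffix_len
    rest = labels[:cut]
    suffix = ".".join(labels[cut:])
    domain = rest[-1] if rest else ""
    subdomain = ".".join(rest[:-1])
    return (subdomain, domain, suffix)
```

With this sketch, `split_host('http://test.containers.internal/')` returns `('test', 'containers', 'internal')`, while hosts with a known suffix such as `http://forums.news.cnn.com/` are split as before. Whether this fallback is safe for every data source is an open question.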
I'll keep this issue open and post an update if I make any changes. If you have any thoughts or suggestions, please let me know. Any help or feedback is greatly appreciated. Thank you!
I updated the repository at https://github.com/splunk/utbox/tree/utbox-PSL-update with the latest Public Suffix List from https://publicsuffix.org/list/.
Hi,
We encounter an issue when using URL Toolbox with subdomains that are not in the DAT lists used by the Python script.
It seems that the script truncates and merges the end of the URL instead of keeping the last string after a dot.
Here are some examples:
When we add the TLD to the DAT files used by the Python script for the lists, it works well. Nevertheless, we cannot add every conceivable case. As a result of this issue, correlation searches do not detect the correct values.
Would it be possible to update the Python script so that it keeps the correct values when it does not find the TLD in the DAT files? Or is there a reason for the current behavior?
We thank you in advance.
Best regards,
D.BRANGER