splunk / utbox

URL Toolbox (UTBox) is a set of building blocks for Splunk created specifically for URL manipulation. UTBox is designed to be modular, easy to use, and easy to deploy in any Splunk environment.
https://preview.splunkbase.splunk.com/app/2734/
Apache License 2.0

Issue with ut_* fields when tld is not in lists #7

Open dbranger opened 1 year ago

dbranger commented 1 year ago

Hi,

We encounter an issue when we use URL Toolbox with hostnames whose TLD is not in the DAT lists used by the Python script.

It seems that the script truncates and merges the end of the URL instead of keeping the last string after a dot.

Here are some examples:

When we add the TLD to the DAT files used by the Python script for the lists, it works well. Nevertheless, we cannot add every possible case. The impact of this issue is that correlation searches do not detect the correct values.

Would it be possible to update the Python script so that, when it does not find the TLD in the DAT files, it keeps the correct values? Or is there a reason for the current behavior?

We thank you in advance.

Best regards,

D.BRANGER

ggokdemir commented 1 month ago

Hi @dbranger,

Thank you for bringing this up!

>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> tldextract.extract('http://test.containers.internal/')
ExtractResult(subdomain='test.containers', domain='internal', suffix='')
>>> tldextract.extract('http://test.redhat.com.localdomain/')
ExtractResult(subdomain='test.redhat.com', domain='localdomain', suffix='')
>>> tldextract.extract('http://test.centos.pool.ntp.org.xxxlocal/')
ExtractResult(subdomain='test.centos.pool.ntp.org', domain='xxxlocal', suffix='')
>>> tldextract.extract('http://1.something.com.local')
ExtractResult(subdomain='1.something.com', domain='local', suffix='')

As the output above shows, using a library like tldextract, which relies on the Public Suffix List (PSL) to separate a URL's subdomain, domain, and public suffix, doesn't solve the issue: for a TLD that is not in the PSL, it returns an empty suffix and treats the last label as the domain. The workaround you mention, adding the TLD to the DAT files used by the Python script for the lists, works. Attempts to change the code and tests have not yet produced a common pattern that handles these cases.
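For discussion, here is a minimal sketch of the fallback requested in the original report: when no known suffix matches, keep the last label as the TLD. It does not reflect how UTBox is actually implemented. The function name `split_host` and the `known_suffixes` set are hypothetical and stand in for whatever the DAT lists provide; wildcard and exception rules are ignored.

```python
def split_host(host, known_suffixes):
    """Split a hostname into (subdomain, domain, suffix).

    known_suffixes is a hypothetical set of lowercase suffixes,
    for example {"com", "co.uk"}. If none of them matches, the
    last label is used as the suffix (the fallback under discussion).
    """
    labels = host.lower().strip(".").split(".")

    # Look for the longest known suffix by trying the longest tail first.
    suffix_len = 0
    for i in range(len(labels)):
        if ".".join(labels[i:]) in known_suffixes:
            suffix_len = len(labels) - i
            break

    # Fallback: no known suffix, so treat the last label as the TLD.
    # A single-label host (e.g. "localhost") keeps an empty suffix.
    if suffix_len == 0 and len(labels) > 1:
        suffix_len = 1

    cut = len(labels) - suffix_len
    suffix = ".".join(labels[cut:]) if suffix_len else ""
    rest = labels[:cut]
    domain = rest[-1] if rest else ""
    subdomain = ".".join(rest[:-1])
    return subdomain, domain, suffix


print(split_host("forums.news.cnn.com", {"com"}))
print(split_host("test.redhat.com.localdomain", {"com"}))
```

With this fallback, `test.redhat.com.localdomain` splits into subdomain `test.redhat`, domain `com`, and suffix `localdomain`. Whether that is the "correct value" for every internal naming scheme is exactly the open question, since a name such as `1.something.com.local` would report `com` as its domain.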

I’ll keep this open. Please let me know if you have any thoughts or suggestions. I’d greatly appreciate any help or feedback! Thank you! I'll keep you posted if I make any changes.

ggokdemir commented 1 month ago

I updated the repository at https://github.com/splunk/utbox/tree/utbox-PSL-update with the latest Public Suffix List from https://publicsuffix.org/list/.
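For anyone who wants to inspect the list itself, the following is a rough sketch of reading the Public Suffix List text format: `//` starts a comment, each rule ends at the first whitespace, `*.` marks a wildcard rule, and `!` marks an exception. The function name `parse_psl` is hypothetical, and this thread does not show how UTBox turns the list into its DAT files, so treat this only as an illustration of the file format.

```python
def parse_psl(text):
    """Parse Public Suffix List text into (rules, wildcards, exceptions)."""
    rules, wildcards, exceptions = set(), set(), set()
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and // comments.
        if not line or line.startswith("//"):
            continue
        # A rule is only read up to the first whitespace.
        rule = line.split()[0].lower()
        if rule.startswith("!"):
            exceptions.add(rule[1:])   # e.g. "!www.ck" -> "www.ck"
        elif rule.startswith("*."):
            wildcards.add(rule[2:])    # e.g. "*.ck" -> "ck"
        else:
            rules.add(rule)
    return rules, wildcards, exceptions


sample = """
// ===BEGIN ICANN DOMAINS===
com
co.uk
*.ck
!www.ck
"""
rules, wildcards, exceptions = parse_psl(sample)
print(sorted(rules), sorted(wildcards), sorted(exceptions))
```

Internal TLDs such as `local` or `internal` will still be absent after a list update like this, because the PSL only covers public suffixes.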