splunk / utbox

URL Toolbox (UTBox) is a set of building blocks for Splunk specially created for URL manipulation. UTBox has been created to be modular, easy to use and easy to deploy in any Splunk environments.
https://preview.splunkbase.splunk.com/app/2734/
Apache License 2.0

Weirdness parsing dyndns.org #3

Closed pirxthepilot closed 2 years ago

pirxthepilot commented 2 years ago

Seeing some weird behavior when parsing dyndns.org FQDNs with ut_parse_extended and the mozilla list.

Expected Behavior:

With an FQDN like foo.bar.google.com, it correctly shows the tld, domain, subdomain and number of subdomain elements.

| makeresults
| eval url="foo.bar.google.com"
| eval list="mozilla"
| `ut_parse_extended(url, list)`
| table list url ut_tld ut_domain ut_subdomain ut_subdomain_count


Issue:

Parsing foo.bar.dyndns.org, ut seems to think that the TLD is dyndns.org and the domain is bar.dyndns.org.

| makeresults
| eval url="foo.bar.dyndns.org"
| eval list="mozilla"
| `ut_parse_extended(url, list)`
| table list url ut_tld ut_domain ut_subdomain ut_subdomain_count


Even weirder, with foo.go.dyndns.org, ut parses go.dyndns.org as the TLD, and foo.go.dyndns.org as the domain.

| makeresults
| eval url="foo.go.dyndns.org"
| eval list="mozilla"
| `ut_parse_extended(url, list)`
| table list url ut_tld ut_domain ut_subdomain ut_subdomain_count


pirxthepilot commented 2 years ago

From initial digging, it looks like both dyndns.org and go.dyndns.org are in Mozilla's public suffix list file. Having discovered that list, I'm now not sure why so many of its entries don't look like TLDs. Am I missing something?

dfederschmidt commented 2 years ago

Hi @pirxthepilot - Thanks for reaching out!

You're right - dyndns.org as well as go.dyndns.org are in the Mozilla public suffix list file. This list is community-curated and tracks effective TLDs, meaning domains under which multiple parties unaffiliated with the domain's operator may register subdomains [1].

In this specific case, it seems like dyn allows their customers to register their own subdomains. They act as the registrar for that domain.

So in the context of the list, the result seems correct. Of course, the IANA list, which is used by default, would extract org as ut_tld. However, the IANA list only contains TLDs with no dots (e.g. co.uk is not included). In a technical sense, there seems to be no inherent difference between the registrar of co.uk and that of dyndns.org: both allow registrations below their domain.
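
To make the difference between the two lists concrete, here is a minimal Python sketch of longest-suffix matching. It is not UTBox's actual implementation: the function name `split_host` and the two suffix sets are hypothetical, and the sets are tiny hand-picked excerpts rather than the real IANA or Mozilla lists. Wildcard and exception rules from the real public suffix list are ignored.

```python
# Toy suffix sets: small hand-picked excerpts, NOT the real lists (assumption).
IANA_LIKE = {"com", "org", "uk"}
MOZILLA_LIKE = IANA_LIKE | {"co.uk", "dyndns.org", "go.dyndns.org"}


def split_host(host, suffixes):
    """Split host into (tld, domain, subdomain) using the longest matching suffix."""
    labels = host.lower().strip(".").split(".")
    # Try the longest candidate first so "go.dyndns.org" wins over "dyndns.org".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            tld = candidate
            # The domain is the matched suffix plus one more label, if there is one.
            domain = ".".join(labels[i - 1:]) if i > 0 else ""
            # Everything left of the domain is the subdomain.
            subdomain = ".".join(labels[:i - 1]) if i > 1 else ""
            return tld, domain, subdomain
    # No suffix matched at all.
    return "", "", ""


for host in ("foo.bar.google.com", "foo.bar.dyndns.org", "foo.go.dyndns.org"):
    print(host, split_host(host, IANA_LIKE), split_host(host, MOZILLA_LIKE))
```

With the Mozilla-like set, the longest match for foo.go.dyndns.org is go.dyndns.org, which leaves only one label to its left, so the whole FQDN becomes the domain and the subdomain is empty. With the IANA-like set the same host splits into org, dyndns.org and foo.go.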

Does this make sense? Please let me know your thoughts. Of course, you could take advantage of the custom list feature and provide your own curated list.

[1] https://www.icann.org/en/system/files/files/octo-011-18may20-en.pdf

pirxthepilot commented 2 years ago

Hi @dfederschmidt , thanks for providing context, much appreciated! I think I understand now. Having thought about it more I think the Mozilla public list vs the "intuitive" behavior (e.g. dyndns.org as a domain) are both valid, depending on the context of the search.

That said, it can be confusing to a Splunk user who is not aware of effective TLDs and may therefore write their searches incorrectly (at least for the domains included in the Mozilla list). I think (but could be wrong) most folks would expect similar results for www.wikipedia.org and www.dyndns.org, in that ut_tld is org and ut_domain is wikipedia.org and dyndns.org respectively. (The IANA list would have been nice, but like you said, it only supports one level.)

Curious to know what you think as well! Would it be worth having a separate list that does not include effective TLDs?

Also, if we were to use our own list, do we need to fork this app and install our own fork, or can the list be maintained in Splunk e.g. with a lookup table?

Thanks!

dfederschmidt commented 2 years ago

Hi @pirxthepilot

I understand that this behaviour may look unintuitive. But actually, the default for ut_parse_extended is the IANA list, which contains only one level of actual TLDs and should be "sane". The app's documentation for this command states the difference and the implications of switching to the mozilla list.

> Curious to know what you think as well! Would it be worth having a separate list that does not include effective TLDs?

The IANA list is actually the list without effective TLDs. Domains like gov.uk or co.uk are effective TLDs. Technically, the .uk is the proper country-code TLD, even though most people probably want co.uk extracted.

Using your own list would mean forking the app and editing /bin/suffix_list_custom.dat. I agree that this is not ideal, especially in cloud environments, and a configuration mechanism via lookup would be more convenient. I'll add this as a potential enhancement in a dedicated issue.
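
As a rough illustration of what a curated list could look like, here is a small Python sketch that loads suffixes from plain text. The thread does not describe the format of suffix_list_custom.dat, so this assumes a hypothetical one-suffix-per-line layout with `//` comment lines, similar to how the public suffix list is laid out. The function name `load_suffixes` and the sample entries are made up for the example.

```python
def load_suffixes(text):
    """Return the set of suffixes from one-per-line text, skipping blanks and // comments."""
    suffixes = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue
        # Keep only the first whitespace-separated token, lowercased.
        suffixes.add(line.split()[0].lower())
    return suffixes


# Hypothetical curated list: ordinary TLDs plus co.uk, but no dyndns.org entries.
CUSTOM_TEXT = """
// hypothetical curated list (format is an assumption)
com
org
uk
co.uk
"""

custom = load_suffixes(CUSTOM_TEXT)
print(sorted(custom))
```

A list like this would keep co.uk as an effective TLD while treating dyndns.org as an ordinary domain under org, which is the "intuitive" behavior discussed above.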

dfederschmidt commented 2 years ago

Added #5 - feel free to add your thoughts to the issue.

pirxthepilot commented 2 years ago

@dfederschmidt sorry for the delayed response. Your explanation made a lot of sense, and I think it's just a matter of expectation and use case as to what the user wants to accomplish. For that matter, having an easily configurable custom list would be really useful, so thanks for opening #5!

EDIT: Pulled a request to include a new custom list that ships with utbox. I think this will just cause additional confusion and we're better off having custom lists through #5. Will just go ahead and close this issue. Thanks!