hakluke / haktldextract

Extract domains/subdomains from URLs en masse

Cannot extract domain from amazon #1

Closed by roarnyg 3 years ago

roarnyg commented 3 years ago

Looks like haktldextract struggles with amazonaws.com domains. See this example:

$ echo flyeagle-awstest-1895000007.us-east-1.elb.amazonaws.com | haktldextract
flyeagle-awstest-1895000007.us-east-1.elb.amazonaws.com

and

$ echo elb.amazonaws.com | haktldextract 
.

Only a dot was returned.

Are there any fixes for this?

roarnyg commented 3 years ago

I found similar failures for these domains, too:

cb25b8a2-b691-4fcb-ba15-078b134d7d6b.cloudapp.net
d6uz09z33tnrv.cloudfront.net
dts-eastus-qa.azurewebsites.net

I really like your tool, and hope there will be a fix.

hakluke commented 3 years ago

That is so bizarre... I'll look into it - thanks @roarnyg

hakluke commented 3 years ago

Aaaah, I have found the problem - the tldextract library treats these domains as TLDs for some reason. I'll see if I can figure out a fix.

hakluke commented 3 years ago

Hey - I had to fork the tldextract library and change it to use the IANA TLD data (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) instead of the public suffix data. The public suffix list now includes a bunch of entries that aren't strictly TLDs, including root domains of cloud providers. It should all be fixed now - just update using go get -u github.com/hakluke/haktldextract.
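
To illustrate the idea, here is a rough sketch of how matching against single-label IANA TLDs behaves. It is not the code from the fork: `rootDomain` and `ianaTLDs` are hypothetical names, and the TLD set is a tiny hand-picked sample rather than the full IANA file.

```go
package main

import (
	"fmt"
	"strings"
)

// ianaTLDs is a small sample standing in for the full IANA list
// (assumption: a real tool would load the complete file instead).
var ianaTLDs = map[string]bool{"com": true, "net": true, "org": true}

// rootDomain is a hypothetical helper, not the forked library's API.
// It returns the label directly left of the TLD plus the TLD itself,
// or false if the last label is not a known TLD.
func rootDomain(host string, tlds map[string]bool) (string, bool) {
	host = strings.ToLower(strings.TrimSuffix(strings.TrimSpace(host), "."))
	labels := strings.Split(host, ".")
	if len(labels) < 2 {
		return "", false
	}
	tld := labels[len(labels)-1]
	sld := labels[len(labels)-2]
	if !tlds[tld] || sld == "" {
		return "", false
	}
	return sld + "." + tld, true
}

func main() {
	hosts := []string{
		"flyeagle-awstest-1895000007.us-east-1.elb.amazonaws.com",
		"cb25b8a2-b691-4fcb-ba15-078b134d7d6b.cloudapp.net",
		"d6uz09z33tnrv.cloudfront.net",
		"dts-eastus-qa.azurewebsites.net",
	}
	for _, h := range hosts {
		if root, ok := rootDomain(h, ianaTLDs); ok {
			fmt.Println(root)
		}
	}
}
```

Because the IANA file only contains single-label TLDs, a sketch like this treats every host as "one label plus TLD", so cloud provider hosts collapse to amazonaws.com, cloudapp.net and so on. The trade-off is that multi-label suffixes such as co.uk are no longer recognised as a unit.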

hakluke commented 3 years ago

Test output:

~$ echo flyeagle-awstest-1895000007.us-east-1.elb.amazonaws.com | haktldextract
amazonaws.com
~$ echo "cb25b8a2-b691-4fcb-ba15-078b134d7d6b.cloudapp.net
d6uz09z33tnrv.cloudfront.net
dts-eastus-qa.azurewebsites.net" | haktldextract
cloudapp.net
azurewebsites.net
cloudfront.net
~$

roarnyg commented 3 years ago

Thanks for replying and fixing this so fast. Good work! I ran the go get -u github.com/hakluke/haktldextract command, but I still get the same results as before. Maybe I have to restart or flush DNS. I'll try that later.

hakluke commented 3 years ago

Hey again, sorry, I forgot an important step: you will also need to rm /tmp/tld.cache ... The filename may be slightly different, I can't quite remember!