jpillora / go-tld

TLD Parser in Go
MIT License

grep -v blogspot #1

Closed willglynn closed 9 years ago

willglynn commented 9 years ago

I'm familiar with this list, and I got confused when I found this in the build process: … | grep -v blogspot

The upstream list is a list of suffixes suitable for use in cookie logic. evil.com shouldn't be able to set cookies for com (thus reaching yourbank.com). Similarly, evilsite.co.uk shouldn't be able to set cookies for co.uk. This list exists because it isn't at all obvious which suffixes represent boundaries and which don't.

For example, .us used to be government-only, delegated to states and then to certain state entities, despite now being open for public registrations. You don't want evil.k12.ak.us to be able to set cookies for good.k12.ak.us, since they're administratively distinct:

// The registrar notes several more specific domains available in each state,
// such as state.*.us, dst.*.us, etc., but resolution of these is somewhat
// haphazard; in some states these domains resolve as addresses, while in others
// only subdomains are available, or even nothing at all. We include the
// most common ones where it's clear that different sites are different
// entities.
k12.ak.us
k12.al.us
k12.ar.us
k12.as.us
k12.az.us
k12.ca.us
k12.co.us
k12.ct.us
k12.dc.us
k12.de.us
k12.fl.us
k12.ga.us
k12.gu.us

…but there are exceptions:

// k12.hi.us  Bug 614565 - Hawaii has a state-wide DOE login

Similarly, a.com shouldn't be able to set cookies for b.com – but even there, deeper divisions are relevant to trust:

// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===

// Amazon CloudFront : https://aws.amazon.com/cloudfront/
// Submitted by Donavan Miller <donavanm@amazon.com> 2013-03-22
cloudfront.net

// Amazon Elastic Compute Cloud: https://aws.amazon.com/ec2/
// Submitted by Osman Surkatty <osmans@amazon.com> 2014-12-16
ap-northeast-1.compute.amazonaws.com
…

// Amazon Elastic Beanstalk : https://aws.amazon.com/elasticbeanstalk/
// Submitted by Adam Stein <astein@amazon.com> 2013-04-02
elasticbeanstalk.com

// Amazon Elastic Load Balancing : https://aws.amazon.com/elasticloadbalancing/
// Submitted by Scott Vidmar <svidmar@amazon.com> 2013-03-27
elb.amazonaws.com

// Amazon S3 : https://aws.amazon.com/s3/
// Submitted by Courtney Eckhardt <coec@amazon.com> 2013-03-22
…

// GitHub, Inc.
// Submitted by Ben Toews <btoews@github.com> 2014-02-06
github.io
githubusercontent.com

…

// Heroku : https://www.heroku.com/
// Submitted by Tom Maher <tmaher@heroku.com> 2013-05-02
herokuapp.com
herokussl.com

…and yes, this list includes blogspot.com and friends. Why is Blogspot singled out for removal but all the other private domains retained?

The build process also runs it through grep "^[a-z]", which removes the ICANN-assigned IDN TLDs, which also seems wrong.
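The cookie boundary described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the `suffixes` table is a tiny hand-picked sample rather than the real list, `publicSuffix` and `canSetCookie` are hypothetical helper names, and wildcard and exception rules are ignored entirely.

```go
package main

import (
	"fmt"
	"strings"
)

// suffixes is a tiny hand-picked sample for illustration, NOT the real list.
var suffixes = map[string]bool{
	"com":       true,
	"uk":        true,
	"co.uk":     true,
	"us":        true,
	"ak.us":     true,
	"k12.ak.us": true,
}

// publicSuffix returns the longest entry in suffixes that host ends with,
// falling back to the last label when nothing matches.
func publicSuffix(host string) string {
	labels := strings.Split(host, ".")
	for i := range labels {
		candidate := strings.Join(labels[i:], ".")
		if suffixes[candidate] {
			return candidate
		}
	}
	return labels[len(labels)-1]
}

// canSetCookie reports whether host may scope a cookie to cookieDomain.
// cookieDomain must be host itself or a parent of host, and it must sit
// strictly below host's public suffix.
func canSetCookie(host, cookieDomain string) bool {
	if host != cookieDomain && !strings.HasSuffix(host, "."+cookieDomain) {
		return false
	}
	// Both values are label-aligned suffixes of host, so comparing lengths
	// is enough to tell whether cookieDomain is below the public suffix.
	return len(cookieDomain) > len(publicSuffix(host))
}

func main() {
	fmt.Println(canSetCookie("evil.com", "com"))                     // false
	fmt.Println(canSetCookie("evilsite.co.uk", "co.uk"))             // false
	fmt.Println(canSetCookie("evil.k12.ak.us", "k12.ak.us"))         // false
	fmt.Println(canSetCookie("www.evilsite.co.uk", "evilsite.co.uk")) // true
}
```

Dropping an entry such as k12.ak.us from the table would flip the third result to true, which is why filtering entries out of the list changes behaviour rather than just shrinking the data.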

jpillora commented 9 years ago

Ah whoops, that's a bug. I removed those two sets while debugging the binary search; it should be changed to just grep out empty lines and comments. Since I'm just using it for TLDs, do you know of a more concise version which doesn't include the cookie-specific special cases?
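For reference, the "just drop empty lines and comments" filter could look roughly like this in Go. It is a sketch, not the project's actual build step: `parseRules` is a hypothetical name, and the sample input is a made-up excerpt in the style of the list quoted earlier in this thread.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseRules keeps every rule line from a public-suffix-style list and
// drops only blank lines and // comments. Unlike grep "^[a-z]", it keeps
// non-ASCII (IDN) entries, and unlike grep -v blogspot, it keeps private ones.
func parseRules(list string) []string {
	var rules []string
	sc := bufio.NewScanner(strings.NewReader(list))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "//") {
			continue
		}
		rules = append(rules, line)
	}
	return rules
}

func main() {
	// Illustrative sample input, not the real list.
	sample := "// a comment\ncom\n\nрф\n// k12.hi.us  Bug 614565\nk12.ak.us\nblogspot.com\n"
	for _, rule := range parseRules(sample) {
		fmt.Println(rule)
	}
}
```

Run against the sample, this prints com, рф, k12.ak.us and blogspot.com, one per line; the commented-out k12.hi.us line is skipped along with the other comment.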

willglynn commented 9 years ago

Thing is, all this depends on how you define "TLD". The literal interpretation is "top-level domain", which in the case of google.co.uk is uk. If you want co.uk to be returned instead, you need to loosen that definition.

So: what does "TLD" mean to you?

jpillora commented 9 years ago

I guess I always thought the domain name you purchase is suffixed by the TLD - which is what I mean by TLD.

My use case is determining whether a supplied domain is the root (not sure if that's the right word now lol). For example, is foo.uk.com the root level? No. Is foo.co.uk? Yes.
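The root-level check described here could be sketched as "the domain is exactly one label longer than its suffix". The code below is a minimal illustration assuming a hypothetical, hand-picked `icannSuffixes` table; `suffixLabels` and `isRoot` are made-up names, and the answers depend entirely on which entries the table contains.

```go
package main

import (
	"fmt"
	"strings"
)

// icannSuffixes is a hypothetical hand-picked sample; a real program
// would load the full list instead.
var icannSuffixes = map[string]bool{
	"com":   true,
	"uk":    true,
	"co.uk": true,
}

// suffixLabels returns how many trailing labels form the longest
// matching suffix, defaulting to 1 (the last label) when nothing matches.
func suffixLabels(labels []string) int {
	for i := range labels {
		if icannSuffixes[strings.Join(labels[i:], ".")] {
			return len(labels) - i
		}
	}
	return 1
}

// isRoot reports whether domain sits exactly one label below its suffix.
func isRoot(domain string) bool {
	labels := strings.Split(strings.ToLower(domain), ".")
	return len(labels) == suffixLabels(labels)+1
}

func main() {
	fmt.Println(isRoot("foo.co.uk"))  // true
	fmt.Println(isRoot("foo.uk.com")) // false
	fmt.Println(isRoot("uk.com"))     // true
}
```

With this table, uk.com is treated as an ordinary registration under com, so foo.uk.com is one level too deep to count as a root.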

So the entity which runs .uk could allow people to register .anything.uk domains, except they don't?

willglynn commented 9 years ago

So the entity which runs .uk could allow people to register .anything.uk domains, except they don't?

Exactly. .uk is the top-level domain, and registrations within it are handled by policy, not by protocol.

.us is a good case study here. It used to be that .us contained .ak.us through .wy.us, and you could only register domains there, e.g. haines.ak.us or laramie.wy.us. By the same logic as co.uk, you'd probably want to return ak.us and wy.us for those domains – despite the fact that a more recent policy also permits direct registrations like somedomain.us, which should return us.

Going back to .uk: business entities normally register domains within .co.uk, yes. This is because the registry for the .uk TLD defines .co.uk and other second-level domains, and forbids direct registrations within .uk. It's your intention that google.co.uk should return co.uk, which makes sense. However, gov.uk defines policies for registering *.service.gov.uk. If .gov.uk counts as a "TLD" because it's delegated by the .uk registry, then service.gov.uk should also count as a TLD, because it's delegated by the gov.uk registry.

This is the logic behind the private entries in that list: anyone can heroku create something, giving them something.herokuapp.com. Creating a Heroku app assigns you that slice of DNS, just like creating an S3 bucket named foo assigns you foo.s3.amazonaws.com, just like signing up for GitHub gives you jpillora.github.io. You can serve whatever HTML you want from any of those, just like if you bought a domain inside .com, .co.uk, .us, or .wy.us.
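To make the effect of the private section concrete, here is a small Go sketch in which each entry is tagged as ICANN or private. The `rules` table is a hand-picked sample, and the `suffix` function with its `includePrivate` flag is hypothetical, introduced only for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// rules maps a suffix to true when it comes from the ICANN section and
// false when it comes from the private section. Hand-picked sample only.
var rules = map[string]bool{
	"com":           true,
	"io":            true,
	"uk":            true,
	"co.uk":         true,
	"github.io":     false,
	"herokuapp.com": false,
}

// suffix returns the longest matching suffix of host. When includePrivate
// is false, private-section entries are skipped.
func suffix(host string, includePrivate bool) string {
	labels := strings.Split(host, ".")
	for i := range labels {
		candidate := strings.Join(labels[i:], ".")
		if icann, ok := rules[candidate]; ok && (icann || includePrivate) {
			return candidate
		}
	}
	return labels[len(labels)-1]
}

func main() {
	fmt.Println(suffix("jpillora.github.io", true))       // github.io
	fmt.Println(suffix("jpillora.github.io", false))      // io
	fmt.Println(suffix("something.herokuapp.com", true))  // herokuapp.com
	fmt.Println(suffix("something.herokuapp.com", false)) // com
}
```

Whether the private entries belong in the answer depends on the question being asked: for cookie-style isolation they matter, while for "what did someone register with a registry" the ICANN-only result may be the one wanted.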

As far as "what people can buy", note that money can also buy literal top-level domains, e.g. google – which is live, though http://google/ does not yet send users to a search page.

jpillora commented 9 years ago

Hey Will, sorry for the delay – posted this late last night. I just noticed that golang.org/x has https://godoc.org/golang.org/x/net/publicsuffix, so I should just use that, and I'm now directing people there from the README as well. Thanks for the write-up though; there's much more to domain names than I thought. Very helpful :)