publicsuffix / list

The Public Suffix List
https://publicsuffix.org/
Mozilla Public License 2.0

Better document the PSL roadmap and needs #671

Open sleevi opened 6 years ago

sleevi commented 6 years ago

If we look at where things historically were, the PSL emerged from three distinct needs:

  1. Registries, particularly ccTLD registries, had registration policies and designs different from how gTLDs were operated. The best example of this is .co.uk, which separated out the .uk namespace into a set of 2LD groupings that were organized similarly to how the gTLD namespace was organized at the time: com / net / org translated to .co.uk / .net.uk / .org.uk.
  2. Registered domains that themselves served as domain registries. This included a spectrum of participants - looking at one of the earlier versions (circa 2010) shows this included Registrars that explicitly registered domains underneath their hierarchy (e.g. CentralNic with ar.com or gb.com - acting as ccTLDs within the .com space, .gb.net in .net, etc.).
  3. Hosting providers - The version from 2010 only had three, AFAICT - operaunite.com, appspot.com, and blogspot.com. This was a rather late addition - two of these were only added in 2010, operaunite.com slightly before then.

The registry data was almost entirely reported by the PSL maintainers, chasing down registry operators. Registered domains acting as domain registries was largely due to CentralNic, a popular Registrar that also operated or partnered with several ccTLDs, and thus the data was incidentally picked up. The third case - hosting providers - was not really imagined in the PSL's creation, although it's come to dominate the number of changes to the PSL today.

The PSL has had some growing pains along the way - the opening of the gTLD space by ICANN meant that self-maintaining registry data was no longer an operation that could be done by the PSL maintainers alone, because the sheer number of new registries prevented effective and ongoing maintenance of that data. Registries started to be added by script, and the manual curation of existing records ceased to receive much dedicated time.

A number of dynamic DNS providers were added, which are in a similar-but-not-identical case to the second - there's generally no WHOIS service being provided, and registration policies are a bit ad-hoc, but both are aligned in that they provide vanity suffixes for registrants.

The growth of Internet services (and the centralization onto common platforms) has driven a significant amount of churn in the third case. New providers come up and old providers wither away, and the maintenance of that list is done almost exclusively based on self-reporting, with some basic automation before addition (the TXT records), since it's no longer possible to scale the investigative analysis that every PSL change previously got.

As the PSL itself has grown, consumers have had to dramatically alter how they consume the list - filtering out some use cases (such as the third), pushing for more information to be included for the first two use cases, or even rewriting the data structures used, going from static lists to hash lists to tries (compressed or full). Each big growth spurt of the PSL has forced some change for consumers.

Similarly, the adoption of the third case has increased the rate of change in the PSL. While previously the first case could be largely met by a static list updated annually, supporting the second and third cases means that changes on the order of days are at times necessary for consumers, as otherwise domain holders can't use certain features or they don't work correctly.

The PSL is thus at an inflection point - supporting all of these use cases means that its pace of change and its growth rate are no longer sustainable for the use cases and consumers it supports, and every new use of it brings greater overall risk into the ecosystem.

We thus need to figure out a roadmap for how the PSL will be maintained and scale, what use cases it will consider and not consider, and if and how to wean existing consumers off it, in the search for better solutions.

pzb commented 6 years ago

In addition to your list of issues, I would add that the failure to rely upon * as the default rule has caused numerous issues. With the gTLD program, full TLDs are coming and going way more often than once a year. Because the PSL has included all TLDs, even if they are simply duplicative of the * rule, it is being used by programs in lieu of the root zone file to get a TLD list.

dnsguru commented 4 years ago

@sleevi @weppos I haven't a sense that we have gotten anywhere on this (likely due to #dayjobs), but I have made some great headway with respect to how we engage with other entities in the ICANN and domain space.

I have worked with the ICANN Office of the CTO team on helping create a document to be distributed within the IANA to ccTLD and gTLD administrators to help elevate their awareness of the PSL, and we'll be presenting this at the ICANN 67 meeting in Cancun, Mexico in March of 2020.

I believe that this will help improve the quality of the requests that come in to the ICANN section at the top of the file.

Where I think we're suffering is the PRIVATE section and the increasing volume of requests that are hitting us. We'd proposed splitting the file at that horizon, and I think it is a good idea.

IF we did that, we need to prepare people for it. It seems to me the place that would get folks to notice would be in the header sections or adding a new comment line or two in the file itself.

sleevi commented 4 years ago

Jothan: It might be useful to focus on the problem you’d like to solve, rather than the solution.

Both as a maintainer and as a consumer, I don’t believe there is any benefit to be had at all from splitting the file, and that it would do more harm than good. That said, I’m probably missing something important that you’re concerned about, and so I’d want to make sure we got that documented, before discussing a solution.

It would probably be good to open up as a separate issue for the specific problem(s) you see, which we can reference here, so that the roadmap solution is “Solve Problem X” rather than “Do Thing Y”

peterthomassen commented 4 years ago

I suppose it would be better to not split the list, unless there is a demand by those who want to treat the sections differently (there is a CA use case given on the PSL web site). If there is no such demand, why bother?

dnsguru commented 4 years ago

Jothan: It might be useful to focus on the problem you’d like to solve, rather than the solution.

I'll back off on the idea... that's just a bias for results within me fighting to help this project thrive. @sleevi sounds like maybe splitting the file would not be something we would place in the roadmap. I had seen the scaling issue represented as a design concern, and candidly the PUBLIC section seems like it is where the majority of the expansion is occurring. On the one hand I see expressed that the file size is increasing, and as I review the PRs / Issues, the majority seem to be focused in the PRIVATE section. If this is less of an issue than I see, I really don't have a hill to defend or die on for this.

Both as a maintainer and as a consumer, I don’t believe there is any benefit to be had at all from splitting the file, and that it would do more harm than good.

Completely see this perspective. I would not want in any way to introduce disruption.

That said, I’m probably missing something important that you’re concerned about, and so I’d want to make sure we got that documented, before discussing a solution. It would probably be good to open up as a separate issue for the specific problem(s) you see, which we can reference here, so that the roadmap solution is “Solve Problem X” rather than “Do Thing Y”

I think the challenge here, for all of us as volunteers, is the #dayjobs vs time to invest in the architectural stuff and writing up documentation.

Clearly, though we have a mailing list and the ability to communicate via GitHub or DM, we can discuss things, but I wonder if we might benefit from some form of ability to announce things like changes or proposals and/or poll the integrators/users about their biases.

sleevi commented 4 years ago

I had seen the scaling issue represented as a design concern, and candidly the PUBLIC section seems like it is where the majority of the expansion is occurring. On the one hand I see expressed that the file size is increasing, and as I review the PRs / Issues, the majority seem to be focused in the PRIVATE section. If this is less of an issue than I see, I really don't have a hill to defend or die on for this.

Right, every known consumer wants both, so splitting doesn’t solve any problems for consumers. It also doesn’t reduce the number of PRs, and having changes go to different files just increases complexity without compelling benefits (at least, AIUI; if there are overlooked benefits, we should nail them down)

Clearly, though we have a mailing list and the ability to communicate via github or dm, we can discuss things, but I wonder if we might benefit from some form of ability to announce stuff like changes or proposals and or poll the integrators/users about their biases.

It seems like we have that already, as you mention? It’s not clear to me what would be missing in that?

dnsguru commented 4 years ago

From my POV we closed the discussion on splitting the file into two sections - just using my leaf blower on the remnants of the chalk dust from the outline of that horse.

Moving on

...Announcements/Polls

It seems like we have that already, as you mention? It’s not clear to me what would be missing in that?

To answer that, let's journey back to the initial issue -

We thus need to figure out a roadmap for how the PSL will be maintained and scale, what use cases it will consider and not consider, and if and how to wean existing consumers off it, in the search for better solutions.

IF we embark on that type of roadmap dialog, should we not engage the integrators, users and consumers of the PSL? I assert that most of them blindly download the .dat file and are not on mailing lists or monitoring this on github.

I am not saying or recommending we do it, but it seems that the most effective manner to reach the largest number of PSL interested parties might be to tweak the file header to include announcements of some sort in a manner that would let us engage them w/o breaking stuff.

dnsguru commented 4 years ago

I am closing a few lingering issue reports, and have caught some meta issues that I'll document in issues which we could incorporate into a roadmap concept and reference in this Issue.

weppos commented 4 years ago

I agree with @sleevi that splitting would not solve the problem of the management of the private section. It may solve other problems, but I find myself in great agreement with @sleevi's statement

“Solve Problem X” rather than “Do Thing Y”

I strongly believe automating the submission and validation process is the key. I do have some proposals on how to make it happen leveraging a slightly revised version of the DNS validation we use today. I hope to be able to find the time to make a prototype.

In short, I'd like to:

  1. Adjust the current DNS-based validation to be self-referencing: the main blocker for an automated process is that the DNS entry we require today references a GitHub ticket. You need to have a ticket to add the DNS entry, and we often see people opening the PR with the changes, getting the ID, then adding the record. This is not practical from the automation POV. Ideally, the DNS entry should be self-referencing so that any automated tool can validate it.
  2. Build a tool that can be run, given a set of hostnames, and that will perform the necessary DNS-based validation (similar to what Let's Encrypt is doing today with the DNS challenge... although much simpler)
  3. Automate the validation, and possibly the submission
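A minimal sketch of what step 2 could look like. The `_psl.<host>` record name matches today's PSL convention, but the helper name and the idea that the record carries a self-referencing token are illustrative, not a settled design:

```shell
# Sketch of an automated validation check (step 2 above). The value of
# the _psl TXT record is compared against an expected token; the exact
# token format is the open question discussed below.
check_psl_record() {
  host="$1" expected="$2"
  found=$(dig +short TXT "_psl.${host}" | tr -d '"')
  if [ "$found" = "$expected" ]; then
    echo "OK: ${host}"
  else
    echo "FAIL: ${host} (got '${found:-no record}')" >&2
    return 1
  fi
}
```

A submission bot would run this for every hostname a PR touches and proceed only when all checks pass.
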

dnsguru commented 4 years ago

I strongly believe automating the submission and validation process is the key. I do have some proposals on how to make it happen leveraging a slightly revised version of the DNS validation we use today.

Could this be leveraged for automation of removals at some point?

dnsguru commented 4 years ago

Adjust the current DNS-based validation to be self-referencing

What would this look like? The LE process deals with a "token" they provide; for all intents and purposes, the # of the PR within the _psl TXT record currently helps indicate to me that there is a tether to the PR.

sleevi commented 4 years ago

The Git hash of the modified version of the PSL, for example. You could compute that prior to submitting the PR by making your modifications against the current HEAD.
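One concrete reading of that suggestion, sketched below. The file name is the real one from the repository; the helper and the workflow around it are assumptions:

```shell
# Sketch: after editing the list locally against the current HEAD,
# compute the blob hash Git would assign the modified file, and
# publish that as the _psl TXT value before opening the PR.
psl_blob_hash() {
  git hash-object "$1"   # prints the 40-hex SHA-1 of the file content
}
# Usage (against a local checkout):
#   psl_blob_hash public_suffix_list.dat
```
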

dnsguru commented 4 years ago

I am all for automating as much as we can, and leveraging the DNS infrastructure where possible for it.

Not trying to go too far down the road on being prescriptive, but the RFC 8552 stuff that a zone admin might add could hold a TXT record that matches the Git user handle, so we could know who's an authoritative rep.

peterthomassen commented 4 years ago

The Git hash of the modified version of the PSL, for example. You could compute that prior to submitting the PR by making your modifications against the current HEAD.

I suppose the objective is that such a string would never end up in the DNS, unless so intended by an authorized admin. That can also be achieved by using a hash of the concatenation of psl: and the suffix, maybe with a version tag. That way, the hash does not depend on the state of HEAD, but only on the suffix itself, which would appear more stable to me.
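For illustration, that scheme might look like the following; the comment above only specifies a hash of the concatenation of psl: and the suffix, so the choice of SHA-256 here is an assumption:

```shell
# Sketch of the suggested token: hash of "psl:" + suffix. Because the
# token depends only on the suffix (not on the state of HEAD), an
# amended or resubmitted PR keeps the same DNS record valid.
psl_token() {
  printf 'psl:%s' "$1" | sha256sum | awk '{print $1}'
}
# psl_token example.com  → a 64-hex string to publish at _psl.example.com
```
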

sleevi commented 4 years ago

I suppose the objective is that such a string would never end up in the DNS, unless so intended by an authorized admin. That can also be achieved by using a hash of the concatenation of psl: and the suffix, maybe with a version tag. That way, the hash does not depend on the state of HEAD, but only on the suffix itself, which would appear more stable to me.

I don’t think stability is necessarily a goal here. The goal is to be able to quickly authenticate a pull request, which is why the current method uses the PR number. It’s fairly common for a PR to modify multiple domains.

To be clear, it is not that someone needs to continually be updating this value. The primary objective is merely authenticating the PR.

weppos commented 4 years ago

Could this be leveraged for automation of removals at some point?

Possibly, yes.

Regarding how this is going to work, I'm working to get a proposal out for feedback. I am trying to stay away from using anything connected to how we manage the list. In other words, using something related to Git would make the process strictly tied to how we use Git today, similar to the fact that the DNS TXT record today references the GitHub repo.

I am more inclined to find something that doesn't require any extra shared state besides the suggested PSL change and the hostname itself. That would be sufficient, in combination with the fact the authentication is ultimately whether the user can edit the DNS records or not.

peterthomassen commented 4 years ago

I suppose the objective is that such a string would never end up in the DNS, unless so intended by an authorized admin. That can also be achieved by using a hash of the concatenation of psl: and the suffix, maybe with a version tag. That way, the hash does not depend on the state of HEAD, but only on the suffix itself, which would appear more stable to me.

I don’t think stability is necessarily a goal here. The goal is to be able to quickly authenticate a pull request, which is why the current method uses the PR number. It’s fairly common for a PR to modify multiple domains.

True. In an earlier comment, it was said that the PR number should be replaced by something self-referencing. The question is what "self" should be: Should it identify the change, or should it identify the candidate public suffix at which the record is added?

The latter has the advantage that, if changes in a PR are required, that would not invalidate the verification records configured in the DNS prior to PR submission; they could be reused for a changed or even a completely new PR. That is what I meant with stability; I did not mean long-term stability for continued verification.

The same goal can be achieved by allowing the hash of any commit within the PR as a verification token. The invalidation problem upon PR changes can then be avoided by adding changes as new commits, squashing them at merge time. However, I think that's more complicated for users.

I have no stakes in this, I simply proposed this because I thought it covers the requirements (as far as they are known to me) and is suitable to reach the goal with minimal friction.

benaubin commented 4 years ago

A significant majority of the entries in the PRIVATE section of the list are simply entries of a domain without any use of the list's special features or syntax. Inclusion on the list is simply used as a signal that subdomains are untrusted, mostly for cookie security.

A list entry is basically a reflection/descriptor of a domain's DNS configuration. To enable automatic updates to the list, each domain's entry could be stored inside a TXT record on the domain.

An automated system could automatically update the list by checking the TXT record. There would be no need for tokens - presence of the record would be enough authentication to indicate authorization.

The ability to manage DNS is enough to indicate intent to be on the list, as having the authorization to manage DNS is the authorization you need in order to manage DNS records of subdomains.
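A hypothetical sketch of that idea. Reusing the `_psl` record name follows the existing convention, but treating its value as the domain's own list entry is the proposal here, not current practice, and the compiler loop is illustrative:

```shell
# Hypothetical: each domain publishes its own list entry in a TXT
# record, and the list is compiled by reading those records back.
entry_for() {
  dig +short TXT "_psl.$1" | tr -d '"'
}
# A compiler over a file of submitted domains (file name illustrative):
#   while read -r d; do entry_for "$d"; done < submitted_domains.txt
```
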

Further, I don't necessarily think there needs to be a central list of private domains. I can't think of any reasonable use case where the PRIVATE list is used for anything except lookups - there's no need for enumeration.

We could instead standardize a DNS record indicating the status of a domain as a "public suffix." I'm not 100% sure what to call it, but maybe something like PUDI (for "public unrestricted domain issuance") or maybe DTP ("domain trust policy"). Consumers of the current list would instead retrieve the record from DNS.

This wouldn't be a ton of overhead - DNS is very fast and designed for pretty much this use-case (at its essence, it's a distributed hosts list).

For example, on first connection to a domain, browsers could request and cache that record, and use its value to enforce cross-origin policies. DNS lookups add relatively low latency to requests (which already require a network connection), and the result is cacheable.

I can't think of a use-case where enumeration or offline lookups are required - and I think standardizing a DNS record would be a much more maintainable strategy.

Using DNS records as a basis for generating the list would also mean no complex authorization schemes. All that would be required would be submitting a domain to an automated system which compiles the list from the records. However, there'd have to be thought put into anti-spam/abuse protection for the automated system.

sleevi commented 4 years ago

Ben: Thanks for your suggestions! It might be worthwhile to visit the archives of the IETF DBOUND mailing list, which explored what you proposed and looked at the real tradeoffs different client use cases had to be concerned about.

You might also find https://github.com/sleevi/psl-problems/blob/master/README.md helpful for historic context about why the list in its current form exists.

Hope this helps!

benaubin commented 4 years ago

Thanks for the links, @sleevi! The design of the internet is fascinating to me. Especially interested in HTTP State Tokens as an alternative to cookies.

Will definitely read more of the archives from the DBOUND list, but I'm glad more experienced people than myself have already considered that option.

Anyways, would it be possible to use similar dns records to automate maintenance of the list itself? That would at least answer the "when can we remove entries?" question.

However, I understand additions to the list are obviously much more difficult in order to prevent abuse. What if there were an automated system which charged a nominal fee ($25?) similar to domain registration? The funds could go to the IETF or a similar non-polarizing public-benefit organization and be used to dissuade people from using the list to circumvent things such as Let's Encrypt's rate limiting and, with guidance, help a requestor better understand the economic impacts of addition. A manual review option could be available for projects who could not afford the fee.

Plus, money changing hands through conventional means almost always leads to auditable identity and accountability in cases of abuse.

peterthomassen commented 4 years ago

Consumers of the current list would instead retrieve the record from DNS.

While your point is the operation of the PSL through the DNS, I wanted to point out that consumers can already use the DNS for querying the PSL, see https://publicsuffix.zone/.

sleevi commented 4 years ago

Thanks again for your suggestions.

There are zero plans to ever charge for the list, and it would be counter to the goals and ethos that created that list, just like we would not charge to submit open-source patches.

Automation has been heavily discussed by the PSL maintainers (and I believe some of that is archived in the psl-discuss@ mailing list). It can indeed be helpful, but does not solve the removal problem without having all domains on an automated solution. I believe the discussion around those tradeoffs was public, although it may have been on the older, maintainers-only mail list unfortunately.

benaubin commented 4 years ago

@sleevi That makes a lot of sense. I admire the work y'all do.

Let me know if there's anything I can do to help.

dnsguru commented 4 years ago

Thanks, Ben - ideas and energy always appreciated.

The big challenge in making any changes come from the diversity of usage and expectations of status quo that are present in use cases out there in the wild.

We are seeing where Let's Encrypt or other CAs are using the PSL as a fast fail on requests, and other services do something similar with respect to domain behaviors.

Depending upon the use case, the list may be used in part or in whole in a variety of ways, and we got stuck in dbound because it was challenging to even define a "public suffix" due to the diversity of use cases.

Some wanted to have a top down authority chain from the root, akin to DNSSEC. This breaks some use cases.

Some wanted to publish all their info from their zone. This might work, but without a 100% replacement that is backward compatible it would mean the costs of operating a parallel system and dealing with syncing plus authority issues.

We are volunteers here, and are not looking to further reduce the opportunity to spend time with families, or to distract from day jobs if one is fortunate enough to have one right now.

So the PR# in the _psl.foo.bar TXT record gives us validation/verification for now. Automation of this would be helpful in speeding things up, but the manual step does ensure a quick human review of the rationale of the request and validation of the DNS.
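For reference, that tether can be inspected with a single query. The `_psl` record name and the PR-reference value reflect today's documented process; the helper name is mine:

```shell
# The current verification tether: a TXT record at _psl.<domain> whose
# value references the GitHub pull request making the change.
psl_tether() {
  dig +short TXT "_psl.$1" | tr -d '"'
}
# Example:
#   psl_tether example.com
#   # value typically looks like https://github.com/publicsuffix/list/pull/NNNN
```
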

We want to have some connective tissue between the PR and the administration of a domain name. The PR is specific to a given change, so it is a reasonable assurance of verification.

Though it may sound like a bias toward ensuring technical review is present, adding the TXT record tends to trigger a technical scrutiny that helps demonstrate that someone is aware of the impacts of adding or removing an entry and will put some thought into it.

With respect to money being collected... I also believe, like Ryan, that introducing charging money at this time would not be a good idea.

It may set expectations, and also might advantage or disadvantage some, and it has not been something that has a cost other than git ability and patience.

Perhaps in general, my only shift in this attitude about charging would be ensuring that the hosting of the list is covered in the future, should github change their model, or if there were costs that need covering to evolve the service.

-J


dnsguru commented 4 years ago

Updates: See