kubernetes / registry.k8s.io

This project is the repo for registry.k8s.io, the production OCI registry service for Kubernetes' container image artifacts.
https://registry.k8s.io
Apache License 2.0

document stance on allow-listing registry.k8s.io traffic #122

Closed BenTheElder closed 2 years ago

BenTheElder commented 2 years ago

We cannot afford to commit to the backing endpoints and details of registry.k8s.io being stable; the project needs to be able to take advantage of whatever resources we have available to us at any given point in time in order to keep the project afloat.

As-is, we're very close to running out of funds and in an emergency state: we are exceeding our $3M/year GCP credits, with container image hosting the dominant cost at more than 2/3 of our spend. Even in the future, when we shift traffic to other platforms using the registry.k8s.io system, we need to remain flexible and should not commit to specific backing details.

E.g., we may receive new resources from other vendors, following the current escalation with the CNCF / Governing Board.

We should clearly and prominently document an explicit stance on this, bolded, in the README of this repo (which https://registry.k8s.io points to).

We've already had requests to document the exact list of endpoints to allowlist, which is not an expectation we can sustain.

We should also consider giving pointers on how end-users can run their own mirrors; see the sketch below.

/sig k8s-infra
/priority important-soon
/kind documentation
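As one hedged example of the mirror pointers mentioned above: assuming containerd 1.5+ with the `certs.d` hosts layout enabled (`config_path` set in the CRI registry config) and a hypothetical mirror at `mirror.example.com`, a minimal sketch that writes the relevant `hosts.toml` so pulls for registry.k8s.io try the mirror first:

```python
# Sketch, not a project-endorsed setup. Assumes containerd >= 1.5 with the
# certs.d hosts layout enabled, and a hypothetical mirror.example.com.
from pathlib import Path

HOSTS_TOML = """\
server = "https://registry.k8s.io"

[host."https://mirror.example.com"]
  capabilities = ["pull", "resolve"]
"""

conf_dir = Path("/etc/containerd/certs.d/registry.k8s.io")
conf_dir.mkdir(parents=True, exist_ok=True)
(conf_dir / "hosts.toml").write_text(HOSTS_TOML)
print(f"wrote {conf_dir / 'hosts.toml'}")
```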

hh commented 2 years ago

/cc

upodroid commented 2 years ago

End users get to pull images off the internet for free. For the 0.1% of our users who run K8s on networks with restricted egress, we will share the endpoints from which our images will be served for a particular source IP, with 30 days' notice (or some other arbitrary window) if they change. If that doesn't work for customer X, then they should run mirrors at their own cost. You shouldn't be complaining about services offered for free.

~~We can make some uptime guarantees (99.5%) with the tradeoff that we can serve the images from wherever we want and we provide a pre-agreed notice period.~~

BenTheElder commented 2 years ago

I don't think we should make any timing guarantees or uptime guarantees. This is free and barely staffed or funded.

At any point, if we're ready to take advantage of new infra, we should be free to do so. Users that are sensitive to these changes due to enterprise compliance, etc., should simply host their own, with guaranteed uptime and stable implementation details and API endpoints.

We may not even be able to guarantee IP addresses. If users need to have extremely tight restrictions on this, they need to sort that out themselves going forward.

I don't think other free OSS package hosts commit to guarantees like this.
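As a concrete illustration of why IP allowlists are fragile, a stdlib-only sketch: the resolved address set for registry.k8s.io is a point-in-time answer that can differ by resolver, region, and time, not a contract:

```python
# Sketch: the addresses behind registry.k8s.io are a point-in-time answer.
# Re-run later, or from another region, and the set may differ.
import socket

addrs = sorted({info[4][0] for info in socket.getaddrinfo("registry.k8s.io", 443)})
print(addrs)
```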

BenTheElder commented 2 years ago

cc @dims @ameukam @spiffxp @thockin (k8s infra chairs + leads)

dims commented 2 years ago

I agree Ben.

ameukam commented 2 years ago

The project and the infrastructure are maintained by volunteers at the moment. We should not provide SLAs for the public workloads we host.

I agree with the proposal.

thockin commented 2 years ago

I think this is the only reasonable stance we can take. That said, I like the idea of "giving notice". We can't reach every user or fix it for them, but we CAN give warning.

What if we define a notification mechanism and send notice 2 weeks (or 30 days or something) before we add a new backend? Could be a mailing list or a git repo or a static URL or something that can be monitored. For users who need to know the full set, they can pay attention.

This shifts most of the onus back to those users who know best what they specifically need.

It doesn't need to be IPs (can't be), just the set of hostnames that the proxy might redirect them to for blob backends.

What think?
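For illustration, a stdlib-only Python sketch that captures where the proxy redirects one request, right now, for one client. The `pause:3.9` tag is just a convenient small public image, and whether or where any given request is redirected is exactly the kind of implementation detail this thread says not to rely on:

```python
# Sketch: observe which backend hostname registry.k8s.io redirects one
# request to, for this client, at this moment. Nothing here is a contract.
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None makes urllib raise HTTPError instead of following the 3xx.
    def redirect_request(self, *args, **kwargs):
        return None

opener = urllib.request.build_opener(NoRedirect)
url = "https://registry.k8s.io/v2/pause/manifests/3.9"
try:
    resp = opener.open(url)
    print(resp.status, "(served directly, no redirect)")
except urllib.error.HTTPError as err:
    if 300 <= err.code < 400:
        print(err.code, "->", err.headers.get("Location"))
    else:
        raise
```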

BenTheElder commented 2 years ago

> What if we define a notification mechanism and send notice 2 weeks (or 30 days or something) before we add a new backend? Could be a mailing list or a git repo or a static URL or something that can be monitored. For users who need to know the full set, they can pay attention.

Maybe, however ...

I can't find any precedent for this (other than the implicit, undocumented expectation we unintentionally created around k8s.gcr.io / gcr.io/google-containers).

Docker Hub, PyPI, crates.io, the Go module proxy, ... none of them appear to do anything remotely like documenting the required endpoints and waiting some period of time before rolling out new ones.

I'm not sure we should go out of our way to create a new precedent that restricts our ability to roll out optimizations (not just cost savings, but also things like serving out of new regions to limit cross-region traffic and reduce latency) and adds additional work to maintaining the registry. We've already added and removed regions multiple times as we've built this out.

aojea commented 2 years ago

> What if we define a notification mechanism and send notice 2 weeks (or 30 days or something) before we add a new backend?

Is the notification mechanism not going to have the same understaffing problem? If that happens, having a notification channel that doesn't work is worse than having nothing, IMHO.

jhoblitt commented 2 years ago

It seems that an availability guarantee for "backend" endpoints is an anti-feature, as:

1) We don't want end-consumers to start depending directly on a specific endpoint. The odds are that if a deployment uses a specific endpoint, the dependency will have a lifetime of weeks to years.
2) It ties our hands operationally for load shedding, or even for taking a misbehaving endpoint offline.
3) A client that has spent minutes to hours pulling down a layer over a poor connection should already have strong tolerance for downloads timing out.
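A minimal stdlib-only sketch of the client-side tolerance point 3) assumes: retry with exponential backoff rather than depending on any one backend staying reachable (the URL passed in is a placeholder, not a real endpoint):

```python
# Sketch: retry a flaky download with exponential backoff instead of
# assuming any single backend stays up.
import time
import urllib.error
import urllib.request

def fetch_with_retries(url: str, attempts: int = 5) -> bytes:
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except OSError:  # URLError (and socket timeouts) subclass OSError
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s between attempts
    raise RuntimeError("unreachable")
```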

BenTheElder commented 2 years ago

Drafted something here: https://github.com/kubernetes/registry.k8s.io/pull/124