kubernetes / k8s.io

Code and configuration to manage Kubernetes project infrastructure, including various *.k8s.io sites
https://git.k8s.io/community/sig-k8s-infra
Apache License 2.0

Sunset for k8s.gcr.io repository #4872

Closed. dims closed this issue 10 months ago.

dims commented 1 year ago

Here are the community blogs and announcements so far around k8s.gcr.io

However, we are finding that the numbers don't add up: we will exhaust the GCP cloud credits that make up our budget well before Dec 31, 2023. So we need to do something more drastic than just the freeze. Please see the thread in #sig-k8s-infra: https://kubernetes.slack.com/archives/CCK68P2Q2/p1677793138667629?thread_ts=1677709804.935919&cid=CCK68P2Q2

We will need to start by enumerating the images that carry the biggest cost (storage + network) and removing them from k8s.gcr.io right away (possibly by the freeze date, April 3rd). Some data is in the thread, but we will need to revisit the logs, come up with a clear set of images based on agreed criteria, and announce their deletion as well. Note that this specific set of images will still be available in the new registry, registry.k8s.io, so folks will have to fix their Kubernetes manifests / Helm charts etc. as we mentioned in the 3 URLs above.
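For illustration only (not from the original comment), a minimal sketch of that manifest fix-up, assuming your YAML lives in a local directory (the `./manifests` path is a placeholder):

```bash
# Rewrite image references from the old registry to registry.k8s.io in local manifests.
# Uses GNU sed; point grep at wherever your YAML / rendered charts actually live.
grep -rl 'k8s\.gcr\.io' ./manifests | xargs sed -i 's#k8s\.gcr\.io#registry.k8s.io#g'
```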

Thoughts on a deadline for deletion of k8s.gcr.io: the freeze is on April 3rd 2023 (10 days before 1.27 is released) and we expect to send comms out at KubeCon EU (April 18–21). How about we put the marker at the end of June? (That way we get roughly 6 months of cost savings this year.)

Risk: we will end up interrupting clusters that are working right now. Given the traffic patterns, a bunch of these will be in AWS, but this is very likely to affect anyone who has an older working cluster that they haven't touched in a while.

What I have enumerated above is just the beginning of the discussion. Please feel free to add your thoughts below, so we can then draft a KEP around it.

dims commented 1 year ago

cc @kubernetes/sig-architecture-leads @kubernetes/sig-release-leads

enj commented 1 year ago

@dims can we start by having brownouts of the old registry (they should start immediately)?

sftim commented 1 year ago

Let's aim to very clearly communicate a recommended approach (e.g. mirror the images that you depend on, or use a pull-through cache, or...) and consider the lead time on those comms when we pick a date.
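As one hedged illustration of the "mirror the images you depend on" option (the destination registry name below is a placeholder, and `crane` from go-containerregistry is just one of several tools that can do this):

```bash
# Copy an image you depend on into a registry you control, then reference the mirror
# from your manifests instead of the upstream registry.
crane copy registry.k8s.io/pause:3.9 registry.example.com/mirror/pause:3.9
```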

The comms plan does not have to be perfect, it just has to be good enough.

dims commented 1 year ago

@sftim agree. Recommended approach, so far:

dims commented 1 year ago

> @dims can we start by having brownouts of the old registry (they should start immediately)?

@enj yep, agree. The brownout we had in mind was as Arnaud mentioned here: https://kubernetes.slack.com/archives/CCK68P2Q2/p1677793564552829?thread_ts=1677709804.935919&cid=CCK68P2Q2

enj commented 1 year ago

> @enj yep, agree. The brownout we had in mind was as Arnaud mentioned here: https://kubernetes.slack.com/archives/CCK68P2Q2/p1677793564552829?thread_ts=1677709804.935919&cid=CCK68P2Q2

@dims I suppose deleting images is one form of brownout... I was more thinking that we have the old registry return 429 errors every day at noon for a few hours. The transient service disruption will get people's attention.

dims commented 1 year ago

@enj k8s.gcr.io is GCR-based and has only a few folks left to take care of it. Last year some helpful folks tried to set up redirects (automatic, from k8s.gcr.io to registry.k8s.io) for a small portion of traffic and ran into snags, so we can't do much over there other than delete images.

Details are in this thread: https://kubernetes.slack.com/archives/CCK68P2Q2/p1666725317568709

enj commented 1 year ago

@dims makes sense. One suggestion that may also not be implementable would be to temporarily delete and then recreate image tags to cause pull failures (another form of brownout).

dims commented 1 year ago

Year-to-date GCP billing data; please see here: GCP_Billing_Report-year-to-date.pdf

($682,683.81 year-to-date / 62 days from Jan 1 to March 4) * 365 = $4,019,025.65 (our budget/credits is $3M)
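For reference, the figure above is a straight-line extrapolation of spend-to-date; a one-liner to reproduce it:

```bash
# Spend-to-date divided by days elapsed, scaled to a full year (≈ $4.02M against a $3M budget).
awk 'BEGIN { printf "%.2f\n", 682683.81 / 62 * 365 }'
```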

sftim commented 1 year ago


One option we have is to actually delete some images, and then optionally reinstate them per https://github.com/kubernetes/k8s.io/issues/4872#issuecomment-1454951290. A 429 is subject to Google's say-so, but deleting an image is something we can Just Do™, so long as the comms are in place to explain why.

dims commented 1 year ago

@sftim yes, we will have a limited set of images that we will delete ASAP! (and will NOT reinstate them). @hh and folks are coming up with the high-traffic / costly image list as the first step. Our comms will depend on what's in that list.

dims commented 1 year ago

xref: https://github.com/kubernetes/k8s.io/issues/4738

dims commented 1 year ago

An energetic discussion with @thockin here https://kubernetes.slack.com/archives/CCK68P2Q2/p1678118252030639

BenTheElder commented 1 year ago

I think we can do broad brownouts ahead of any final sunset by toggling the access controls on the 3 backing GCR instances. To make the images publicly readable, we set the backing GCS bucket to grant read permission to allUsers; we could invert that and then put it back on a schedule, gradually increasing the period of total non-availability.
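A rough sketch of that kind of toggle (the bucket name is a placeholder, and the exact IAM role the project actually grants may differ):

```bash
# Start a brownout window: drop public read access on the GCR backing bucket.
gsutil iam ch -d allUsers:objectViewer gs://example-gcr-backing-bucket
# End the brownout window: restore public read access.
gsutil iam ch allUsers:objectViewer gs://example-gcr-backing-bucket
```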

Doing this is a big deal, and I'm not sure what the time frame should be. We know that users are very slow to migrate, and that doing this will disrupt their base "cloud-native" infrastructure. (E.g., I saw some recent data that Kubernetes 1.11, from 2018, is still reasonably popular!)

dims commented 1 year ago

Some data from @justinsb:

[charts attached in the original comment]

dims commented 1 year ago

Some good discussion with @TheFoxAtWork here: https://cloud-native.slack.com/archives/CSCPTLTPE/p1678219030800149 on #tag-chairs channel on CNCF slack

This will likely break a lot of clusters and organizations, but it is certainly a good wake up call to the world that even open source has its costs. I know this is drastic, but we’ve broken the internet before, this one at least is more well coordinated with plenty of advance warnings. We can’t go to everyone personally, so we do our best with the time and energy we have available to us as open source volunteers and community members. Side note, eliminating older versions and forcing upgrades is a huge global security uplift.

I would also recommend (though this is likely already done) to work with the Ambassadors, Marketing Team, and other Foundations.

TheFoxAtWork commented 1 year ago

@dims I want to confirm what I'm looking at in the chart (I understand there is a new one in the works): can you confirm that each colored bar shows who/what is primarily requesting the images? If so, has AWS/Amazon been engaged to redirect the requests they field to registry.k8s.io? Have we done this with other cloud providers? (I know I'm late to the party in trying to understand what has already been completed.)

chris-short commented 1 year ago

@dims @rothgar and I are engaging folks on the AWS side.

dims commented 1 year ago

@TheFoxAtWork yep, there has been a bunch of back and forth.

chris-short commented 1 year ago

Has anyone pinged Microsoft? I don't know where Azure stands at the moment.

dims commented 1 year ago

A single-line kubectl command to find images from the old registry:

A Kyverno and Gatekeeper policy to help folks!

A kubectl/krew plugin:
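For the first item above, a one-liner along these lines (illustrative; the command actually linked may differ slightly):

```bash
# List every image referenced by running pods and flag the ones still pulled from the old
# registry. Note: this checks regular containers only; initContainers need extra jsonpath.
kubectl get pods --all-namespaces -o jsonpath="{.items[*].spec.containers[*].image}" \
  | tr -s '[[:space:]]' '\n' | sort | uniq -c | grep 'k8s\.gcr\.io'
```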

dims commented 1 year ago

FAQ(s) we are getting asked:

TheFoxAtWork commented 1 year ago

I attempted to pull a lot of the details from this ticket into a single LinkedIn post for sharing, in case it helps: https://www.linkedin.com/posts/themoxiefox_action-required-update-references-from-activity-7039245748525256704-IrES

dims commented 1 year ago

Some good news from @BenTheElder here - https://kubernetes.slack.com/archives/CCK68P2Q2/p1678299674725429

[screenshot attached in the original comment]

chris-short commented 1 year ago

AWS just posted a bulletin in its Stack Overflow collective: https://stackoverflow.com/collectives/aws/bulletins/75676424/important-kubernetes-registry-changes

chris-short commented 1 year ago

I chatted with @jeremyrickard at Microsoft. They are all over this.

dims commented 1 year ago

Question: when the new k8s.gcr.io->registry.k8s.io redirection takes effect, what is likely to fail?

thockin commented 1 year ago

Touching on the topic of network-level firewalls or other things causing impact:

This is fairly easily tested - run a pod which uses a "registry.k8s.io" image in your cluster(s). If it is able to pull that image, you're almost certainly OK. If not, debug now before the redirect goes live (next week, we hope).
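A minimal version of that test (the pod name and image tag below are placeholders; any image hosted on registry.k8s.io that you already use works):

```bash
# Try to pull an image from the new registry inside the cluster; if this pod starts,
# the redirect target is reachable from your nodes.
kubectl run registry-k8s-io-test --image=registry.k8s.io/pause:3.9 --restart=Never
kubectl get pod registry-k8s-io-test   # look for Running, not ImagePullBackOff / ErrImagePull
kubectl delete pod registry-k8s-io-test
```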

recollir commented 1 year ago

How will the redirect work? Just at the DNS level? I have tried this locally myself, but containerd/Docker, obviously and for the right reasons, complains about the certificate mismatch between k8s.gcr.io and registry.k8s.io. I worked around it by downloading the ca.crt and installing it locally for containerd/Docker.

tuapuikia commented 1 year ago

> Some good news from @BenTheElder here - https://kubernetes.slack.com/archives/CCK68P2Q2/p1678299674725429

Do we have enough bandwidth on registry.k8s.io?

BenTheElder commented 1 year ago

> How will the redirect work? Just at the DNS level? I have tried this locally myself, but containerd/Docker, obviously and for the right reasons, complains about the certificate mismatch between k8s.gcr.io and registry.k8s.io. I worked around it by downloading the ca.crt and installing it locally for containerd/Docker.

HTTP 3XX redirect, not DNS. No cert changes.

You can test by taking any image you would pull and substituting registry.k8s.io instead of k8s.gcr.io. All images in k8s.gcr.io are in registry.k8s.io.

The only difference between doing this test and the redirect will be your client reaching k8s.gcr.io first and then following the redirect, but presumably k8s.gcr.io was already reachable for you if you're switching, and all production-grade registry clients follow HTTP redirects.

The same existing GCR endpoint will serve the redirect instead of the usual response. Existing GCR image pulls already involve redirects to backing storage, just not redirects to registry.k8s.io.
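An illustrative version of the substitution test mentioned above (image and tag are examples; any image you currently pull from k8s.gcr.io works):

```bash
# Pull the same image by its registry.k8s.io name; success means the new endpoint works for you.
docker pull registry.k8s.io/kube-apiserver:v1.26.0
# Or check reachability without downloading layers, e.g. with crane:
crane manifest registry.k8s.io/kube-apiserver:v1.26.0 > /dev/null && echo "reachable"
```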

> Do we have enough bandwidth on registry.k8s.io?

We should have more than enough capacity on https://registry.k8s.io; we've looked at traffic levels for k8s.gcr.io and planned accordingly. We aren't hitting bandwidth limits on GCR either, just the impractical cost of serving ever-increasing cross-cloud bandwidth.

registry.k8s.io gives us the ability to offload bandwidth-intensive image layer serving to additional hosts securely. We're doing that on GCP (Artifact Registry, Cloud Run) and now AWS (S3) thanks to additional funding from Amazon and we will be serving substantially less expensive egress traffic. In the future it might include additional hosts / sponsors (https://registry.k8s.io#stability).

Just serving AWS traffic (which is the majority) from region-local AWS storage should bring us back within our budgets.

We have a lot more context in the docs (https://registry.k8s.io) and this talk https://www.youtube.com/watch?v=9CdzisDQkjE

recollir commented 1 year ago

@BenTheElder 👍

dims commented 1 year ago

Experiment results for the k8s.gcr.io->registry.k8s.io redirect last October: https://kubernetes.slack.com/archives/CCK68P2Q2/p1666725317568709

dims commented 1 year ago

xref: https://github.com/kubernetes/website/issues/39887

dims commented 1 year ago

This text may get dropped from the blog post being drafted for the automatic redirects, so saving it here:

Technical Details

The new registry.k8s.io is a secure blob redirector that allows the Kubernetes project to direct traffic based on the request IP to the best possible blob storage for the user. If a user makes a request from an AWS region network and pulls a Kubernetes container image, for example, that user will be automatically redirected to pull an image from the closest S3 bucket image layer store. For the current decision tree, refer to this architecture decision tree [2]. To be clear, the new registry.k8s.io implementation allows the upstream project to host registries on more clouds in the future, not just GCP and AWS, which will increase stability, reduce cost, and speed up both downloads and deployments. Please do not rely on the internal implementation details of the new image registry, as these can change without notice.

Please note the upstream Kubernetes teams are working to provide additional communication, and the situation around how long the old registry remains is still being discussed.

[1]: https://kubernetes.io/blog/2023/02/06/k8s-gcr-io-freeze-announcement/
[2]: https://github.com/kubernetes/registry.k8s.io/blob/main/cmd/archeio/docs/request-handling.md
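To see the redirector behavior described above in action, one hedged sketch (the digest is a placeholder to be taken from the manifest output, not a real value):

```bash
# Fetch an image manifest through registry.k8s.io, pick a layer digest from the output,
# then request that blob and inspect the Location header to see which backing store
# (GCP or a region-local AWS S3 bucket) the redirector points you at.
crane manifest registry.k8s.io/pause:3.9
curl -sI "https://registry.k8s.io/v2/pause/blobs/sha256:<digest-from-manifest>" | grep -i '^location'
```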

afbjorklund commented 1 year ago

The first step for minikube will be to start adding --image-repository=registry.k8s.io to the old kubeadm commands.

Probably add it for all kubeadm versions before 1.25.0; it shouldn't hurt anything if it is already the default registry...

The second step is to retag all the older preloads with the new registry, so they keep working air-gapped (it is a rather small download anyway).

Some mirrors might still use a "k8s.gcr.io" subdirectory, which is fine, so this change is only for the default registry.


The main issue is that the people pulling those older Kubernetes releases also use older versions of minikube.

Or we invalidate old caches and have people pull "new" versions of the same images, just under a different name...

~/.minikube/cache/images/amd64 : k8s.gcr.io/pause_3.6 -> registry.k8s.io/pause_3.6

That would be somewhat counter-productive, so we are trying to "upgrade" those old caches in place (by re-tagging the images).
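As a loose illustration of that re-tagging idea (the minikube cache itself is file-based, so this is only the analogous container-runtime operation; tags are examples):

```bash
# Give an already-cached image a second name under the new registry, without re-downloading it.
docker tag k8s.gcr.io/pause:3.6 registry.k8s.io/pause:3.6
```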

BenTheElder commented 1 year ago

kubeadm had the default changed in patch releases back to 1.23 (older releases were not accepting any patches), when we published https://kubernetes.io/blog/2022/11/28/registry-k8s-io-faster-cheaper-ga/

dims commented 1 year ago

So on March 20, we'll be turning on redirects for almost everyone from k8s.gcr.io to registry.k8s.io, details here: https://kubernetes.io/blog/2023/03/10/image-registry-redirect/

So the next question will be: how many folks are still using the underlying content of k8s.gcr.io in other ways:

So we'll then have to watch how much savings we get over time. Assuming about a week of rollout starting March 20, we'll get some concrete data a week or so after that (let's say Monday, April 3rd, given we have a sawtooth pattern of usage over the week, with lows on Saturday and Sunday).
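One way to observe the redirect once it is live (illustrative only; the repository path and tag below are examples, and the exact status code may vary):

```bash
# After the cutover, a request to the old host should answer with an HTTP 3XX whose
# Location header points at registry.k8s.io.
curl -sI https://k8s.gcr.io/v2/pause/manifests/3.9 | grep -iE '^(HTTP|location)'
```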

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 10 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 10 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/k8s.io/issues/4872#issuecomment-1900581003):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

sftim commented 10 months ago

/reopen

sftim commented 10 months ago

We did this
/close

k8s-ci-robot commented 10 months ago

@sftim: Reopened this issue.

In response to [this](https://github.com/kubernetes/k8s.io/issues/4872#issuecomment-1900643371):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k8s-ci-robot commented 10 months ago

@sftim: Closing this issue.

In response to [this](https://github.com/kubernetes/k8s.io/issues/4872#issuecomment-1900643544):

> We did this
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

sftim commented 10 months ago

(but feel free to reopen if needed)