kubernetes / test-infra

Test infrastructure for the Kubernetes project.

Mitigate image-pushing jobs hitting GetRequestsPerMinutePerProject quota for prow build cluster project #20652

Open spiffxp opened 3 years ago

spiffxp commented 3 years ago

What happened: Context: https://github.com/kubernetes/k8s.io/issues/1576#issuecomment-767691409

As the volume of image-pushing jobs running on the prow build cluster in k8s-infra-prow-build-trusted has grown, we're starting to bump into a GCB service quota (GetRequestsPerMinutePerProject) for the project. Unlike other quotas (e.g. max GCP instances per region), this isn't something we can request to have raised.

What you expected to happen: Have GCB service requests charged to the project running the GCB builds instead of a central shared project. Avoid bumping into API-related quota.

How to reproduce it (as minimally and precisely as possible): Merge a PR to kubernetes/kubernetes that updates multiple test/images subdirectories, or otherwise induce a high volume of image-pushing jobs on k8s-infra-prow-build-trusted

Ignore whether you bump into the concurrent builds quota (also a GCB service quota)

Members of k8s-infra-prow-viewers@kubernetes.io can visualize usage (and whether the quota is hit) here: https://console.cloud.google.com/apis/api/cloudbuild.googleapis.com/quotas?orgonly=true&project=k8s-infra-prow-build-trusted&supportedpurview=project&pageState=(%22duration%22:(%22groupValue%22:%22P30D%22,%22customValue%22:null))

Please provide links to example occurrences, if any: I don't have a link to jobs that encountered this specifically, but https://github.com/kubernetes/k8s.io/issues/1576 describes the issue, and the metric explorer link above shows roughly when we've bumped into the quota.

Anything else we need to know?: Parent issue: https://github.com/kubernetes/release/issues/1869

My guess is that we need to move away from using a shared service account in the build cluster's project (gcb-builder@k8s-infra-prow-build-trusted), and instead set up service accounts per staging project.

It's unclear to me whether these would all need access to something in the build cluster project.

A service-account-per-project would add a bunch of boilerplate to the service accounts loaded into the build cluster, and add another field to job configs that needs to be set manually vs. copy-pasted. We could offset this by verifying configs are correct via presubmit enforcement.

I'm open to other suggestions to automate the boilerplate away, or a solution that involves image-builder consuming less API quota.

/milestone v1.21
/priority important-soon
/wg k8s-infra
/sig testing
/area images
/sig release
/area release-eng
/assign @cpanato @justaugustus
(as owners of the parent issue)

spiffxp commented 3 years ago

Neat, I think I just caught this happening for k8s-testimages jobs that run in the "test-infra-trusted" cluster too

https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/post-test-infra-push-kettle/1357442531670888448#1:build-log.txt%3A57

spiffxp commented 3 years ago

Yup, it doesn't happen often, but it does happen for test-infra-trusted (screenshot instead of link since access is restricted to google.com)

[Screenshot (2021-02-05): Cloud Build quota usage for the test-infra-trusted project]

spiffxp commented 3 years ago

Migrating away from a central gcb-builder service account: this is the more generic / less one-off version of the steps I listed in https://github.com/kubernetes/test-infra/pull/20703#issuecomment-774224609

justaugustus commented 3 years ago

@cpanato -- You were last working on this. What are the next steps?

/unassign
/milestone v1.22

cpanato commented 3 years ago

@justaugustus @spiffxp sorry for the delay in replying on this. I was doing some investigation, and I'll describe my findings and the possible options I can see (you all might have other options :) )

Issue: when a Cloud Build is triggered, it sometimes fails because we receive `Quota exceeded for quota metric 'Build and Operation Get requests'`.

Aaron said this is something we cannot increase, so I did some tests using my account to simulate the same environment.

For example, in some releng cases a merged PR triggers several image builds. Those images can have more than one variant (four in some cases), which means we trigger 4+ Cloud Build jobs almost simultaneously, and that can hit the quota and fail some of the jobs.

The image-builder code responsible for triggering a build per variant is here: https://github.com/kubernetes/test-infra/blob/30af69f55010472e3032101af894f020c2484676/images/builder/main.go#L309-L321. We could add some delay between triggering each one, so all the builds aren't submitted at almost the same time; see the sketch below.
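A minimal sketch of that idea (hypothetical names and delay value, not the actual image-builder code; the real submission logic lives in `images/builder/main.go` linked above):

```go
// Hypothetical sketch: stagger the per-variant Cloud Build submissions so
// the builds (and the log-streaming GET polling they cause) ramp up
// instead of spiking. submitBuild is a stand-in for the real submission
// logic in images/builder/main.go.
package builder

import (
	"log"
	"sync"
	"time"
)

// buildDelay is an assumed stagger between variant submissions.
const buildDelay = 30 * time.Second

// buildVariants submits one build per variant, waiting buildDelay between
// submissions while still letting the builds themselves run concurrently.
func buildVariants(variants []string, submitBuild func(variant string) error) {
	var wg sync.WaitGroup
	for i, variant := range variants {
		if i > 0 {
			time.Sleep(buildDelay) // stagger the next submission
		}
		wg.Add(1)
		go func(v string) {
			defer wg.Done()
			if err := submitBuild(v); err != nil {
				log.Printf("build for variant %q failed: %v", v, err)
			}
		}(variant)
	}
	wg.Wait()
}
```

The 30s value is arbitrary here; the right stagger would depend on how fast log streaming actually polls.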

I reproduced the issue using my account by triggering ~15 jobs in parallel (logs trimmed for better visualization):

```
DEBUG: Retrying request to url https://cloudbuild.googleapis.com/v1/projects/cpanato-capg-test/locations/global/builds/4365-a9c1-?alt=json after exception HttpError accessing <https://cloudbuild.googleapis.com/v1/projects/cpanato-capg-test/locations/global/builds/4365-a9c1-?alt=json>: response:
{
  "error": {
    "code": 429,
    "message": "Quota exceeded for quota metric 'Build and Operation Get requests' and limit 'Build and Operation Get requests per minute' of service 'cloudbuild.googleapis.com' for consumer 'project_number:'.",
    "status": "RESOURCE_EXHAUSTED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.ErrorInfo",
        "reason": "RATE_LIMIT_EXCEEDED",
        "domain": "googleapis.com",
        "metadata": {
          "consumer": "projects/985606222016",
          "quota_limit": "GetRequestsPerMinutePerProject",
          "quota_metric": "cloudbuild.googleapis.com/get_requests",
          "service": "cloudbuild.googleapis.com"
        }
      }
    ]
  }
}
```

Having a service account per job might fix this, but I think the 'Build and Operation Get requests' quota is per project, not per service account. So delaying the start of builds when a job has variants might work in this case.

What are your thoughts? I can change the image-builder code to add this delay, or even check the response and, if it matches this error, wait and retry (sketched below).
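For the wait-and-retry option, a rough sketch (assumed helper names and backoff values; a real change would live in image-builder's submission path and inspect the HTTP status properly):

```go
// Hypothetical sketch: detect the 429 RESOURCE_EXHAUSTED response shown in
// the log above and retry the submission with jittered exponential backoff.
package builder

import (
	"errors"
	"log"
	"math/rand"
	"strings"
	"time"
)

// isQuotaError is a crude stand-in for inspecting the error; a real
// implementation would check the status code and the ErrorInfo reason.
func isQuotaError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "RATE_LIMIT_EXCEEDED")
}

// submitWithRetry calls submit, backing off and retrying whenever the GCB
// get-requests quota is exhausted.
func submitWithRetry(submit func() error, maxAttempts int) error {
	backoff := 30 * time.Second
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := submit()
		if err == nil || !isQuotaError(err) {
			return err
		}
		sleep := backoff + time.Duration(rand.Int63n(int64(backoff))) // add jitter
		log.Printf("attempt %d hit quota; retrying in %s", attempt, sleep)
		time.Sleep(sleep)
		backoff *= 2
	}
	return errors.New("giving up after repeated quota errors")
}
```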

spiffxp commented 3 years ago

I will respond in more detail next week. I still think we should isolate service accounts and the projects they can build. But.

A more surgical fix might be to update image-builder to invoke gcloud with the `--billing-project` flag, with the staging project as its value. That should cause quota to be counted against the staging project instead of the project associated with the service account running image-builder.
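For illustration, the change might look roughly like this (a sketch only; `submitBuild` and its parameters are made up, though `--billing-project` is a real top-level gcloud flag):

```go
// Hypothetical sketch: pass gcloud's --billing-project flag so API quota
// is charged to the staging project being built.
package builder

import (
	"os"
	"os/exec"
)

func submitBuild(stagingProject, configPath, sourceDir string) error {
	cmd := exec.Command("gcloud", "builds", "submit",
		"--project", stagingProject,
		// Count API quota (including log-streaming GET polls) against the
		// staging project rather than the shared build-cluster project.
		"--billing-project", stagingProject,
		"--config", configPath,
		sourceDir,
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}
```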

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

spiffxp commented 3 years ago

/remove-lifecycle stale

This still remains an issue. I still think either approach I've recommended above is valid:

- set up per-staging-project service accounts, instead of funneling everything through the shared gcb-builder account
- have image-builder pass `--billing-project`, so quota is charged to the staging project being built

IMO neither of these carries show-stopper tradeoffs, and I'm happy to help ensure whoever is interested has the appropriate permissions to play around on a project.

In the meantime I'm opening a PR to create a single testgrid dashboard for all image pushing jobs, so we can get a better sense of when and how often we're hitting this.

spiffxp commented 3 years ago

/milestone v1.23

spiffxp commented 2 years ago

FYI @chaodaiG this might be worth keeping in mind given the GCB design proposal you presented at today's SIG Testing meeting

spiffxp commented 2 years ago

I see this is happening all over the place in kubernetes/release, e.g. https://github.com/kubernetes/release/pull/2266

New theory: `gcloud builds submit` streams logs by default. I suspect that behind the scenes it polls periodically, at a high enough rate that N gets/s per `gcloud builds submit` × M PRs × O cloud builds exceeds the 750/min limit.
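To put rough, assumed numbers on that: if streaming polls about once per second, a single build consumes ~60 get requests per minute, so 13 or more concurrently-streaming builds in one project would already exceed a 750/min limit, before counting any other API traffic.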

We could update the builder to `gcloud builds submit --async`, periodically poll at a low rate of our choosing, and then run `gcloud builds log` once a build is done. This starts to look an awful lot like a mini GCB controller, which is what was proposed at SIG Testing this week. Unfortunately that's going to take a few months to land.
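A sketch of that shape (the gcloud subcommands are real; the wiring, names, and 30s poll interval are assumptions):

```go
// Hypothetical sketch: submit asynchronously, poll build status at a low
// rate we control, and fetch logs once at the end instead of streaming.
package builder

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

// pollInterval is a deliberately low polling rate of our choosing.
const pollInterval = 30 * time.Second

func runAsyncBuild(project, configPath, sourceDir string) error {
	// --async returns immediately; capture just the build ID.
	out, err := exec.Command("gcloud", "builds", "submit", "--async",
		"--project", project, "--config", configPath,
		"--format=value(id)", sourceDir).Output()
	if err != nil {
		return fmt.Errorf("submit: %w", err)
	}
	buildID := strings.TrimSpace(string(out))

	for {
		time.Sleep(pollInterval)
		out, err := exec.Command("gcloud", "builds", "describe", buildID,
			"--project", project, "--format=value(status)").Output()
		if err != nil {
			return fmt.Errorf("describe: %w", err)
		}
		switch status := strings.TrimSpace(string(out)); status {
		case "SUCCESS", "FAILURE", "TIMEOUT", "CANCELLED":
			// One log fetch at the end instead of continuous streaming.
			logCmd := exec.Command("gcloud", "builds", "log", buildID,
				"--project", project)
			logCmd.Stdout = os.Stdout
			logCmd.Stderr = os.Stderr
			if logErr := logCmd.Run(); logErr != nil {
				return logErr
			}
			if status != "SUCCESS" {
				return fmt.Errorf("build %s ended with status %s", buildID, status)
			}
			return nil
		}
	}
}
```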

spiffxp commented 2 years ago

Screenshot from https://console.cloud.google.com/apis/api/cloudbuild.googleapis.com/quotas?project=k8s-infra-prow-build-trusted&pageState=(%22duration%22:(%22groupValue%22:%22P30D%22,%22customValue%22:null)). Not entirely sure who can see this page, but members of k8s-infra-prow-oncall@kubernetes.io should be able to at least.

[Screenshot (2021-09-28): Cloud Build API quota page for k8s-infra-prow-build-trusted]

The bottom graph shows quota violations over time. As a reminder this is against a quota we can't raise.

So I think it's both. The shared gcb-builder service account hits this problem when it's triggering builds across too many k8s-staging projects in general, and the gcb-builder-releng-test service account hits it when triggering too many builds in parallel within its single project (maybe because of log tailing).

spiffxp commented 2 years ago

/milestone v1.24

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

BenTheElder commented 2 years ago

This seems like a valid "we really ought to fix this someday" but backlog/low-priority and no movement. Punting to someday milestone.

spiffxp commented 1 year ago

/remove-priority important-soon
/priority backlog

k8s-triage-robot commented 8 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted