Open spiffxp opened 3 years ago
Neat, I think I just caught this happening for k8s-testimages jobs that run in the "test-infra-trusted" cluster too
Yup, it doesn't happen often, but it does happen for test-infra-trusted (screenshot instead of link since access is restricted to google.com)
Migrating away from a central gcb-builder service account:
This is the more generic / less one-off version of steps I listed in https://github.com/kubernetes/test-infra/pull/20703#issuecomment-774224609
@cpanato -- You were last working on this. What are the next steps?
/unassign /milestone v1.22
@justaugustus @spiffxp sorry for the delay in replying on this, I was doing some investigations, and I will describe my findings and possible options that I can see (you all might have other options :) )
Issue: When a Cloud Build is triggered, it sometimes fails because we receive Quota exceeded for quota metric 'Build and Operation Get requests'.
Aaron said this is something we cannot increase, so I did some tests using my account to simulate the same environment.
For example, in some releng cases a PR may trigger several image builds after it merges. Those images have more than one variant (in some cases four), which means we trigger 4+ Cloud Builds simultaneously, and that can hit the quota and fail some jobs.
The image-builder code in this snippet is responsible for triggering the jobs when a build has variants: https://github.com/kubernetes/test-infra/blob/30af69f55010472e3032101af894f020c2484676/images/builder/main.go#L309-L321. We could add some delay before triggering the next variant so that all jobs aren't pushed at almost the same time.
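For illustration, a minimal sketch of that staggering idea (this is not the actual image-builder code; the variant names and the 30s delay are arbitrary placeholders):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runVariantBuild stands in for image-builder's existing per-variant
// "gcloud builds submit" invocation.
func runVariantBuild(variant string) error {
	fmt.Println("submitting build for variant", variant)
	return nil
}

// buildVariants launches one build per variant, but waits `stagger`
// between launches so the submissions (and gcloud's log polling)
// don't all hit the Cloud Build API at the same instant.
func buildVariants(variants []string, stagger time.Duration) []error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for i, variant := range variants {
		if i > 0 {
			time.Sleep(stagger)
		}
		wg.Add(1)
		go func(v string) {
			defer wg.Done()
			if err := runVariantBuild(v); err != nil {
				mu.Lock()
				errs = append(errs, fmt.Errorf("variant %s: %w", v, err))
				mu.Unlock()
			}
		}(variant)
	}
	wg.Wait()
	return errs
}

func main() {
	// Variant names are only examples.
	_ = buildVariants([]string{"amd64", "arm64", "ppc64le", "s390x"}, 30*time.Second)
}
```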
I reproduced the issue using my account by triggering ~15 jobs in parallel (logs cleaned up for better visualization):
DEBUG: Retrying request to url https://cloudbuild.googleapis.com/v1/projects/cpanato-capg-test/locations/global/builds/4365-a9c1-?alt=json after exception HttpError accessing <https://cloudbuild.googleapis.com/v1/projects/cpanato-capg-test/locations/global/builds/4365-a9c1-?alt=json>: response: "error": {
"code": 429,
"message": "Quota exceeded for quota metric 'Build and Operation Get requests' and limit 'Build and Operation Get requests per minute' of service 'cloudbuild.googleapis.com' for consumer 'project_number:'.",
"status": "RESOURCE_EXHAUSTED",
"details": [
{
"@type": "type.googleapis.com/google.rpc.ErrorInfo",
"reason": "RATE_LIMIT_EXCEEDED",
"domain": "googleapis.com",
"metadata": {
"consumer": "projects/985606222016",
"quota_limit": "GetRequestsPerMinutePerProject",
"quota_metric": "cloudbuild.googleapis.com/get_requests",
"service": "cloudbuild.googleapis.com"
}
}
]
}
}
Having a service account per job might fix this issue, but I think the 'Build and Operation Get requests' quota is per project, not per service account.
So delaying the start when jobs have variants might work in this case.
What are your thoughts? I can make the change in the image-builder code to add this delay, or even check the response and, if it matches the quota error, wait and retry.
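For illustration, a rough sketch of the wait-and-retry option (assuming the quota error surfaces in the returned error text, as in the log above; submitBuild is a hypothetical stand-in for the real submission call):

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// submitBuild stands in for whatever actually triggers the build
// (in image-builder's case, shelling out to "gcloud builds submit").
func submitBuild(variant string) error {
	return nil
}

// isQuotaError does a crude match against the error text seen in the log above.
func isQuotaError(err error) bool {
	return err != nil &&
		(strings.Contains(err.Error(), "RESOURCE_EXHAUSTED") ||
			strings.Contains(err.Error(), "Quota exceeded"))
}

// submitWithRetry retries the submission with exponential backoff, but only
// when the failure looks like the rate-limit error; other errors fail fast.
func submitWithRetry(variant string, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = submitBuild(variant); err == nil {
			return nil
		}
		if !isQuotaError(err) {
			return err
		}
		fmt.Printf("quota error on attempt %d, retrying in %s\n", i+1, backoff)
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	if err := submitWithRetry("amd64", 3, 10*time.Second); err != nil {
		fmt.Println(err)
	}
}
```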
I will respond in more detail next week. I still think we should isolate service accounts and the projects they can build to.
That said, a more surgical fix might be to update image-builder to invoke gcloud with the --billing-project flag, with the staging project as its value. That should cause quota to be counted against the staging project instead of the project associated with the service account running image-builder.
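A rough sketch of what that could look like, assuming image-builder keeps shelling out to gcloud (the project, config, and directory values are placeholders, not the actual image-builder code); the caller would likely also need serviceusage.use on the billed project:

```go
package main

import (
	"os"
	"os/exec"
)

// submitBuild sketches how a gcloud invocation could pass --billing-project
// so that API quota is charged to the staging project rather than to the
// project that owns the builder service account.
func submitBuild(stagingProject, cloudbuildConfig, sourceDir string) error {
	cmd := exec.Command("gcloud", "builds", "submit",
		"--project", stagingProject,
		"--billing-project", stagingProject, // charge Get-request quota here
		"--config", cloudbuildConfig,
		sourceDir,
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Placeholder project, config, and source directory.
	_ = submitBuild("k8s-staging-example", "cloudbuild.yaml", ".")
}
```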
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
This still remains an issue. I still think either approach I've recommended above is valid:
- ensure_staging_gcb_builder_service_account (in https://github.com/kubernetes/k8s.io/blob/main/infra/gcp/ensure-staging-storage.sh) to set up a builder service account per staging project; it might result in more boilerplate though
- gcloud --billing-project for whichever project it's pushing to might be a quicker fix, assuming everything runs in and pushes to its same respective GCP project; it might require granting serviceusage.use on a lot of projects to a singular service account, which feels less great from a security perspective

IMO neither of these are show-stopper tradeoffs, and I'm happy to help ensure whomever is interested has the appropriate permissions to play around on a project.
In the meantime I'm opening a PR to create a single testgrid dashboard for all image pushing jobs, so we can get a better sense of when and how often we're hitting this.
/milestone v1.23
FYI @chaodaiG this might be worth keeping in mind given the GCB design proposal you presented at today's SIG Testing meeting
I see this is happening all over the place in kubernetes/release, e.g. https://github.com/kubernetes/release/pull/2266
New theory: gcloud builds submit streams logs by default. I suspect behind the scenes it's doing periodic polling at a high enough rate that N gets/s per gcloud builds submit x M PRs x O cloudbuilds > 750/s.
We could update the builder to use gcloud builds submit --async, periodically poll at a low rate of our choosing, and then fetch logs with gcloud builds logs once a build is done. This starts to look an awful lot like a mini GCB controller, which is what was proposed at SIG Testing this week. Unfortunately that's going to take a few months to land.
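A rough sketch of that async-submit-then-poll flow, assuming stock gcloud builds subcommands (submit --async, describe, log); the project and config names are placeholders:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

// submitAsync submits the build without streaming logs and returns the build ID.
func submitAsync(project, cloudbuildConfig, sourceDir string) (string, error) {
	out, err := exec.Command("gcloud", "builds", "submit", "--async",
		"--project", project,
		"--config", cloudbuildConfig,
		"--format", "value(id)",
		sourceDir,
	).Output()
	return strings.TrimSpace(string(out)), err
}

// waitForBuild polls the build status at an interval of our choosing
// (much lower than gcloud's own log-streaming rate), then fetches the
// logs once after the build leaves the QUEUED/WORKING states.
func waitForBuild(project, buildID string, interval time.Duration) error {
	for {
		out, err := exec.Command("gcloud", "builds", "describe", buildID,
			"--project", project, "--format", "value(status)").Output()
		if err != nil {
			return err
		}
		status := strings.TrimSpace(string(out))
		if status != "QUEUED" && status != "WORKING" {
			fmt.Println("build finished with status:", status)
			break
		}
		time.Sleep(interval)
	}
	logs := exec.Command("gcloud", "builds", "log", buildID, "--project", project)
	logs.Stdout = os.Stdout
	logs.Stderr = os.Stderr
	return logs.Run()
}

func main() {
	id, err := submitAsync("k8s-staging-example", "cloudbuild.yaml", ".")
	if err != nil {
		panic(err)
	}
	if err := waitForBuild("k8s-staging-example", id, 30*time.Second); err != nil {
		panic(err)
	}
}
```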
Screenshot from https://console.cloud.google.com/apis/api/cloudbuild.googleapis.com/quotas?project=k8s-infra-prow-build-trusted&pageState=(%22duration%22:(%22groupValue%22:%22P30D%22,%22customValue%22:null)). Not entirely sure who can see this page, but members of k8s-infra-prow-oncall@kubernetes.io should be able to at least.
The bottom graph shows quota violations over time. As a reminder this is against a quota we can't raise.
So I think it's both. The shared gcb-builder service account will hit this problem if it's triggering builds across too many k8s-staging projects in general. And then also the gcb-builder-releng-test service account hits this problem when triggering too many builds in parallel within its single project (maybe because of log tailing)
/milestone v1.24
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
This seems like a valid "we really ought to fix this someday" but backlog/low-priority and no movement. Punting to someday milestone.
/remove-priority important-soon /priority backlog
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
What happened: Context: https://github.com/kubernetes/k8s.io/issues/1576#issuecomment-767691409
As the volume of image-pushing jobs running on the prow build cluster in k8s-infra-prow-build-trusted has grown, we're starting to bump into a GCB service quota (GetRequestsPerMinutePerProject) for the project. This isn't something we can request to raise like other quota (e.g. max gcp instances per region)
What you expected to happen: Have GCB service requests charged to the project running the GCB builds instead of a central shared project. Avoid bumping into API-related quota.
How to reproduce it (as minimally and precisely as possible): Merge a PR to kubernetes/kubernetes that updates multiple test/images subdirectories, or otherwise induce a high volume of image-pushing jobs on k8s-infra-prow-build-trusted
Ignore whether you bump into the concurrent builds quota (also a GCB service quota)
Can visualize usage (and whether quota is hit) here if a member of k8s-infra-prow-viewers@kubernetes.io: https://console.cloud.google.com/apis/api/cloudbuild.googleapis.com/quotas?orgonly=true&project=k8s-infra-prow-build-trusted&supportedpurview=project&pageState=(%22duration%22:(%22groupValue%22:%22P30D%22,%22customValue%22:null))
Please provide links to example occurrences, if any: Don't have link to jobs that encountered this specifically, but https://github.com/kubernetes/k8s.io/issues/1576 describes the issue, and the metric explorer link above shows roughly when we've bumped into quota.
Anything else we need to know?: Parent issue: https://github.com/kubernetes/release/issues/1869
My guess is that we need to move away from using a shared service account in the build cluster's project (gcb-builder@k8s-infra-prow-build-trusted), and instead set up service accounts per staging project.
It's unclear to me whether these would all need access to something in the build cluster project.
A service-account-per-project would add a bunch of boilerplate to the service accounts loaded into the build cluster, and add another field to job configs that needs to be set manually vs. copy-pasted. We could offset this by verifying configs are correct via presubmit enforcement.
I'm open to other suggestions to automate the boilerplate away, or a solution that involves image-builder consuming less API quota.
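As an illustration of the presubmit-enforcement idea above, a hypothetical check might look like the following; the gcb-builder@<staging-project> naming convention and the example project are assumptions for illustration, not an existing convention:

```go
package main

import "fmt"

// checkJobServiceAccount is a hypothetical presubmit-style check: a job that
// pushes to a given staging project should run as that project's own builder
// service account rather than the shared one.
func checkJobServiceAccount(stagingProject, serviceAccount string) error {
	want := fmt.Sprintf("gcb-builder@%s.iam.gserviceaccount.com", stagingProject)
	if serviceAccount != want {
		return fmt.Errorf("job pushing to %s uses %s, want %s", stagingProject, serviceAccount, want)
	}
	return nil
}

func main() {
	// Example: a job pushing to a (hypothetical) k8s-staging-foo project that
	// is still configured with the shared account would fail the check.
	err := checkJobServiceAccount("k8s-staging-foo",
		"gcb-builder@k8s-infra-prow-build-trusted.iam.gserviceaccount.com")
	fmt.Println(err)
}
```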
/milestone v1.21 /priority important-soon /wg k8s-infra /sig testing /area images /sig release /area release-eng /assign @cpanato @justaugustus as owners of parent issue