[infra.ci.jenkins.io] Builds stucks due to GH API rate limit

dduportal commented 4 months ago

Service(s)

infra.ci.jenkins.io

Summary

Every time we apply a new JCasc change to infra.ci.jenkins.io, the JobDSL refresh (reapply?) kicks a lot of Multibranch pipeline scanning, which makes us reaching the "imposed" GH rate limit, locking builds for minutes or even 1 hour sometimes (time to get a renewed rate limit).

This unwanted behavior is slowing us down almost every weeks now: a fix or compensation measure is required.

There are multiple solutions here (to be discussed and reported in team meeting when triaging this issue)

(to validate) Use skipInitialBuildOnFirstBranchIndexing() for MB folders (ref. https://community.jenkins.io/t/multibranch-and-configuration-reload-triggering-rebuild-branch-indexing/9590/2)
Disable tag discovery on all jobs not requiring it
Use multiple GitHub Apps to spread the load between jobs
Others?

Reproduction steps

No response

dduportal commented 3 months ago

* (to validate) Use `skipInitialBuildOnFirstBranchIndexing()` for MB folders (ref. https://community.jenkins.io/t/multibranch-and-configuration-reload-triggering-rebuild-branch-indexing/9590/2)

Tested this scenario on a local instance:

Once the first MB scanning is finished with success, the following JCasc reload did NOT trigger anymore scanning (unless triggerered manually or by a cron).
I was able to try a git hook (easy one) and it triggerered a scan and build. Let's see how it behaves with the GitHub webhooks on production for infra.ci.

Todo list:

[x] Implement the chart JobDSL code: https://github.com/jenkins-infra/helm-charts/pull/1244
[x] Deploy the new setup on infra.ci.jenkins.io
[x] Validate the new behavior when reloading JCAsc after the first initial wave

dduportal commented 3 months ago

Update: after 2 weeks of usage, the amount of "reaching the rate limit" drastically decreased. But it still happen when touching the jobs configuration.

After checking, it does not look like there are techniques to prevent re-indexing using Multibranch pipelines. It's only a feature on Organization folders.

We could redefined the jobs configuration tree using Organization folder but at the cost of:

Grouping jobs with the exact same MB level settings per top-level Org folder (as we cannot have configuration exception).
- Good for Terraform and Docker jobs. Eventually for the websites PR. But bad for all other jobs.
And implementing the GH org. folder in our jenkins-jobs helm chart

On short term, I'll start splitting into multiple GH app per directories to unblock us.

dduportal commented 3 months ago

First try:

Created a new GH app named infra.ci.jenkins.io-docker-jobs
Only has access to 21 repositories defined in https://github.com/jenkins-infra/kubernetes-management/blob/589bd47280d110b170c1cde1aefdb82ea585145c/config/jenkins-jobs_infra.ci.jenkins.io.yaml#L21 (except jenkinsci/acceptance-test-harness which needs a special installation or a distinct GH credential)

Permissions are the one defined in https://docs.cloudbees.com/docs/cloudbees-ci/latest/traditional-admin-guide/github-app-auth#_creating_the_github_app merged with the existing infra.ci.jenkins.io GH app except
- "Repository -> Administration (Read Only)" => I believe we don't care about repository created or deleted
- Code access is (Read Only) => it was (Read-Write) on current GH app but I would like to improve safety and forbid any kind of code change unless a specific credential is used (enforced by pipeline library).

dduportal commented 3 months ago

First try:

* Created a new GH app named `infra.ci.jenkins.io-docker-jobs`

* Only has access to 21 repositories defined in https://github.com/jenkins-infra/kubernetes-management/blob/589bd47280d110b170c1cde1aefdb82ea585145c/config/jenkins-jobs_infra.ci.jenkins.io.yaml#L21 (except jenkinsci/acceptance-test-harness which needs a special installation or a distinct GH credential)

* Permissions are the one defined in https://docs.cloudbees.com/docs/cloudbees-ci/latest/traditional-admin-guide/github-app-auth#_creating_the_github_app merged with the existing `infra.ci.jenkins.io` GH app **except**

  * "Repository -> Administration (Read Only)" => I believe we don't care about repository created or deleted
  * Code access is (Read Only) => it was (Read-Write) on current GH app but I would like to improve safety and forbid any kind of code change **unless** a specific credential is used (enforced by pipeline library).

Initial test looks good:

Set up (manually) the new Jenkins credential of type "GitHub App" in the "Docker Jobs" directory scope (so it's only available for Docker Jobs)
Changed the accout-app job to use the new folder-scoped repository and triggered a new scan
Checked the API rate limits were evolving differently between the top-level repository and the new one (tip: go on "Update credential" and click the "Test Connection" button which shows the current rate limit => different limits per instralled GH app in the organization
Ran a full build on the principal branch which triggered a tag push and a new release.

=> let's roll to persist the change!

dduportal commented 3 months ago

Update:

Set up custom GH App for both Docker and Terraform Jobs
Cleaned up GH APP now unsued (KodiakHQ, Octobox, Slack)
Restricted the existing infra.ci GH app to less repositories (removed all Docker and Terraform Jobs from it)

=> That won't prevent the "Scan Repo" wave but it should prevent the API rate limit greatly!

timja commented 3 months ago

(tip: go on "Update credential" and click the "Test Connection" button which shows the current rate limit => different limits per instralled GH app in the organization

Open telemetry plugin can export metrics which shows the rate limit usage over time. Which could be sent to datadog. In case you're interested, (it's how we monitor GitHub app rate limits)

dduportal commented 3 months ago

(tip: go on "Update credential" and click the "Test Connection" button which shows the current rate limit => different limits per instralled GH app in the organization

Open telemetry plugin can export metrics which shows the rate limit usage over time. Which could be sent to datadog. In case you're interested, (it's how we monitor GitHub app rate limits)

Thanks Tim! Good to know if we hit the rate limit again!

dduportal commented 3 months ago

* Restricted the existing infra.ci GH app to less repositories (removed all Docker and Terraform Jobs from it)

Was a bit too restrictive: the (private) repository jenkins-infra/chart-secrets was missing so kubernetes-management was failing since 12h...

dduportal commented 3 months ago

Ah: found an error with the GH App credentials. The specified owner string jenkins-infra is changed to jenkins when JobDSL processes its configuration.

Incoming fix in the chart-generated JobDSL code: https://github.com/jenkins-infra/helm-charts/pull/1303

dduportal commented 3 months ago

All builds are now green \o/ (after the fix https://github.com/jenkins-infra/helm-charts/pull/1303)
No more builds blocked: I tried to reload JCasc 4 times manually and immediately triggered a terraform build (production):
- The Terraform build was treated immediately
- No API rate limit issues anymore

I want to run a complete JobDSL reload (incoming fix for deprecated directives in our chart) before closing this issue to confirm we are not blocked anymore

dduportal commented 3 months ago

looks good!

dduportal commented 3 months ago

Reopening as the Docker GH app still hits its rate limit :'(

dduportal commented 3 months ago

The situation is clearly way better than before, but we still have improvements and fixes to make:

Must have:
- Fix the Updatecli folder which is failing. Success criteria is to remove the top-level update credential
- Implement Organization folder for Docker Jobs and Update Jobs
- Need an initial validation on a test controller that we can avoid children indexation when JCasc reload JobDSL
Nice to have:
- Get rid of discovering (and building) tags for Docker Jobs:
- Changes the buildAndPublishDocker pipeline library to build tag during the principal branch semantic release and only push a git tag for audit purpose
- Set up main branch to remove old builds only if they are older than 6 month (so we can keep audit logs)

timja commented 3 months ago

Need an initial validation on a test controller that we can avoid children indexation when JCasc reload JobDSL

We run a few organisation folders and I've never came across the issue you're currently hitting (we don't use any multibranch folders)

dduportal commented 3 months ago

Need an initial validation on a test controller that we can avoid children indexation when JCasc reload JobDSL

We run a few organisation folders and I've never came across the issue you're currently hitting (we don't use any multibranch folders)

Yep, I believe it should work, thanks for the confirmation! We should report the issue though: Multibranch jobs should have a trait to prevent this indexing during a JobDSL reload

dduportal commented 3 months ago

The situation is clearly way better than before, but we still have improvements and fixes to make:

* Must have:

  * Fix the Updatecli folder which is failing. Success criteria is to remove the top-level update credential
  * Implement Organization folder for Docker Jobs and Update Jobs

    * Need an initial validation on a test controller that we can avoid children indexation when JCasc reload JobDSL

* Nice to have:

  * Get rid of discovering (and building) tags for Docker Jobs:

    * Changes the buildAndPublishDocker pipeline library to build tag during the principal branch semantic release and only push a git tag for audit purpose
    * Set up main branch to remove old builds only if they are older than 6 month (so we can keep audit logs)

Update:

Updatecli, Terraform and Docker jobs are using their respective per-folder credentials now!

=> Now let's work on support for Org folder

dduportal commented 2 months ago

Update: situation is improved but we're not out of the wood yet.

The key consumer is the "tags discovery" of MB jobs. Since we have hundreds of tags for each Docker-* (and packer-images) repository, all of them using a "push tag -> trigger build -> release" currently so need work to change this behavior
Splitting the GH app into 4 (Docker, Updatecli, Terraform, others) helped to unblock the kubernetes and terraform runs which decrease the priority here
- Good thing for splitting permissions, bad thing for managing these apps and it decreases the rate limit for each
- We need to add others: 1 for Kubernetes (includes a private repo), another for the websites PR and a last one websites deployments
Work in progress in implement JobDSL-defined Organization (scanning) Folder in our custom Helm chart. Worth working on it:
- Needed to decrease the amount of indexing scans of MB children jobs for: Docker, Terraform and Updatecli (additionally: simplify jobs setup)
- Required for updatecli pipelines separation
- Required for https://github.com/jenkins-infra/helpdesk/issues/3071
Work in progress in fine tuning the current MB jobs indexing: caught (while working on org folders) that we try to index them every 2 hours. Setting a lower frequency would also spread the rate limit usage!

Next steps is heavier but would be worth it:

Redefine the Docker build process (https://github.com/jenkins-infra/pipeline-library/blob/master/vars/buildDockerAndPublishImage.groovy)
- Need an issue in the repository to sustain the discussion
- Goal: change the release pattern to avoid discovering tags in infra.ci Jobs to sustain years of building images
- Part of this discussion must include the Garbage collection: how much time do we want to keep:
- Docker Image tags?
- GitHub Tags and releases for these Docker images?
- Release jobs logs for these images?

dduportal commented 2 months ago

Minor update: https://github.com/jenkins-infra/kubernetes-management/pull/5640 should decrease the "waves" of indexing when reloading to only 1 wave

dduportal commented 2 months ago

Update:

The PR https://github.com/jenkins-infra/kubernetes-management/pull/5644 did trigger a restart of infra.ci which never went back online.

Logs were showing the following JobDSL errors:

groovy.lang.MissingMethodException: No signature of method: javaposse.jobdsl.dsl.Folder.multibranchPipelineJob() is applicable for argument types: (java.lang.String, script$_run_closure2) values: [docker-jobs/xxx, script$_run_closure2@19995442]

Short term fix: reverted back the 2 previous jenkins-jobs chart (https://github.com/jenkins-infra/kubernetes-management/pull/5640 and https://github.com/jenkins-infra/kubernetes-management/pull/5641) to have infra.ci back online.
Current status: infra.ci is up but the kubernetes management jobs is failing due to helm state inconsistency "update/rollback in progress). Keeping like this until a fix is found and tested.
I can reproduce the error only with the helm chart on a local k3d: working on it

dduportal commented 2 months ago

Update:

* The PR [Bump `jenkins` helm chart version to 5.5.13 kubernetes-management#5644](https://github.com/jenkins-infra/kubernetes-management/pull/5644) did trigger a restart of infra.ci which never went back online.

* Logs were showing the following JobDSL errors:
  ```
  groovy.lang.MissingMethodException: No signature of method: javaposse.jobdsl.dsl.Folder.multibranchPipelineJob() is applicable for argument types: (java.lang.String, script$_run_closure2) values: [docker-jobs/xxx, script$_run_closure2@19995442]
  ```

* Short term fix: reverted back the 2 previous `jenkins-jobs` chart ([Bump `jenkins-jobs` helm chart version to 2.1.2 kubernetes-management#5640](https://github.com/jenkins-infra/kubernetes-management/pull/5640) and [Bump `jenkins-jobs` helm chart version to 2.2.0 kubernetes-management#5641](https://github.com/jenkins-infra/kubernetes-management/pull/5641)) to have infra.ci back online.

* Current status: infra.ci is up but the kubernetes management jobs is failing due to helm state inconsistency "update/rollback in progress). Keeping like this until a fix is found and tested.

* I can reproduce the error only with the helm chart on a local k3d: working on it

OH: https://github.com/helm/helm/issues/5446#issuecomment-866994215

However removing any instance of a space before a newline fixed it.

dduportal commented 2 months ago

Looks like the infra.ci errors are good 👍

jenkins-infra / helpdesk