jenkins-infra / helpdesk

Open your Infrastructure related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose
17 stars 10 forks source link

[infra.ci.jenkins.io] Builds stucks due to GH API rate limit #4165

Open dduportal opened 4 months ago

dduportal commented 4 months ago

Service(s)

infra.ci.jenkins.io

Summary

Every time we apply a new JCasc change to infra.ci.jenkins.io, the JobDSL refresh (reapply?) kicks a lot of Multibranch pipeline scanning, which makes us reaching the "imposed" GH rate limit, locking builds for minutes or even 1 hour sometimes (time to get a renewed rate limit).

This unwanted behavior is slowing us down almost every weeks now: a fix or compensation measure is required.


There are multiple solutions here (to be discussed and reported in team meeting when triaging this issue)

Reproduction steps

No response

dduportal commented 3 months ago
* (to validate) Use `skipInitialBuildOnFirstBranchIndexing()` for MB folders (ref. https://community.jenkins.io/t/multibranch-and-configuration-reload-triggering-rebuild-branch-indexing/9590/2)

Tested this scenario on a local instance:

Todo list:

dduportal commented 3 months ago

Update: after 2 weeks of usage, the amount of "reaching the rate limit" drastically decreased. But it still happen when touching the jobs configuration.

After checking, it does not look like there are techniques to prevent re-indexing using Multibranch pipelines. It's only a feature on Organization folders.

We could redefined the jobs configuration tree using Organization folder but at the cost of:

On short term, I'll start splitting into multiple GH app per directories to unblock us.

dduportal commented 3 months ago

First try:

Capture d’écran 2024-08-21 à 17 17 14
dduportal commented 3 months ago

First try:

* Created a new GH app named `infra.ci.jenkins.io-docker-jobs`

* Only has access to 21 repositories defined in https://github.com/jenkins-infra/kubernetes-management/blob/589bd47280d110b170c1cde1aefdb82ea585145c/config/jenkins-jobs_infra.ci.jenkins.io.yaml#L21 (except jenkinsci/acceptance-test-harness which needs a special installation or a distinct GH credential)
Capture d’écran 2024-08-21 à 17 17 14
* Permissions are the one defined in https://docs.cloudbees.com/docs/cloudbees-ci/latest/traditional-admin-guide/github-app-auth#_creating_the_github_app merged with the existing `infra.ci.jenkins.io` GH app **except**

  * "Repository -> Administration (Read Only)" => I believe we don't care about repository created or deleted
  * Code access is (Read Only) => it was (Read-Write) on current GH app but I would like to improve safety and forbid any kind of code change **unless** a specific credential is used (enforced by pipeline library).

Initial test looks good:

=> let's roll to persist the change!

dduportal commented 3 months ago

Update:

=> That won't prevent the "Scan Repo" wave but it should prevent the API rate limit greatly!

timja commented 3 months ago

(tip: go on "Update credential" and click the "Test Connection" button which shows the current rate limit => different limits per instralled GH app in the organization

Open telemetry plugin can export metrics which shows the rate limit usage over time. Which could be sent to datadog. In case you're interested, (it's how we monitor GitHub app rate limits)

dduportal commented 3 months ago

(tip: go on "Update credential" and click the "Test Connection" button which shows the current rate limit => different limits per instralled GH app in the organization

Open telemetry plugin can export metrics which shows the rate limit usage over time. Which could be sent to datadog. In case you're interested, (it's how we monitor GitHub app rate limits)

Thanks Tim! Good to know if we hit the rate limit again!

dduportal commented 3 months ago
* Restricted the existing infra.ci GH app to less repositories (removed all Docker and Terraform Jobs from it)

Was a bit too restrictive: the (private) repository jenkins-infra/chart-secrets was missing so kubernetes-management was failing since 12h...

dduportal commented 3 months ago

Ah: found an error with the GH App credentials. The specified owner string jenkins-infra is changed to jenkins when JobDSL processes its configuration.

Incoming fix in the chart-generated JobDSL code: https://github.com/jenkins-infra/helm-charts/pull/1303

dduportal commented 3 months ago

I want to run a complete JobDSL reload (incoming fix for deprecated directives in our chart) before closing this issue to confirm we are not blocked anymore

dduportal commented 3 months ago

looks good!

dduportal commented 3 months ago

Reopening as the Docker GH app still hits its rate limit :'(

dduportal commented 3 months ago

The situation is clearly way better than before, but we still have improvements and fixes to make:

timja commented 3 months ago

Need an initial validation on a test controller that we can avoid children indexation when JCasc reload JobDSL

We run a few organisation folders and I've never came across the issue you're currently hitting (we don't use any multibranch folders)

dduportal commented 3 months ago

Need an initial validation on a test controller that we can avoid children indexation when JCasc reload JobDSL

We run a few organisation folders and I've never came across the issue you're currently hitting (we don't use any multibranch folders)

Yep, I believe it should work, thanks for the confirmation! We should report the issue though: Multibranch jobs should have a trait to prevent this indexing during a JobDSL reload

dduportal commented 3 months ago

The situation is clearly way better than before, but we still have improvements and fixes to make:

* Must have:

  * Fix the Updatecli folder which is failing. Success criteria is to remove the top-level update credential
  * Implement Organization folder for Docker Jobs and Update Jobs

    * Need an initial validation on a test controller that we can avoid children indexation when JCasc reload JobDSL

* Nice to have:

  * Get rid of discovering (and building) tags for Docker Jobs:

    * Changes the buildAndPublishDocker pipeline library to build tag during the principal branch semantic release and only push a git tag for audit purpose
    * Set up main branch to remove old builds only if they are older than 6 month (so we can keep audit logs)

Update:

=> Now let's work on support for Org folder

dduportal commented 2 months ago

Update: situation is improved but we're not out of the wood yet.

Next steps is heavier but would be worth it:

dduportal commented 2 months ago

Minor update: https://github.com/jenkins-infra/kubernetes-management/pull/5640 should decrease the "waves" of indexing when reloading to only 1 wave

dduportal commented 2 months ago

Update:

dduportal commented 2 months ago

Update:

* The PR [Bump `jenkins` helm chart version to 5.5.13 kubernetes-management#5644](https://github.com/jenkins-infra/kubernetes-management/pull/5644) did trigger a restart of infra.ci which never went back online.

* Logs were showing the following JobDSL errors:
  ```
  groovy.lang.MissingMethodException: No signature of method: javaposse.jobdsl.dsl.Folder.multibranchPipelineJob() is applicable for argument types: (java.lang.String, script$_run_closure2) values: [docker-jobs/xxx, script$_run_closure2@19995442]
  ```

* Short term fix: reverted back the 2 previous `jenkins-jobs` chart ([Bump `jenkins-jobs` helm chart version to 2.1.2 kubernetes-management#5640](https://github.com/jenkins-infra/kubernetes-management/pull/5640) and [Bump `jenkins-jobs` helm chart version to 2.2.0 kubernetes-management#5641](https://github.com/jenkins-infra/kubernetes-management/pull/5641)) to have infra.ci back online.

* Current status: infra.ci is up but the kubernetes management jobs is failing due to helm state inconsistency "update/rollback in progress). Keeping like this until a fix is found and tested.

* I can reproduce the error only with the helm chart on a local k3d: working on it

OH: https://github.com/helm/helm/issues/5446#issuecomment-866994215

However removing any instance of a space before a newline fixed it.

dduportal commented 2 months ago

Looks like the infra.ci errors are good 👍