Open dduportal opened 4 months ago
* (to validate) Use `skipInitialBuildOnFirstBranchIndexing()` for MB folders (ref. https://community.jenkins.io/t/multibranch-and-configuration-reload-triggering-rebuild-branch-indexing/9590/2)
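To make the idea concrete, here is a minimal Job DSL sketch of that trait, assuming the `basic-branch-build-strategies` plugin and Job DSL's dynamic API; the repository name `docker-foo` and the credential ID are hypothetical placeholders:

```groovy
// Hypothetical repository; the buildStrategies block comes from the
// basic-branch-build-strategies plugin via Job DSL's dynamic API.
multibranchPipelineJob('docker-jobs/docker-foo') {
    branchSources {
        branchSource {
            source {
                github {
                    id('docker-foo') // arbitrary stable ID
                    repoOwner('jenkins-infra')
                    repository('docker-foo')
                    credentialsId('github-app-docker-jobs') // hypothetical credential ID
                }
            }
            buildStrategies {
                // Skip the build wave triggered by the first branch indexing
                // that follows a JobDSL/JCasC reload.
                skipInitialBuildOnFirstBranchIndexing()
            }
        }
    }
}
```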
Tested this scenario on a local instance:
Todo list:
Update: after 2 weeks of usage, the amount of "reaching the rate limit" events drastically decreased. But it still happens when touching the job configurations.
After checking, it does not look like there are techniques to prevent re-indexing using Multibranch pipelines. It's only a feature on Organization folders.
We could redefine the job configuration tree using Organization folders, but at the cost of refactoring the `jenkins-jobs` helm chart. In the short term, I'll start splitting into multiple GH apps per directory to unblock us.
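For reference, a Job DSL sketch of what an Organization folder for the Docker jobs might look like, assuming the `github-branch-source` plugin's dynamic symbols; the credential ID and the `docker-.*` repository filter are hypothetical:

```groovy
// Sketch: one Organization folder scanning jenkins-infra repos,
// replacing per-repository multibranch jobs.
organizationFolder('docker-jobs') {
    organizations {
        github {
            repoOwner('jenkins-infra')
            credentialsId('github-app-docker-jobs') // hypothetical credential ID
            traits {
                // Only scan repositories matching this (hypothetical) pattern
                sourceRegexFilter {
                    regex('docker-.*')
                }
            }
        }
    }
}
```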
First try:
* Created a new GH app named `infra.ci.jenkins.io-docker-jobs`
* Only has access to the 21 repositories defined in https://github.com/jenkins-infra/kubernetes-management/blob/589bd47280d110b170c1cde1aefdb82ea585145c/config/jenkins-jobs_infra.ci.jenkins.io.yaml#L21 (except jenkinsci/acceptance-test-harness which needs a special installation or a distinct GH credential)
* Permissions are the ones defined in https://docs.cloudbees.com/docs/cloudbees-ci/latest/traditional-admin-guide/github-app-auth#_creating_the_github_app merged with the existing `infra.ci.jenkins.io` GH app, **except**:
  * "Repository -> Administration (Read Only)" => I believe we don't care about repositories being created or deleted
  * Code access is (Read Only) => it was (Read-Write) on the current GH app, but I would like to improve safety and forbid any kind of code change **unless** a specific credential is used (enforced by the pipeline library)
Initial test looks good:
=> let's roll to persist the change!
Update:
=> That won't prevent the "Scan Repo" wave, but it should greatly reduce the API rate limit consumption!
(tip: go to "Update credential" and click the "Test Connection" button, which shows the current rate limit => different limits per installed GH app in the organization)
The OpenTelemetry plugin can export metrics showing the rate limit usage over time, which could be sent to Datadog. In case you're interested, that's how we monitor GitHub App rate limits.
Thanks Tim! Good to know in case we hit the rate limit again!
* Restricted the existing infra.ci GH app to fewer repositories (removed all Docker and Terraform jobs from it)
Was a bit too restrictive: the (private) repository jenkins-infra/chart-secrets was missing, so kubernetes-management had been failing for 12 hours...
Ah: found an error with the GH App credentials. The specified owner string `jenkins-infra` is changed to `jenkins` when JobDSL processes its configuration.
Incoming fix in the chart-generated JobDSL code: https://github.com/jenkins-infra/helm-charts/pull/1303
I want to run a complete JobDSL reload (incoming fix for deprecated directives in our chart) before closing this issue to confirm we are not blocked anymore
looks good!
Reopening as the Docker GH app still hits its rate limit :'(
> Need an initial validation on a test controller that we can avoid children indexation when JCasC reloads JobDSL

We run a few organisation folders and I've never come across the issue you're currently hitting (we don't use any multibranch folders)
> Need an initial validation on a test controller that we can avoid children indexation when JCasC reloads JobDSL

> We run a few organisation folders and I've never come across the issue you're currently hitting (we don't use any multibranch folders)
Yep, I believe it should work, thanks for the confirmation! We should report the issue though: Multibranch jobs should have a trait to prevent this indexing during a JobDSL reload
The situation is clearly way better than before, but we still have improvements and fixes to make:
* Must have:
  * Fix the Updatecli folder which is failing. Success criteria: remove the top-level update credential
  * Implement Organization folders for Docker jobs and Update jobs
  * Need an initial validation on a test controller that we can avoid children indexation when JCasC reloads JobDSL
* Nice to have:
  * Get rid of discovering (and building) tags for Docker jobs:
    * Change the buildAndPublishDocker pipeline library to build the tag during the principal branch semantic release, and only push a git tag for audit purposes
    * Set up the main branch to remove old builds only if they are older than 6 months (so we can keep audit logs)
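Dropping tag discovery can be expressed directly in the branch source traits. A sketch, assuming the `github-branch-source` plugin's dynamic Job DSL symbols; the job name is hypothetical:

```groovy
multibranchPipelineJob('docker-jobs/docker-foo') { // hypothetical job name
    branchSources {
        branchSource {
            source {
                github {
                    id('docker-foo')
                    repoOwner('jenkins-infra')
                    repository('docker-foo')
                    traits {
                        // Discover branches only: omitting gitHubTagDiscovery()
                        // means tags are neither indexed nor built, so hundreds
                        // of tags stop consuming API calls on each scan.
                        gitHubBranchDiscovery {
                            strategyId(1) // exclude branches that are also filed as PRs
                        }
                    }
                }
            }
        }
    }
}
```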
Update:
=> Now let's work on supporting Organization folders
Update: the situation has improved but we're not out of the woods yet.
The key consumer is the "tags discovery" of MB jobs. We have hundreds of tags for each docker-* (and packer-images) repository, and all of them currently use a "push tag -> trigger build -> release" flow, so changing this behavior needs work.
Splitting the GH app into 4 (Docker, Updatecli, Terraform, others) helped unblock the kubernetes and terraform runs, which decreases the priority here.
Work in progress on implementing a JobDSL-defined Organization (scanning) Folder in our custom Helm chart. Worth working on:
Work in progress on fine-tuning the current MB jobs indexing: caught (while working on org folders) that we try to index them every 2 hours. Setting a lower frequency would also spread the rate limit usage!
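Lowering the indexing frequency is a one-liner per job in Job DSL, assuming the `cloudbees-folder` plugin's periodic trigger symbol; the job name and the daily interval are illustrative choices:

```groovy
multibranchPipelineJob('docker-jobs/docker-foo') { // hypothetical job name
    // Re-index once a day instead of every 2 hours to spread
    // the GitHub API usage over time (cloudbees-folder periodic trigger).
    triggers {
        periodicFolderTrigger {
            interval('1d')
        }
    }
}
```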
Next steps is heavier but would be worth it:
Minor update: https://github.com/jenkins-infra/kubernetes-management/pull/5640 should decrease the "waves" of indexing when reloading to only 1 wave
Update:
* The PR [Bump `jenkins` helm chart version to 5.5.13 kubernetes-management#5644](https://github.com/jenkins-infra/kubernetes-management/pull/5644) did trigger a restart of infra.ci which never went back online.
* Logs were showing the following JobDSL error:
  ```
  groovy.lang.MissingMethodException: No signature of method: javaposse.jobdsl.dsl.Folder.multibranchPipelineJob() is applicable for argument types: (java.lang.String, script$_run_closure2) values: [docker-jobs/xxx, script$_run_closure2@19995442]
  ```
* Short term fix: reverted the 2 previous `jenkins-jobs` chart bumps ([Bump `jenkins-jobs` helm chart version to 2.1.2 kubernetes-management#5640](https://github.com/jenkins-infra/kubernetes-management/pull/5640) and [Bump `jenkins-jobs` helm chart version to 2.2.0 kubernetes-management#5641](https://github.com/jenkins-infra/kubernetes-management/pull/5641)) to get infra.ci back online.
* Current status: infra.ci is up but the kubernetes management job is failing due to a helm state inconsistency ("update/rollback in progress"). Keeping it like this until a fix is found and tested.
* I can reproduce the error only with the helm chart on a local k3d: working on it
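That exception means `multibranchPipelineJob()` ended up being invoked inside a `folder(...)` closure, which Job DSL's Folder context does not support; here the trigger was likely how the chart rendered the generated script (the thread later traces it to a Helm whitespace rendering issue). A sketch of the correct shape, using the names from the error message:

```groovy
// Not supported: Job DSL's Folder context has no multibranchPipelineJob() method
// folder('docker-jobs') {
//     multibranchPipelineJob('xxx') { /* ... */ } // -> MissingMethodException
// }

// Supported: declare the folder, then the job with its path-qualified name
folder('docker-jobs')

multibranchPipelineJob('docker-jobs/xxx') {
    // branch sources, triggers, etc.
}
```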
OH: https://github.com/helm/helm/issues/5446#issuecomment-866994215
However, removing any instance of a space before a newline fixed it.
Looks like the infra.ci errors are gone 👍
Service(s)
infra.ci.jenkins.io
Summary
Every time we apply a new JCasC change to infra.ci.jenkins.io, the JobDSL refresh (reapply?) kicks off a lot of Multibranch pipeline scanning, which makes us reach the "imposed" GH rate limit, locking builds for minutes or sometimes even 1 hour (the time to get a renewed rate limit).
This unwanted behavior is slowing us down almost every week now: a fix or compensating measure is required.
There are multiple solutions here (to be discussed and reported in the team meeting when triaging this issue):

* (to validate) Use `skipInitialBuildOnFirstBranchIndexing()` for MB folders (ref. https://community.jenkins.io/t/multibranch-and-configuration-reload-triggering-rebuild-branch-indexing/9590/2)

Reproduction steps

No response