jenkinsci / kubernetes-operator

Kubernetes native Jenkins Operator
https://jenkinsci.github.io/kubernetes-operator

Seed job fails after Jenkins restart with backups enabled #607

Open Bakies opened 3 years ago

Bakies commented 3 years ago

Describe the bug

After my Jenkins controller restarts, the seed job fails to start. I think there is a race condition between the restore job and the seed job starting. Because the seed job is configured in JCasC, the job is probably set up before the restore runs and creates the nextBuildNumber file containing 1, and the restore may not overwrite it. It takes a long time, if it happens at all, before the seed job runs and restores the config for the rest of the jobs.

I'm currently thinking I will just exclude the seed jobs from backups. I don't think I particularly care about their history.
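
One way to check this hypothesis on an affected controller is to compare a job's nextBuildNumber with the build directories already on disk. The following is a minimal sketch, not part of the operator: the function name check_job_counter is hypothetical, and it assumes the layout visible in the log paths (jobs/<name>/builds/<number> next to jobs/<name>/nextBuildNumber).

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a job's nextBuildNumber points at a
# build directory that already exists on disk.
# Usage: check_job_counter /var/lib/jenkins/jobs/<job-name>
check_job_counter() {
  local job_dir="$1"
  local next=1 highest=0 path name

  # nextBuildNumber holds a single integer; treat a missing or empty file as 1.
  if [ -f "${job_dir}/nextBuildNumber" ]; then
    next="$(tr -d '[:space:]' < "${job_dir}/nextBuildNumber")"
    next="${next:-1}"
  fi

  # Find the highest purely numeric entry under builds/.
  for path in "${job_dir}"/builds/*; do
    name="${path##*/}"
    case "${name}" in
      ''|*[!0-9]*) continue ;;  # skip non-numeric names and an unmatched glob
    esac
    if [ "${name}" -gt "${highest}" ]; then
      highest="${name}"
    fi
  done

  # A counter at or below an existing build number would collide on the next run.
  if [ "${next}" -le "${highest}" ]; then
    echo "CONFLICT: nextBuildNumber=${next} but builds/${highest} exists"
    return 1
  fi
  echo "OK: nextBuildNumber=${next}, highest build=${highest}"
}
```

Running it against the seed job directory right after a pod restart should show whether the counter was reset while the restored build directories remained.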

To Reproduce

1. Configure a seed job in JCasC.
2. Run it a few times.
3. Delete the Jenkins pod.

Additional information

Kubernetes version: 1.19
Jenkins Operator version: v0.5.0

jenkins-master container logs:

2021-07-28 19:59:25.874+0000 [id=146]   WARNING j.model.lazy.LazyBuildMixIn#newBuild: A new build could not be created in job github-job-dsl-seed
java.lang.IllegalStateException: JENKINS-23152: /var/lib/jenkins/jobs/github-job-dsl-seed/builds/1 already existed; will not overwrite with github-job-dsl-seed #1
        at hudson.model.RunMap.put(RunMap.java:189)
        at jenkins.model.lazy.LazyBuildMixIn.newBuild(LazyBuildMixIn.java:182)
        at hudson.model.AbstractProject.newBuild(AbstractProject.java:963)
        at hudson.model.AbstractProject.createExecutable(AbstractProject.java:1139)
        at hudson.model.AbstractProject.createExecutable(AbstractProject.java:138)
        at hudson.model.Executor$1.call(Executor.java:365)
        at hudson.model.Executor$1.call(Executor.java:347)
        at hudson.model.Queue._withLock(Queue.java:1443)
        at hudson.model.Queue.withLock(Queue.java:1304)
        at hudson.model.Executor.run(Executor.java:347)
2021-07-28 19:59:25.875+0000 [id=146]   SEVERE  hudson.model.Executor#run: Executor #4 for seed-job-agent: Unexpected executor death
java.lang.IllegalStateException: JENKINS-23152: /var/lib/jenkins/jobs/github-job-dsl-seed/builds/1 already existed; will not overwrite with github-job-dsl-seed #1
        at hudson.model.RunMap.put(RunMap.java:189)
        at jenkins.model.lazy.LazyBuildMixIn.newBuild(LazyBuildMixIn.java:182)
Caused: java.lang.Error
        at jenkins.model.lazy.LazyBuildMixIn.newBuild(LazyBuildMixIn.java:190)
        at hudson.model.AbstractProject.newBuild(AbstractProject.java:963)
        at hudson.model.AbstractProject.createExecutable(AbstractProject.java:1139)
        at hudson.model.AbstractProject.createExecutable(AbstractProject.java:138)
        at hudson.model.Executor$1.call(Executor.java:365)
        at hudson.model.Executor$1.call(Executor.java:347)
        at hudson.model.Queue._withLock(Queue.java:1443)
        at hudson.model.Queue.withLock(Queue.java:1304)
        at hudson.model.Executor.run(Executor.java:347)
2021-07-28 19:59:54.148+0000 [id=144]   WARNING j.model.lazy.LazyBuildMixIn#newBuild: A new build could not be created in job github-job-dsl-seed
java.lang.IllegalStateException: JENKINS-23152: /var/lib/jenkins/jobs/github-job-dsl-seed/builds/2 already existed; will not overwrite with github-job-dsl-seed #2
        at hudson.model.RunMap.put(RunMap.java:189)
        at jenkins.model.lazy.LazyBuildMixIn.newBuild(LazyBuildMixIn.java:182)
        at hudson.model.AbstractProject.newBuild(AbstractProject.java:963)
        at hudson.model.AbstractProject.createExecutable(AbstractProject.java:1139)
        at hudson.model.AbstractProject.createExecutable(AbstractProject.java:138)
        at hudson.model.Executor$1.call(Executor.java:365)
        at hudson.model.Executor$1.call(Executor.java:347)
        at hudson.model.Queue._withLock(Queue.java:1443)
        at hudson.model.Queue.withLock(Queue.java:1304)
        at hudson.model.Executor.run(Executor.java:347)

Bakies commented 3 years ago

Doesn't seem totally consistent, probably some race condition somewhere.

justyns commented 3 years ago

We also noticed a similar issue. The seed job starts before the restore job finishes. As a result, Jenkins tries to re-index repos that it doesn't need to.

If you go to Manage Jenkins and click the "Reload Configuration from Disk" button, it fixes the error @Bakies posted - but a better solution imo would be for the seed job to wait until the restore process is finished before triggering.
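
The manual workaround can also be scripted. The sketch below uses the Jenkins CLI's reload-configuration command, which discards the in-memory state and reloads everything from the file system; as far as I know this is the same reload the button performs. The function name reload_jenkins_config, the DRY_RUN switch, and the environment variable names and their defaults are assumptions for illustration. It does not wait for the operator's restore to finish, so it would have to be run after the restore has completed.

```shell
#!/usr/bin/env bash
# Sketch: trigger a "reload from disk" without clicking through the UI.
# JENKINS_URL, JENKINS_USER, JENKINS_TOKEN and CLI_JAR are placeholder
# names (assumptions), not settings defined by the operator.
reload_jenkins_config() {
  local url="${JENKINS_URL:-http://localhost:8080}"
  local jar="${CLI_JAR:-jenkins-cli.jar}"
  local user="${JENKINS_USER:-admin}"
  local token="${JENKINS_TOKEN:-changeme}"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command with the token masked instead of running it.
    echo "java -jar ${jar} -s ${url} -auth ${user}:*** reload-configuration"
    return 0
  fi

  # -s selects the controller URL, -auth passes user:token credentials.
  java -jar "${jar}" -s "${url}" -auth "${user}:${token}" reload-configuration
}
```

Setting DRY_RUN=1 prints the command that would be executed, which is useful for checking the URL and user before pointing it at a real controller.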

Bakies commented 3 years ago

Ah, thanks for that workaround, that button should be helpful to me : )

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this issue is still affecting you, just comment with any updates and we'll keep it open. Thank you for your contributions.

cwitthaus commented 2 years ago

I encountered this recently as well. I have solved it for now by slightly changing the backup.sh script. I added --exclude jobs/*job-dsl-seed to the tar command and overwrote the scripts in the default backup container by following the process in https://jenkinsci.github.io/kubernetes-operator/docs/getting-started/latest/custom-backup-and-restore/.
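
For anyone who wants to try the exclusion before changing the backup container, here is a minimal, self-contained sketch of the tar pattern. It is not the operator's backup.sh: backup_jobs is a hypothetical helper, and the flags other than --exclude are assumptions chosen to keep the example small.

```shell
#!/usr/bin/env bash
# Sketch: archive the jobs/ tree while skipping seed-job directories.
# backup_jobs is a hypothetical helper, not the operator's backup.sh.
# Usage: backup_jobs <jenkins_home> <archive.tar.gz>
backup_jobs() {
  local jenkins_home="$1" archive="$2"

  # Quote the pattern so the shell passes it to tar unexpanded.
  # Any directory under jobs/ whose name ends in "job-dsl-seed" is skipped,
  # together with its builds/ history and nextBuildNumber file.
  tar -C "${jenkins_home}" -czf "${archive}" \
    --exclude='jobs/*job-dsl-seed' \
    jobs
}
```

Listing the resulting archive with tar -tzf should show the regular jobs but no seed-job directories, so a restore leaves the freshly created seed job and its build counter untouched.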

Harguer commented 2 years ago

Hi! I was wondering if there is an update on this. I'm facing the same issue: when my Jenkins pod dies, the new pod won't trigger the seed jobs and fails with this error.

2022-06-13 13:37:53.596+0000 [id=97]    WARNING j.model.lazy.LazyBuildMixIn#newBuild: A new build could not be created in job seed-jobs-job-dsl-seed
java.lang.IllegalStateException: JENKINS-23152: /var/lib/jenkins/jobs/seed-jobs-job-dsl-seed/builds/1 already existed; will not overwrite with seed-jobs-job-dsl-seed #1
    at hudson.model.RunMap.put(RunMap.java:194)

emyes commented 2 years ago

Are there any updates on this issue? We are facing similar problems on our end.

Bakies commented 1 year ago

Anti-stale bot comment

michalgoldys commented 1 year ago

I can confirm that we've got the same problem:

2022-10-21 13:43:15.839+0000 [id=108]   WARNING j.model.lazy.LazyBuildMixIn#newBuild: A new build could not be created in job jenkins-operator-seed-job-dsl-seed
java.lang.IllegalStateException: JENKINS-23152: /var/lib/jenkins/jobs/jenkins-operator-seed-job-dsl-seed/builds/1 already existed; will not overwrite with jenkins-operator-seed-job-dsl-seed #1

Reloading the configuration from disk via the corresponding option in the Jenkins settings works around the problem, but that shouldn't be necessary after restarting the jenkins-master pod.

Image: jenkins/jenkins:2.346.2-lts-alpine
Kubernetes operator helm chart version: 0.6.2

brokenpip3 commented 1 year ago

> I encountered this recently as well. I have solved it for now by slightly changing the backup.sh script. I added --exclude jobs/*job-dsl-seed to the tar command and overwrote the scripts in the default backup container by following the process in https://jenkinsci.github.io/kubernetes-operator/docs/getting-started/latest/custom-backup-and-restore/.

This ^ should be the solution here: excluding the seed jobs (with a glob pattern) from the history backup.

Adding good-first-issue.