jenkinsci / templating-engine-plugin

create tool-agnostic, templated pipelines to be shared by multiple teams
https://jenkinsci.github.io/templating-engine-plugin/latest/
Apache License 2.0

[Bug]: JTE Pipelines resuming execution after successful run #328

Open brosmar opened 1 year ago

brosmar commented 1 year ago

Jenkins Version

CloudBees CI Client Controller Latest 2.414.2.2-rolling

JTE Version

2.5.3

Bug Description

Same issue as: https://github.com/jenkinsci/templating-engine-plugin/issues/309 https://github.com/jenkinsci/templating-engine-plugin/issues/187

If more than three people report the same problematic behavior, then the issue should not be closed.

[Screenshot: JTE build results marked red]

All of the jobs were formerly green, like the job in the first row.

Relevant log output

And even though the job result was successful, the JTE templating job is arbitrarily restarted:

10:52:37  stepFailed: false
10:52:37  result: null
10:52:37  current: SUCCESS
10:52:37  -------------------------------------------------------------------------------------------------
10:52:37  end Notify step null/null (Lifecycle Hook)
10:52:37  -------------------------------------------------------------------------------------------------
10:52:37  -------------------------------------------------------------------------------------------------
10:52:37  [Pipeline] End of Pipeline
10:52:37  Finished: SUCCESS
09:06:15  Resuming build at Sat Sep 23 09:06:15 CEST 2023 after Jenkins restart
09:06:15  [Pipeline] End of Pipeline
09:06:15  java.io.FileNotFoundException: /var/jenkins_home/jobs/MarketData/jobs/XENTRIC/jobs/visitorscenter/jobs/external-ui/jobs/build-ui/branches/develop/builds/41/program.dat (No such file or directory)
09:06:15      at java.base/java.io.FileInputStream.open0(Native Method)
09:06:15      at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
09:06:15      at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
09:06:15      at org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader.openStreamAt(RiverReader.java:196)
09:06:15      at org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader.restorePickles(RiverReader.java:140)
09:06:15      at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.loadProgramAsync(CpsFlowExecution.java:804)
09:06:15      at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:770)
09:06:15      at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:728)
09:06:15      at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:582)
09:06:15      at hudson.model.RunMap.retrieve(RunMap.java:233)
09:06:15      at hudson.model.RunMap.retrieve(RunMap.java:61)
09:06:15      at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:660)
09:06:15      at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:642)
09:06:15      at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:540)
09:06:15      at jenkins.model.lazy.LazyBuildMixIn.getBuildByNumber(LazyBuildMixIn.java:240)
09:06:15      at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:234)
09:06:15      at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:105)
09:06:15      at jenkins.model.PeepholePermalink.resolve(PeepholePermalink.java:105)
09:06:15      at hudson.model.Job.getLastCompletedBuild(Job.java:990)
09:06:15      at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$PipelineInternalCalls$1.writeTo(CpsFlowExecution.java:2052)
09:06:15      at com.cloudbees.jenkins.support.SupportPlugin.writeBundle(SupportPlugin.java:418)
09:06:15      at com.cloudbees.jenkins.support.SupportPlugin.writeBundle(SupportPlugin.java:353)
09:06:15      at com.cloudbees.jenkins.support.SupportPlugin$PeriodicWorkImpl.lambda$doRun$0(SupportPlugin.java:946)
09:06:15  Also:   org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: 121af19f-483b-46d0-8c50-87e831d00429
09:06:15  Caused: java.io.IOException: Failed to load build state
09:06:15      at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:878)
09:06:15      at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:874)
09:06:15      at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$5$1.run(CpsFlowExecution.java:951)
09:06:15      at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:38)
09:06:15      at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
09:06:15      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
09:06:15      at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
09:06:15      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
09:06:15      at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
09:06:15      at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
09:06:15      at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
09:06:15      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
09:06:15      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
09:06:15      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
09:06:15      at java.base/java.lang.Thread.run(Thread.java:829)
09:06:18  Finished: FAILURE
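The symptom in this log is a "Resuming build" line that appears after the build already reported "Finished: SUCCESS". A minimal Python sketch for spotting that pattern in saved console logs might look like the following. The function name and the list of result names are illustrative assumptions, and the matching is based only on the wording of the two log lines shown above.

```python
import re

# Patterns taken from the wording of the log excerpt above (assumption:
# other Jenkins versions may phrase these lines differently).
FINISHED = re.compile(r"Finished: (SUCCESS|UNSTABLE|FAILURE|ABORTED|NOT_BUILT)")
RESUMED = re.compile(r"Resuming build at .* after Jenkins restart")


def resumed_after_finish(log_lines):
    """Return True if a 'Resuming build' line appears after a 'Finished:' line."""
    finished_seen = False
    for line in log_lines:
        if FINISHED.search(line):
            finished_seen = True
        elif finished_seen and RESUMED.search(line):
            return True
    return False
```

Feeding it the lines of a build's console log would flag builds that show the behavior reported here, without saying anything about the cause.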

Steps to Reproduce

Install and use the JTE in day to day business in the above configuration and you will get the behavior.

steven-terrana commented 1 year ago

I am happy to leave this issue open.


"Install and use the JTE in day to day business in the above configuration and you will get the behavior"

Is there any more information you're able to provide?

The primary blocker for resolving this bug has been the absence of a consistently reproducible test case that could be translated into a failing unit test from which to begin debugging.

brosmar commented 1 year ago

What information, besides the two duplicate issues, do you think I should provide?

https://github.com/jenkinsci/templating-engine-plugin/issues/309 https://github.com/jenkinsci/templating-engine-plugin/issues/187

The problem cannot be tied to a specific job run, but it affects all jobs that use the JTE. At the moment that is far more than 100 jobs and more than 1000 build results. No other jobs are affected. All JTE build results that were formerly green and OK are marked red, as in the screenshot above, all at once. The event that triggers the behavior is still unknown. You can imagine that the users of our templates are heavily annoyed. CloudBees support is not helpful in this case because the JTE is not a supported CloudBees plugin.

Sorry that I am not able to give you more details. Maybe you can request specific information; I will try to get it from our Jenkins operations team.

Kind regards Martin

brosmar commented 1 year ago

Hello JTE Team. I can add the following Information.

Maybe this helps with investigating the cause.

brosmar commented 11 months ago

Hello JTE Team. I have feedback from the CloudBees team. They have analyzed the issue and suggested that I share this information with you. Maybe it will help to find the root cause.

Here is the answer from CloudBees Support:

I've discussed the issue with some colleagues in the Engineering team. The done attribute in the execution build.xml drives the resume on startup: https://github.com/jenkinsci/workflow-cps-plugin/blob/3817.vd20b_7e2b_692b_/doc/persistence.md. As I anticipated, this value is set to false in your builds even if the execution completed successfully:

...
<done>false</done>
<resumeBlocked>false</resumeBlocked>
</execution>
<completed>false</completed>
...

JTE plugin seems to inject some logic around the pipeline run execution. There are lifecycle hooks that you can define, in particular:

https://github.com/jenkinsci/templating-engine-plugin/blob/0af836f6465f80a078a02c6[…]3/docs/how-to/library-development/lifecycle-hooks-on-failure.md

We tend to believe that this implementation might be breaking the attributes shown above in the build.xml file. If this is only happening in JTE jobs, you might want to share this finding with the plugin maintainers.

On the other hand, the property does not seem to be editable because it is configured from a template. However, when you use disableResume() in the Jenkinsfile, the property is not passed to the job, which seems to be a bug that you could report to the plugin maintainer.
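To see how many builds carry the state described in the support answer, one could scan the build.xml files on the controller for the three flags shown in the excerpt. The Python sketch below is only an illustration: the function names are hypothetical, it uses a plain text match rather than an XML parser, and it assumes the flags appear in build.xml exactly as in the excerpt.

```python
import re
from pathlib import Path

# Matches the three boolean elements shown in the build.xml excerpt above.
FLAG_PATTERN = re.compile(r"<(done|resumeBlocked|completed)>(true|false)</\1>")


def read_flags(build_xml_text):
    """Return the flags found in the text as a dict of name -> bool."""
    return {name: value == "true" for name, value in FLAG_PATTERN.findall(build_xml_text)}


def find_unfinished(jobs_root):
    """Yield (path, flags) for every build.xml whose execution is recorded as not done."""
    for path in sorted(Path(jobs_root).rglob("build.xml")):
        flags = read_flags(path.read_text(encoding="utf-8", errors="replace"))
        if flags.get("done") is False:
            yield path, flags
```

Calling `find_unfinished("/var/jenkins_home/jobs")` (the jobs directory that appears in the stack trace above) would list the candidate builds. Whether a listed build really finished successfully still has to be checked against its recorded result.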

lvalverderodriguez commented 10 months ago

Hi team,

It seems that JTE pipelines ignore the value of resumeBlocked in the build.xml and/or the pipeline property disableResume in the configuration.

Symptom

What is the end user experiencing? Failed JTE pipelines are resumed after a Jenkins restart even if resume is disabled in the pipeline configuration. This happens regardless of the syntax used, declarative or scripted.

Evidence/Detail

What information has been collected or researched so far that helps with the analysis? The done and resumeBlocked attributes in the execution build.xml drive the resume on startup: https://github.com/jenkinsci/workflow-cps-plugin/blob/3817.vd20b_7e2b_692b_/doc/persistence.md.

Point to relevant files if appropriate. The issue can be observed using any of the code below as JTE pipeline code:
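As a rough mental model of how these two attributes interact, the startup decision can be written as a single predicate. This Python sketch is a deliberate simplification and an assumption on my part, not the plugin's actual logic: the linked persistence document describes further conditions that are ignored here, and the function name is hypothetical.

```python
def would_attempt_resume(done: bool, resume_blocked: bool) -> bool:
    """Simplified model: resume is attempted only for executions that are
    not marked done and are not blocked from resuming."""
    return (not done) and (not resume_blocked)


# A finished build whose flags were persisted as in the excerpt
# (done=false, resumeBlocked=false) would be picked up for resume.
assert would_attempt_resume(done=False, resume_blocked=False)
# With resume blocked, or with done=true, no resume would be attempted.
assert not would_attempt_resume(done=False, resume_blocked=True)
assert not would_attempt_resume(done=True, resume_blocked=False)
```

Under this model, a JTE build stored with done set to false and resumeBlocked set to false is resumed on restart regardless of what the pipeline requested, which matches the symptom described.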

// Scripted syntax
properties([disableResume()])

node {
    echo "Hello World!"
    sleep 60
    echo "Bye World!"
}

// Declarative syntax
pipeline {
    agent none
    options {
        disableResume()
    }
    stages {
        stage('Example') {
            agent any
            steps {
                echo 'Hello World'
                sleep 60
                echo "Bye World!"
            }
        }
    }
}

Reproduction Steps

How to reproduce the issue

1/ Install JTE plugin
2/ Create a JTE pipeline providing pipeline configuration from console:

properties([disableResume()])

node {
    echo "Hello World!"
    sleep 60
    echo "Bye World!"
}

3/ Abruptly restart the controller before the job has finished successfully.
4/ Check that a resume of the pipeline execution was attempted.
5/ Create a regular pipeline providing pipeline configuration from console:

properties([disableResume()])
node {
    echo "Hello World!"
    sleep 60
    echo "Bye World!"
}

6/ Abruptly restart the controller before the job has finished successfully.
7/ Check that no resume of the pipeline execution was attempted (as expected).

Has there been a successful attempt to reproduce the issue? Yes, following the steps above in CloudBees CI Client Controller 2.426.1.1-rolling.

If issue is intermittent/not reproducible, say that. It is consistent.

What is the expected behavior vs. the actual behavior? It is expected that builds are not resumed when resume is disabled for the JTE pipeline, that is, when resumeBlocked is set to true in the build.

I hope this helps with the investigation.

brosmar commented 9 months ago

@steven-terrana Hello Steven, the above post is from the CloudBees support team. They investigated the problem and found that the disableResume flag seems to be ignored or manipulated by your templating engine.

Is this information helpful?

madhu91s commented 4 months ago

Is there a solution for this problem? We have the same problem in our organization too.

cokieffebah commented 4 months ago

@brosmar @madhu91s looking at this ticket instead of #309

madhu91s commented 4 months ago

Just as info to reproduce the scenario: we have been using Cloudogu systems with integrated Git and Jenkins as Docker containers. Jenkins is scheduled for a nightly restart every day. That is when the JTE plugin (after the restart) cannot fetch the actual status of the job, but instead fails on a particular stage and marks all the previous builds as failed (just like in the image @brosmar posted). Looking at the Jenkins logs did not really help.

cokieffebah commented 4 months ago

@madhu91s it would be really helpful if you could give us a minimal public repository (JTE configuration and target build repository) that replicates the problem.
Also, is it only replicable on a CloudBees controller and not on Jenkins LTS? I will have to get management sign-off to get CloudBees, mostly to check that the license does not unexpectedly bind my company. Thanks in advance.