Locks are not released if Jenkins crashes

gabloe commented 5 years ago

If jenkins crashes and it is brought back up any locks that were locked previously aren't released even after the pipeline creating the lock has failed/completed.

TobiX commented 5 years ago

Interesting, while we have some tests which test "hard" kills of single Jenkins jobs, we don't have anything to check what happens when Jenkins itself dies... Which durability setting are you using with your pipelines? I had the expectation that pipelines would be able to release a lock if they were properly resumed...

jimklimov commented 5 years ago

We have had this issue in our local setup, and in fact it is a difficult philosophical problem to tackle: nowadays, Jenkins often runs as a master doing just dispatching and accounting with agents doing the actual work. While some agent types (e.g. spawned over an SSH session to some system) do have lifecycles tied to the master JVM uptime and connectivity, many are independent (e.g. swarm agents, JNLP, different dockers and VMs through some management service of theirs, etc.)

This means (and I think was a design goal or bonus, rather explicitly with pipelines) that in many cases, restart of the master (crash, maintenance, updates, etc.) should have no impact on the actually running jobs, as long as they are spawned as child processes of some agent and that one survives the master restart and just reconnects when time comes.

Due to this, we chose against trying to "fix" this plugin and have it reset the list of lockable resources during initialization; instead, we added a custom job (pipeline with groovy tapping into the master's Java objects) to run once in a while and release resources that are not factually used according to the custom logic we defined there.

That said, it may make sense to after all "fix" the plugin to reset the list in a smart fashion: if the resource is held by a build (name, number) and this build is not actually running/queued/... (maybe suffices that it has some result/status?) then release those. But this would have to be careful about running this logic after the server initialization has stabilized, so the list of known builds is reliable.

Locks held by humans or by jobs still running thanks to their surviving agents should stay locked after a master restart.

Additionally note: I am not sure if there is a use-case in the wild also for jobs passing the lock from one to another. We have an equivalent with a wrapper-job holding and eventually releasing the lock, and calling many other jobs that constitute our legacy equivalent of a testing pipeline. But I've also tested that on Java side (so in pipeline or in system groovy steps) it is possible to just assign a resource as held by some string (user name, job reference, ...) and so somebody somewhere might have such custom passing of the olympic token (e.g. named via build args) without a wrapper job.

To cater for such possibility, or others unthought of yet, I guess such auto-cleanup should be optional (and off by default).

jimklimov commented 5 years ago

For clarity, our custom cleanup logic is along the lines of (frankencode from SE/SO/google/...):

@NonCPS
public boolean cleanup_bad_locks(boolean READONLY_SANITYCHECK = false, boolean DEBUG = false) {
    int countCleaned = 0
    def all_lockable_resources = org.jenkins.plugins.lockableresources.LockableResourcesManager.get().resources

    all_lockable_resources.each { r->
        if (r.isLocked()) {
            b = r.getBuild()
            if (b) {
                if ( ! b.isBuilding() && ! b.getResult().equals(null) ) {
                    // No big deal to be noisy here, this is not expected to fire often - so no tracing toggle
                    println "MISMATCH FOUND: build:" + b + " is not building and result is " + b.getResult() + " yet the lock " + r + " is locked."
                    println "ACTION : RELEASE LOCK " + r

                    if (DEBUG) {
                        println "getLockCause:" + r.getLockCause()
                        println "getDescription:" + r.getDescription()
                        println "getReservedBy:" + r.getReservedBy()
                        println "isReserved:" + r.isReserved()
                        println "isLocked:" + r.isLocked()
                        println "isQueued:" + r.isQueued()
                    }

                    //release the lock
                    println (READONLY_SANITYCHECK ? "Would be " : "") + "Releasing..."
                    if (!READONLY_SANITYCHECK) r.reset()
                    countCleaned++

                    if (DEBUG) {
                        println "getLockCause:" + r.getLockCause()
                        println "getDescription:" + r.getDescription()
                        println "getReservedBy:" + r.getReservedBy()
                        println "isReserved:" + r.isReserved()
                        println "isLocked:" + r.isLocked()
                        println "isQueued:" + r.isQueued()
                    }

                    println ""
                }
            }
        }
    }

    if ( countCleaned == 0 ) {
        println "SUCCESS : No resources were mis-allocated to need a correction"
    }

    return ( countCleaned == 0 )
}

This form is for groovy in a Jenkinsfile, body of the routine should work largely unchanged in e.g. $JENKINS_URL/script console.

jenkinsci / lockable-resources-plugin

Locks are not released if Jenkins crashes #137