gocd / gocd

GoCD - Continuous Delivery server main repository
https://www.gocd.org
Apache License 2.0

Unlock a locked pipeline if stage fails #106

Closed: mdaliejaz closed this issue 6 years ago

mdaliejaz commented 10 years ago

Description

As a Go user, I want a locked pipeline to get unlocked if a stage fails, so that a check-in with the fix triggers the pipeline without me having to unlock it manually.

Pipeline staying locked even if a stage fails was intentional. Consider a deployment pipeline: if it fails at a stage, the failure could be due to environment issues that need to be sorted out before the stage can proceed, in which case there is no point in another instance starting to run. But if the issue is with the deployment script, then another check-in is needed and the pipeline should get triggered even though a stage had failed. So essentially the user should have an option to unlock it if needed. The exact approach is yet to be decided.

dhawal55 commented 10 years ago

Is there a way to vote on open issues? My vote goes to this issue. Locking a pipeline should be optional so teams can decide whether they need the pipeline locked when something fails. I hate having to unlock it manually from the UI.

kmugrage commented 10 years ago

@dhawal55 I'm not sure you're speaking about exactly the same issue. Locking a pipeline is optional.

I believe this enhancement says that if you've chosen to lock a pipeline, you should have another option that unlocks it when it stops, even if it stopped because of a failure.

dhawal55 commented 10 years ago

Sorry for the confusion. I meant the same thing. If I choose the automatic pipeline locking option for my deployment pipeline and the pipeline fails on one of the stages (except the last one), the pipeline gets locked. A new check-in (with the fix) doesn't trigger the pipeline again till someone manually unlocks the pipeline. I would like the pipeline to pick up the next build in queue without having to manually unlock it.

sriramnrn commented 10 years ago

I wrote an objection to this feature, but then realized that it can be of much use in some cases.

Therefore, something like "Pipeline locking" followed by "Accept new requests when idle" (with the default being unchecked) would be useful.

-- Ram

kief commented 9 years ago

The current behaviour confuses me. I assumed the reason for the option to lock a pipeline is to prevent it from running on multiple agents concurrently. But if a stage fails, the pipeline is done; I expect a commit that fixes the issue to trigger a new run of the pipeline.

It's interesting that this only happens on pipelines with multiple stages. If I have a locked pipeline with one stage, failure doesn't keep the pipeline locked.

I can understand there are some cases where you'd want a failure in a multi-stage locked pipeline to keep the pipeline locked, for example if there is some state (e.g. data) which needs to be manually fixed before the pipeline can be started from the beginning again. But I'd try to design pipelines so this isn't a problem. Keeping a pipeline locked on failure should, IMO, be an option, not the default.

elimydlarz commented 9 years ago

We could also use this on our project.

In our case, we have a stage for deploying to CI (deploy-ci) and a stage for running browser tests on CI (test-ci). If build 58 is in the deploy-ci stage and build 59 is in the stage immediately before that, and GoCD alternately runs stages of both builds, it is possible that build 59 will run deploy-ci before build 58 runs test-ci, thus running the browser tests against the wrong build.

srinivasupadhya commented 9 years ago

Hack: I have added a notification end-point, which should be available in 15.1.0. Hopefully you should be able to write a plugin for this and use the pipeline unlock API to do the unlocking.
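
For reference, a minimal sketch of what a call to that unlock API could look like (the host, credentials and pipeline name below are placeholders; the releaseLock endpoint is the one used by the scripts later in this thread):

#!/usr/bin/python
# Minimal sketch: release the lock on a single pipeline via the GoCD API.
# Host, credentials and pipeline name are placeholders.
import requests

resp = requests.post('http://localhost:8153/go/api/pipelines/my-pipeline/releaseLock',
                     auth=('user', 'password'))
if resp.status_code != 200:
    print('Failed to unlock: HTTP %s' % resp.status_code)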

dpsenner commented 9 years ago

:+1: to at least making the behaviour when a locked pipeline fails configurable (options would be "keep locked" and "unlock automatically").

For us it is immensely painful having to unlock a pipeline whenever an integration test fails. On top of that, the pipeline re-runs for several configurations and one ends up hitting that "unlock" button more than 100 times a day! So please bump this issue's priority and get it fixed in the next release.

sgran commented 9 years ago

We use this code from cron. It's not pretty, but it gets us by until this is fixed with a config option.

#!/usr/bin/python

import requests

auth_user = 'XXXX'
auth_pass = 'YYYY'
pipelines = []
pipeline_conf = requests.get('http://localhost:8153/go/api/config/pipeline_groups', auth=(auth_user, auth_pass)).json()

for group in pipeline_conf:
    for pipeline in group['pipelines']:
        pipelines.append(pipeline['name'])

for pipeline in pipelines:
    status = requests.get('http://localhost:8153/go/api/pipelines/%s/status' % pipeline, auth=(auth_user, auth_pass)).json()
    history = requests.get('http://localhost:8153/go/api/pipelines/%s/history' % pipeline, auth=(auth_user, auth_pass)).json()
    building = False

    # Pipeline just created
    if not history['pipelines']:
        continue

    for stage in history['pipelines'][0]['stages']:
        if stage.get('result', None) is not None:
            if stage['result'] == 'Unknown':
                building = True
                break
    if building:
        continue

    if status['paused'] is False and status['locked']:
        unlock = requests.post('http://localhost:8153/go/api/pipelines/%s/releaseLock' % pipeline, auth=(auth_user, auth_pass))
        if unlock.status_code != 200:
            print "Failed to unlock %s" % pipeline

srinivasupadhya commented 9 years ago

@sgran - neat. Suggestion: initially use the status API only, and hit the history API to do the rest of the logic only if the pipeline is locked. That way you won't be hitting the history API for pipelines that are not locked. Also, I have introduced an instance API which should be available in 15.1. I will see if I can add /api/pipelines/<pipeline-name>/latest which you can use instead of the history API. It should definitely be cheaper in terms of resource usage and hence faster.

sgran commented 9 years ago

Good call. Updated version attached if someone else wants it for now:

#!/usr/bin/python

import requests

auth_user = 'XXXX'
auth_pass = 'YYYY'
pipelines = []
pipeline_conf = requests.get('http://localhost:8153/go/api/config/pipeline_groups', auth=(auth_user, auth_pass)).json()

for group in pipeline_conf:
    for pipeline in group['pipelines']:
        pipelines.append(pipeline['name'])

for pipeline in pipelines:
    status = requests.get('http://localhost:8153/go/api/pipelines/%s/status' % pipeline, auth=(auth_user, auth_pass)).json()
    if not (status['paused'] is False and status['locked']):
        continue

    history = requests.get('http://localhost:8153/go/api/pipelines/%s/history' % pipeline, auth=(auth_user, auth_pass)).json()
    building = False

    # Pipeline just created
    if not history['pipelines']:
        continue

    for stage in history['pipelines'][0]['stages']:
        if stage.get('result', None) is not None:
            if stage['result'] == 'Unknown':
                building = True
                break
    if building:
        continue

    unlock = requests.post('http://localhost:8153/go/api/pipelines/%s/releaseLock' % pipeline, auth=(auth_user, auth_pass))
    if unlock.status_code != 200:
        print "Failed to unlock %s" % pipeline

dpsenner commented 9 years ago

Have you observed any negative side effects or even collateral damage (growing disk usage, memory, ...) if this script is run regularly (i.e. every 5 seconds)?

dpsenner commented 9 years ago

Thanks for sharing that script. I've adapted it a little bit such that the host can be configured and that it writes logs both to stdout and to a file, while at the same time keeping the noise low. Take it:

#!/usr/bin/python
import requests
import datetime
import sys
import time
import traceback

host = 'go-server-host'
auth_user = 'replace'
auth_pass = 'me'
# path to a logfile or None if it should not log to a file
path_to_logfile = "logfile.log"
pipelines = []
pipeline_conf = requests.get('http://{0}:8153/go/api/config/pipeline_groups'.format(host), auth=(auth_user, auth_pass)).json()

for group in pipeline_conf:
    for pipeline in group['pipelines']:
        pipelines.append(pipeline['name'])

def printStatus(msg, pipe = None):
    message = "{0} {1}\n".format(datetime.datetime.now(), msg)
    if pipe is None:
        pipe = sys.stdout
    if pipe == sys.stdout and path_to_logfile:
        # log to file
        with open(path_to_logfile, "a") as fh:
            fh.write(message)
    pipe.write(message)

def printPipelineStatus(pipeline, message, pipe=None):
    formattedMsg = ("{0} {1}").format(pipeline, message)
    if pipe is None:
        pipe = sys.stderr
    printStatus(formattedMsg, pipe)

def main():
    pipelinesBuilding = {}
    while True:
        try:
            for pipeline in pipelines:
                status = requests.get('http://{0}:8153/go/api/pipelines/{1}/status'.format(host, pipeline), auth=(auth_user, auth_pass)).json()
                if not (status['paused'] is False and status['locked']):
                    #printPipelineStatus(pipeline, "not locked")
                    if pipeline in pipelinesBuilding:
                        del pipelinesBuilding[pipeline]
                        printPipelineStatus(pipeline, "finished building", sys.stdout)
                    continue

                history = requests.get('http://{0}:8153/go/api/pipelines/{1}/history'.format(host, pipeline), auth=(auth_user, auth_pass)).json()
                building = False

                # Pipeline just created
                if not history['pipelines']:
                    continue

                for stage in history['pipelines'][0]['stages']:
                    if stage.get('result', None) is not None:
                        if stage['result'] == 'Unknown':
                            building = True
                            break
                if building:
                    if pipeline not in pipelinesBuilding:
                        pipelinesBuilding[pipeline] = True
                        printPipelineStatus(pipeline, "started building", sys.stdout)
                    continue
                elif pipeline in pipelinesBuilding:
                    del pipelinesBuilding[pipeline]
                    printPipelineStatus(pipeline, "finished building", sys.stdout)

                unlock = requests.post('http://{0}:8153/go/api/pipelines/{1}/releaseLock'.format(host, pipeline), auth=(auth_user, auth_pass))
                if unlock.status_code != 200:
                    printPipelineStatus(pipeline, "failed to unlock!!", sys.stdout)
                else:
                    printPipelineStatus(pipeline, "unlocked", sys.stdout)
        except requests.exceptions.ConnectionError, ex:
            printStatus(ex.message, sys.stdout)
        except ValueError, ex:
            printStatus(ex.message, sys.stdout)
        time.sleep(5.0)

if __name__ == '__main__':
    try:
        main()
    except KeyboardInterrupt:
        pass
    except:
        printStatus(traceback.format_exc(), sys.stdout)

sgran commented 9 years ago

Funny you should ask about memory issues - I'm looking at #496 at the moment. We are seeing what look like memory leaks, but it's not clear it's this script - we have several that look at the various APIs for information.

Ehekatl commented 8 years ago

+1. I'm already using the API to release the lock, but it's not possible to call it to unlock a pipeline from within that pipeline while it's still running; this is a design problem. Having a cron job check every 5 seconds? What a horrible idea. Why not make GoCD a little more user-friendly and support this directly?

RehanSaeed commented 8 years ago

+1

PredatorVI commented 8 years ago

We are approaching 2 years since this was opened, and I was wondering what the status of a fix is. Is it being planned? It seems to have stalled.

trisapeace commented 8 years ago

+1 I would definitely use this feature.

ntodorov commented 8 years ago

+1. Very important for DB deployments with several stages, starting with dropping the DB and then creating it.

mbursi commented 7 years ago

+1 much needed feature!

zac-rivera commented 7 years ago

+1 This is an important feature that is needed for certain stages in our system.

jfoboss commented 6 years ago

+1 We need this feature too! There are dozens of DevOps operations where a pipeline could fail. Manual unlocking is very annoying :(

silasg commented 6 years ago

+1

arvindsv commented 6 years ago

I have been thinking about the scenarios for this, and here's what I think:

Ideally, we want to break up the isLocked attribute on the pipeline config into two. However, to me, a lockable pipeline seems to imply a single instance; it makes no sense to have a lockable pipeline where multiple instances can run. If that were allowed and something failed, the other concurrent runs would need to continue, which defeats the purpose of locking.

So, it looks like the options should be:

  1. Automatically lock pipeline (run single instance and lock pipeline on failure) [existing option]
  2. Run only one instance at a time (run single instance and do not lock pipeline on failure)

Alternatively, we could have an option which says "Unlock pipeline even on failure". That would seem more natural, but it makes sense only when "Automatically lock pipeline" is set.

Finally, what does it mean if you have a pipeline with three stages, where the third stage is manual and this attribute is set? Does it mean that no new pipeline run happens unless the third stage has been triggered and has finished? I would suppose so.

Any thoughts?
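
For concreteness, a sketch of how the two options above might surface in the pipeline XML config; the attribute and value names below are illustrative assumptions, not something settled in this thread:

<!-- Illustrative sketch only; attribute and value names are assumptions. -->
<!-- Option 1 (existing behaviour): single instance, stays locked on failure -->
<pipeline name="deploy-prod" lockBehavior="lockOnFailure">
  <!-- materials, stages, ... -->
</pipeline>

<!-- Option 2 (proposed): single instance, unlocks when the run finishes, even on failure -->
<pipeline name="deploy-prod" lockBehavior="unlockWhenFinished">
  <!-- materials, stages, ... -->
</pipeline>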

PredatorVI commented 6 years ago

My $.02 is that the terminology is confusing. Using different terminology, pipeline scheduling could be described as either serial or parallel.

Then there is the issue of pausing a pipeline in case of failure. This would affect both serial and parallel pipelines: in either case, a pipeline should be prevented from being scheduled if a previous failure has occurred. This seems to be an edge case, since a normal continuous deployment workflow is meant to happen ... um, continuously, even if/when failures happen: new check-ins should trigger new pipelines even if a previous pipeline failed (again, the normal C/D workflow).

In my mind, the only edge case that I can think of where someone might want to suspend a pipeline on failure is:

If a pipeline is causing havoc, then it can be manually paused or set to be manually triggered and would be the exception rather than the rule, but I could be wrong.

A possible implementation of the above in the UI might look like:

[ ] Parallel scheduling
[ ] Pause on failure

Depending on the desired default, a pipeline may be scheduled serially by default, with a setting to allow parallel scheduling. I actually prefer serialized scheduling by default since most of my pipelines include automated deployment and QA steps and parallel scheduling would mess up the automation (but that's just me).

Next, if pausing a pipeline on failure is desired, this option could be selected. This is effectively the automated equivalent of "pausing" the pipeline until remediation has occurred. Maybe it could even piggyback on the existing pause mechanism and show "Paused by: System" with "Reason: Job [jobName] failure".

The concept of a "lock" seems to be more of an implementation detail for serial pipelines and I think using that terminology is confusing at the user level.
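
In the same spirit as the unlock scripts earlier in this thread, here is a rough sketch of emulating "pause on failure" from an external script; the pause endpoint's 'pauseCause' form field and 'Confirm' header are assumptions based on older GoCD API conventions, not something confirmed here:

#!/usr/bin/python
# Rough sketch: pause a pipeline via the API once a failure has been detected.
# Host, credentials and pipeline name are placeholders; 'pauseCause' and the
# 'Confirm' header follow older GoCD API conventions (assumption).
import requests

host = 'localhost'
auth = ('user', 'password')
pipeline = 'deploy-prod'

resp = requests.post('http://%s:8153/go/api/pipelines/%s/pause' % (host, pipeline),
                     data={'pauseCause': 'Paused by script after stage failure'},
                     headers={'Confirm': 'true'},
                     auth=auth)
if resp.status_code != 200:
    print('Failed to pause %s: HTTP %s' % (pipeline, resp.status_code))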

FredrikWendt commented 6 years ago

Out of our 341 pipelines, some 20 need serialization due to a pattern of "acquire non-virtual machine / do something with that machine / tear down". They typically need some specific hardware, or a VM behind some specific hardware/network equipment. All other pipelines don't suffer from this, and the work can be more dynamic. Parallel scheduling is the default mode for us.

bliles commented 6 years ago

+1 for allowing only one build at a time on a pipeline without requiring the pipeline to be manually unlocked on failure.

We have pipelines that have very simple build steps, but multiple deployments. There are configuration artifacts that are generated for each unique deployment so we need them to block each other, but we don't want to have to manually unlock the pipeline on failure.

rajiesh commented 6 years ago

Verified on 17.12.0 (5626-cb7df2ffe421e43f2a682a7a323cb3a3e30734cc)