[Feature] Extend gates as a pipeline "step" feature

tonglil commented 7 years ago

Drone has a built-in "gated builds" feature that requires someone to approve a build. I was thinking about how to extend this feature into a pipeline step that users can add to create additional "gates".

This could allow for a pipeline declaration that allows:

Different teams to control passing to the next stage of the pipeline
Multi-staged deployments, such as canary analysis deployments (gate, deploy: 5%, gate, deploy: 25%, gate, deploy: 50%, gate, deploy: 100%)

With the work being done in https://github.com/drone/drone-ui/issues/130, it would give the person approving or rejecting the gate visibility into what the next step(s) are.

With each build running on a Drone agent, and a deployment that supports scaling agents, this doesn't have a great impact on resource consumption or block the build queue, since the build is "paused" during the "gate" and the only container running is the "gate" container (build/test containers are already finished).

In summary, this does not add a new "feature" to the pipeline, but rather exposes the "gate" feature to the UI and users.

Sorry if this was brought up before and decided to be too complex.

bradrydzewski commented 7 years ago

Today you can enable multi-stage builds using the promotion (i.e. drone deploy) functionality. If you are not yet familiar with it, check out http://docs.drone.io/deployments/. This allows you to string together multiple pipelines, which you could hypothetically gate in-between.

I think we should explore this capability more to see if it can meet your needs, and how it could be improved to feel more seamless. I think it enables the same workflow, but requires thinking about the problem in a different context.

In terms of the pausing builds, the challenge is that it fundamentally changes how drone works. Right now builds and servers are ephemeral. If we start pausing builds we have to maintain state (the workspace) on that server. What happens if we lose the server? How does it recover? How do we restore the workspace so it can resume? How do we know when the server is lost? Does drone have to implement server accounting? etc ...

I would prefer to keep drone ephemeral and avoid maintaining state. Perhaps we can design a solution around this constraint? Would love to get your thoughts.

tonglil commented 7 years ago

I think drone deploy will definitely fulfill some use cases.

Here is one that might be improved by this: applying terraform configurations.

Terraform updates are usually applied in stages (get the current state, verify the update, and apply). With a gated pause, we can gate the verify stage and apply only after someone has inspected the update and confirmed that it will not break anything or do something strange.

Although I may be able to think of a way to re-code it to use drone deploy, the challenge here is persisting the verified state file for application.

Preliminary thoughts

Some preliminary thoughts on the technical feasibility:

"Pausing" requires you to use a plugin that can save/restore the state from some external location (like S3 or GCS).

This would mean there is a way to package up the workspace (like a ZIP) or somehow serialize it for storage.

If the workspace could be saved alongside the .drone.yml config being executed, that could mean the containers are still ephemeral: when the approve button is hit, it downloads the "bundle" and parses the next step's container from the attached ".drone.yml".

There could be a timeout that determines when the save/restore packaging operation takes place. Since packaging could be an expensive operation, it would only start after the build has been paused for X (let's say 30 minutes).
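
As a rough illustration of the packaging idea above, here is a minimal Python sketch that bundles a workspace directory (including its .drone.yml) into a single archive and restores it elsewhere. It is not Drone code: the function names `snapshot_workspace` and `restore_workspace` are hypothetical, and uploading the bundle to S3 or GCS is left out.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def snapshot_workspace(workspace: Path) -> bytes:
    """Pack the whole workspace, .drone.yml included, into a gzipped tar bundle."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as archive:
        # arcname="." stores paths relative to the workspace root.
        archive.add(str(workspace), arcname=".")
    return buf.getvalue()


def restore_workspace(bundle: bytes, target: Path) -> None:
    """Unpack a bundle produced by snapshot_workspace into the target directory."""
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(bundle), mode="r:gz") as archive:
        archive.extractall(str(target))


# Round trip: snapshot on one "agent", restore on another.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / ".drone.yml").write_text("kind: pipeline\n")
    (src / "plan.out").write_text("verified terraform plan\n")

    bundle = snapshot_workspace(src)  # in practice, upload this to remote storage

    dst = Path(tmp) / "dst"
    restore_workspace(bundle, dst)
    restored = sorted(p.name for p in dst.iterdir())
    restored_plan = (dst / "plan.out").read_text()
```

Because the bundle carries both the files and the config, whichever agent picks up the resumed build needs nothing from the agent that ran the earlier steps.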

Idea

Another way to look at this is an automated "drone downstream" plugin on the same repo, with the "gating" feature applied only on the downstream execution (that execution being the next "set" of steps).

Can you visualize this other "approach"? (Sorry for using so many quotes.)

bradrydzewski commented 7 years ago

Although I may be able to think of a way to re-code it to use drone deploy, the challenge here is persisting the verified state file for application.

One notable improvement with version 0.5 and higher is that we support cache plugins. In 0.4 you could only cache and restore from local disk. In 0.5 and higher you can write a plugin with your own custom cache, sync and storage implementations.

In addition to local volume cache plugins (which behave similarly to 0.4) there are now SFTP and S3 plugins that you can use to sync to a central, remote storage location. The benefit is that if your build (or deployment) pipeline executes on different servers they can still access and restore the cache.

"Pausing" requires you to use a plugin that can save/restore the state from some external location.

This would mean there is a way to package up the workspace (like a ZIP) or somehow serialize it for storage.

As mentioned above, you could use a cache plugin to cache and restore artifacts (the workspace) so they are available when using drone deploy. The issue with caching is that it can be quite slow. I am hoping to see Docker implement volume snapshots. This could open up a new world of possibilities for us. Reference issue https://github.com/moby/moby/issues/33782

If docker has a fast, native way to extract volumes I think this would be much easier, faster, less prone to failure, and would make this feature (pausing a pipeline mid-step) much more feasible. So let's hope docker considers this capability 😄

Another way to look at this is an automated "drone downstream" plugin on the same repo, with the "gating" feature applied only on the downstream execution (that execution being the next "set" of steps).

This is probably feasible today with minimal change. You could execute a drone deploy at the end of your pipeline. Then you could gate deployments. Note that we don't have custom gating plugins enabled yet, but when available, this would be possible.

This doesn't solve performance issues with caching, but it might be a good place to start, and is something I could see us launching with 1.0.

tonglil commented 7 years ago

Would you want me to keep this issue open for ideation? Or close this issue (for now) and open a new issue to track "gating" as a plugin? It seems like that has to be the first step for any kind of mid-pipeline gating.

bradrydzewski commented 7 years ago

Would you want me to keep this issue open for ideation?

Keep it open for sure. I think pausing and resuming (or chaining) pipelines is important. I think drone does provide some simple workarounds (documented above) but at a minimum they need to feel more integrated and less like workarounds. Even better if we can come up with something more robust.

I am leaning toward some sort of snapshot capability where the workspace is snapshotted and archived in pluggable storage (S3, etc) and then restored when the pipeline resumes (like you mentioned above in your preliminary thoughts). Snapshots would be slow and could take minutes to back up and restore, but this keeps our agents ephemeral, and perhaps speed isn't the primary goal for larger, more complex pipelines that probably already take tens of minutes or even hours ... We would also need to come up with a mockup for the YAML.

I definitely want to continue to brainstorm and see what we can come up with.

bradrydzewski commented 7 years ago

Let's also collect some use cases for pausing and resuming (or stringing together) pipelines. Here are some initial use cases that come to mind:

Wait for approval
Wait for approval or wait N hours. Whichever comes first
Blue/Green deployments. Pause for healthchecks. Continue or Rollback
Canary deployments. Pause for testing. Cutover or Cancel
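
The use cases above mostly differ in what happens when nobody responds in time. The following Python sketch models that as a single decision function. It is only an illustration: `Gate`, `Decision`, and the `on_timeout` field are hypothetical names, not part of Drone.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    WAITING = "waiting"
    PROCEED = "proceed"   # continue / cut over
    ABORT = "abort"       # cancel / roll back


@dataclass
class Gate:
    opened_at: float                        # time (seconds) the gate started waiting
    timeout: Optional[float] = None         # seconds to wait; None means wait forever
    on_timeout: Decision = Decision.PROCEED # what to do if nobody answers in time

    def evaluate(self, now: float, approved: Optional[bool] = None) -> Decision:
        # An explicit human decision always wins over the timeout.
        if approved is True:
            return Decision.PROCEED
        if approved is False:
            return Decision.ABORT
        # No decision yet: apply the timeout policy if the deadline has passed.
        if self.timeout is not None and now - self.opened_at >= self.timeout:
            return self.on_timeout
        return Decision.WAITING


# "Wait for approval": no timeout, so the gate waits indefinitely.
plain = Gate(opened_at=0.0)
# "Wait for approval or wait N hours": proceed automatically after 2 hours.
timed = Gate(opened_at=0.0, timeout=2 * 3600)
# Canary: cancel if nobody confirms the test results within 30 minutes.
canary = Gate(opened_at=0.0, timeout=30 * 60, on_timeout=Decision.ABORT)
```

Health-check-driven cases such as blue/green deployments fit the same shape if an automated check supplies the `approved` value instead of a person.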

bradrydzewski commented 7 years ago

One additional thing to consider ... right now we don't have multi-machine fan-in or fan-out. This means if you are running a matrix build, for example, there is no way to have a final step that runs after the matrix is complete, because a single pipeline is spread across machines. And a single pipeline in a matrix cannot wait for all the other pipelines in the matrix to complete.

I feel like this is tangentially related to this issue. They are different use cases, but they both deal with potentially moving pipeline workspaces across machines, pausing, resuming, etc. At least something to consider ... maybe we can kill two birds with one stone.

bradrydzewski commented 6 years ago

@tonglil turns out the multi-machine changes I'm making in 0.9 enable stringing together multiple pipelines. You can then block a pipeline pending approval. The yaml is being slightly adjusted (with backward compatibility) to accept the following:

pipeline:
  name: frontend
  steps:
    - name: build
      image: node
      commands:
        - npm install
        - npm run dist
    - name: test
      image: node
      commands:
        - npm test
---
pipeline:
  name: backend
  steps:
    - name: build
      image: golang
      commands:
        - go get
        - go build
    - name: test
      image: golang
      commands:
        - go test -cover

trigger:
  branches:
    - master

depends_on: [ frontend ]

In the above example we could (for example) execute the "frontend" pipeline and then once complete, block the "backend" pipeline pending approval. We just need to figure out exactly how we want to signal the pipeline requires manual approval. Any thoughts on the syntax you would like to use to facilitate this?

Is there any yaml prior art we can reference?

bradrydzewski commented 6 years ago

@tonglil also we need to find a better way to prevent tampering with the yaml. Currently we are checking to see if the yaml is new (has not been seen before) but in support of multi-machine builds we've changed the internal implementation, and this approach no longer works. We might have to revisit an (optional) signature file if you need a guarantee against tampering ☹️
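
To make the signature idea concrete, here is a minimal Python sketch of detecting changes to a yaml file with an HMAC. It is an assumption-laden illustration, not a description of how Drone signs or verifies anything: `sign_yaml`, `verify_yaml`, and the shared `secret` are hypothetical.

```python
import hashlib
import hmac


def sign_yaml(yaml_bytes: bytes, secret: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over the exact yaml contents."""
    return hmac.new(secret, yaml_bytes, hashlib.sha256).hexdigest()


def verify_yaml(yaml_bytes: bytes, signature: str, secret: bytes) -> bool:
    """True only if the signature matches the yaml as it is right now."""
    expected = sign_yaml(yaml_bytes, secret)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, signature)


secret = b"repo-specific-secret"  # hypothetical; held by the server, not the repo
original = b"kind: pipeline\nname: backend\n"
signature = sign_yaml(original, secret)

# Any edit to the yaml, such as removing a gate, invalidates the signature.
tampered = original + b"# gate removed\n"
ok_original = verify_yaml(original, signature, secret)
ok_tampered = verify_yaml(tampered, signature, secret)
```

A build whose yaml fails verification could then be blocked pending approval rather than run automatically.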

ConradKurth commented 6 years ago

@bradrydzewski Would this feature be similar to what circle ci has in terms of manual approval buttons? https://circleci.com/blog/manual-job-approval-and-scheduled-workflow-runs/

bradrydzewski commented 6 years ago

I'm not really familiar with Circle, but Drone has the ability to block a build pending approval. The goal is to extend this capability so that Drone can pause mid-pipeline (e.g. before a deployment) to request approval.

[Screenshot, 2017-03-19]

ConradKurth commented 6 years ago

Ahh cool. The mid-pipeline gate is what Circle has the ability to do. How do you add the ability to decline or approve a build? I didn't see that in the docs.

ConradKurth commented 6 years ago

@bradrydzewski Did you see this?

bradrydzewski commented 6 years ago

You mark the build as "protected", which currently only requires approval when the yaml changes. This will be expanded when 0.9 is released to use an API / webhook to optionally determine whether or not approval is required, and if so, will block the build.

bradrydzewski commented 6 years ago

I wanted to provide an update since multi-machine builds are on my mind. Drone now has the ability to run multiple pipelines, and is able to pause in between pipeline executions. This means we will be able to support gating in-between pipelines. For technical reasons I do not envision Drone being able to gate individual pipeline steps -- such a technical change does not look feasible based on how Drone is architected -- and re-architecting Drone to support multi-machine pipelines has me pretty burned out. So, I think we'll have to get creative and make this work.

Here is an example multi-machine pipeline (this is a working example) with an added field that indicates approval is required:

name: frontend
kind: pipeline

steps:
  - name: build
    image: node
    commands:
    - npm install
    - npm test

---
name: backend
kind: pipeline

steps:
  - name: build
    image: golang
    commands:
    - go build
    - go test

services:
- name: redis
  image: redis

---
kind: gate
name: approval

depends_on:
  - frontend
  - backend

---
kind: pipeline
name: after

steps:
  - name: deploy
    image: plugins/kubernetes
      ...

depends_on:
  - approval

The approval workflow is going to be delegated to plugins. Drone is not going to try to create a generic workflow tool. Instead we will provide a defined set of APIs and let teams build the approval workflows of their dreams, and easily integrate with Drone :)
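
The proposed yaml above describes a small dependency graph in which the gate is just another node. The Python sketch below shows one way a scheduler could decide what to run and what to hold for approval. It is a sketch only: `STAGES` mirrors the proposal, while `satisfied` and `next_actions` are hypothetical helpers rather than Drone APIs.

```python
from typing import Dict, List, Set, Tuple

# Mirrors the proposed yaml: two pipelines, a gate, and a deploy pipeline.
STAGES: Dict[str, dict] = {
    "frontend": {"kind": "pipeline", "depends_on": []},
    "backend": {"kind": "pipeline", "depends_on": []},
    "approval": {"kind": "gate", "depends_on": ["frontend", "backend"]},
    "after": {"kind": "pipeline", "depends_on": ["approval"]},
}


def satisfied(name: str, finished: Set[str], approved: Set[str]) -> bool:
    """A pipeline is satisfied once it has finished; a gate once it is approved."""
    if STAGES[name]["kind"] == "gate":
        return name in approved
    return name in finished


def next_actions(finished: Set[str], approved: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (pipelines ready to run, gates waiting for approval)."""
    to_run: List[str] = []
    waiting: List[str] = []
    for name, stage in STAGES.items():
        if satisfied(name, finished, approved):
            continue
        if not all(satisfied(dep, finished, approved) for dep in stage["depends_on"]):
            continue
        if stage["kind"] == "gate":
            waiting.append(name)
        else:
            to_run.append(name)
    return to_run, waiting
```

With both pipelines finished, `next_actions({"frontend", "backend"}, set())` returns `([], ["approval"])`; once the gate is approved, the "after" pipeline becomes runnable. Nothing runs while the gate waits, so no agent has to hold a paused workspace.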

mellena1 commented 6 years ago

@bradrydzewski Is the above syntax for approving between pipelines a 0.9 feature? Thanks!

bradrydzewski commented 6 years ago

@mellena1 the syntax is just a proposal. It is out of scope for 1.0 but could land in a future 1.x

madsonic commented 5 years ago

any progress on this? would love to have this feature

davidchua commented 5 years ago

+1 for the mid-step gate.

Is this feature currently worked on?

tboerger commented 5 years ago

As it's added to a generic milestone named 1.x.x you can see it's not worked on.

bradrydzewski commented 5 years ago

Also note that Drone will support gating between pipeline executions when multiple pipelines are defined in the yaml as seen in my previous comment. Support for mid-step gating has been deemed out-of-scope for this project.

jrm2k6 commented 4 years ago

@bradrydzewski Not sure if it is the right place, but are there any updates to share about this? I checked the docs but didn't find anything related to this, and as it is still an open PR, I am wondering if it is something that is being worked on or fell off the roadmap.

jrm2k6 commented 4 years ago

@bradrydzewski or anyone, just trying to get an update about this feature. Thanks!

bradrydzewski commented 4 years ago

@jrm2k6 this is not something we are actively working on. As mentioned above, Drone is not architected in a manner that would allow gating in-between pipeline steps which means this feature is out-of-scope for the project. Drone is architected in a manner that would allow gating in-between pipeline stages (e.g. multi-pipeline builds) and this is something we would like to provide in the future, but it has not been committed to our roadmap. In the meantime, many people find gating overlaps with our existing Promotions feature, so you may want to research promotions for your use case as a workaround.

jrm2k6 commented 4 years ago

@bradrydzewski Thanks for the answer. Promotions is what we are investigating.

bradrydzewski commented 4 years ago

@jrm2k6 if you have any questions about promotions or best practices, shoot a note in the public chatroom and we can try to help steer you in the right direction.

harness / gitness

[Feature] Extend gates as a pipeline "step" feature #2126