Deployment gating - Githubissues

jku commented 4 months ago

Currently there is no (human) deployment gating: if the tests pass, a new repository is published whenever there are changes whether those changes are timestamp updates or more meaningful changes.

Some questions to answer (for both staging and production):

Do we want gating?
For all changes? Or maybe only for changes that cover something else than timestamp.json alone?
is GitHub release environment (with reviewers) an appropriate choice?

haydentherapper commented 4 months ago

Assuming our tests are sufficient, and we test at least one newer Sigstore client and Cosign, no gating long-term seems reasonable.

In the short-term, while we're gaining confidence over the process, I would gate for root updates only. Timestamp updates seem safe enough to automatically deploy.

Separate question, do we expect target file updates outside of root updates? Does tuf-on-ci allow for that? Previously, these would always happen at the same time, given that we had to gather root key holders. If we do expect that, maybe we also gate on target file updates, so that we can confirm that the target file works in whatever context it's used in.

jku commented 4 months ago

Separate question, do we expect target file updates outside of root updates? Does tuf-on-ci allow for that?

Yes this is allowed. Multiple roles metadata can be changed in one signing event but that is not guaranteed in any way.

(a future feature request is likely an easy way to force role resigning: currently you'll get separate signing events when a role is about to expire -- it would make sense if maintainers could easily decide that "let's resign targets now as well since we're all signing root already") https://github.com/theupdateframework/tuf-on-ci/issues/198

jku commented 4 months ago

I've got a POC running in https://github.com/jku/tuf-on-ci-sigstore-test:

My original idea was to use gcloud storage rsync --dry-run to find out if the deployment can be automatic or not: unfortunately the gcloud storage "API" is not great: We'll need to use an actual API if we want to do this without downloading the files. So current implementation downloads all current GCS content to do the comparison locally (like prod root-signing does)
The POC does not actually use the environment even though it is defined: This is likely because deploy-to-gcs.yml is a reusable workflow which does not have its own "workflow identity": so it can't use an environment either. I will have to refactor the deployment to two separate pieces:
1. deploy-to-gcs-stage-1 (figures out if this is a timestamp change only, deploys timestamp if yes)
2. deploy-to-gcs-stage-2 (only runs if this is not only a timestamp change, is gated behind deploy env, deploys full repo)

jku commented 4 months ago

I will have to refactor the deployment to two separate pieces:

this did not work either: environments cannot be used with a "uses:" job.

There are now two working approaches:

68 runs "both stages" in publish.yml itself: this makes publish.yml a bit long but otherwise feels nice
The POC in https://github.com/jku/tuf-on-ci-sigstore-test uses an internal action that is called twice from publish.yml: I was hoping it would simplify and allow code reuse but I'm not convinced it did

jku commented 3 months ago

Documenting the state of things:

The draft PR is functional
there has been quite a lot of discussion and it is still not clear if adding deployment gating is worth the complexity: with the improved automated client testing there does not seem to be much reason for it
- just about the only thing I can think of is issues in test workflows: test workflows could be buggy and could result in green checkmarks even though something failed
I am still willing to implement this if there are strong opinions ...

jku commented 2 months ago

Copying some content from the draft PR before I close it :

The flow currently looks like this:

graph TD;
    merge{"signing event<br/>(merges to 'main')"}-->online-sign;
    online-sign-period[online role in signing period]-->online-sign["online signing<br/>(merges to 'main' and 'publish')"];
    online-sign-->publish-pages[publish to Pages];
    publish-pages-->test-pages[test Pages with clients];
    test-pages--this is the critical point where we can add manual review-->publish-gcs[publish to GCS];
    publish-gcs-->test-gcs[test GCS with clients];

The takeaways on branches and deployments are

both main and publish branch merges as well as pages deployment are done before tests
automated (pages) tests prevent deployment to GCS if tests fail
GCS publish is also tested but that is informational: the tests do not prevent anything from happening

So the question is this: will maintainers do additional manual testing using the Pages-published repository before publish to GCS if we give them the chance?

Potential human review flow

Rhombus is used to signify human interaction in the diagram
The downside of this design is some additional complexity in the publishing workflow
The draft pr implements this

graph TD;
    merge{"signing event<br/>(merges to 'main')"}-->online-sign;
    online-sign-period[online role in signing period]-->online-sign["online signing<br/>(merges to 'main' and 'publish')"];
    online-sign-->publish-pages[publish to Pages];
    publish-pages-->test-pages[test Pages with clients];
    test-pages--if only timestamp changes-->publish-gcs-light[publish timestamp to GCS];
    test-pages--if additional metadata or targets changes-->deployment{Deployment review<br/>or delay};
    deployment-->publish-gcs-full[publish full repository to GCS];
    publish-gcs-light-->test-gcs[test GCS with clients];    
    publish-gcs-full-->test-gcs[test GCS with clients];

sigstore / root-signing-staging

Deployment gating #54

68 runs "both stages" in publish.yml itself: this makes publish.yml a bit long but otherwise feels nice

The flow currently looks like this:

Potential human review flow