Add support for allowing skipped checks in `has_successful_status`

iainlane commented 5 months ago

The has_successful_status predicate currently requires predicates to be present and successful in order to pass. But workflows can be run conditionally - for example only if certain paths change - and it is currently not very convenient to write a policy which considers such skipped workflows as passing. The only feasible workaround is to duplicate the path filters in the policy, and this quickly gets unwieldy and prone to getting out-of-sync in large repositories.

Here we add direct support for specifying such rules. This is done by introducing a new alernative form that has_successful_status predicates can take:

has_successful_status:
  options:
    skipped_is_success: true
  statuses:
    - "status 1"
    - "status 2"

In this mode, we will consider the skipped result as acceptable. The current form:

has_successful_status:
  - "status 1"
  - "status 2"

remains supported. We have done this by implementing a custom unmarshaling function to be able to handle both forms.

Closes: #760

palantirtech commented 5 months ago

Thanks for your interest in palantir/policy-bot, @iainlane! Before we can accept your pull request, you need to sign our contributor license agreement - just visit https://cla.palantir.com/ and follow the instructions. Once you sign, I'll automatically update this pull request.

iainlane commented 5 months ago

I'll square off the CLA question internally as soon as I can, but in the meantime hopefully we can proceed with reviews 😁

bluekeyes commented 5 months ago

Thanks for this proposal! Looking at an actual implementation, I'm now thinking this could work better as a new predicate that deprecates has_successful_status. This new has_status predicate would let you specify both the contexts and the allowed conclusions:

# Option 1: same conclusions for all contexts
has_status:
  conclusions: ["success", "skipped"]
  contexts:
    - "status 1"
    - "status 2"

# Option 2: conclusions per context
has_status:
  "status 1": ["success"]
  "status 2": ["success", "skipped"]

Internally, we can convert the has_successful_status predicate in to this new one to avoid duplicating logic. The new predicate also avoids having two forms of a single predicate and allows policies that can work with failure, neutral, and other conclusions. What do you think?

iainlane commented 5 months ago

Cheers for the review 👍

Thanks for this proposal! Looking at an actual implementation, I'm now thinking this could work better as a new predicate that deprecates has_successful_status.

Sure, that makes sense. It was a bit awkward that we had two forms in one predicate.

Take a look at what I've got now? It's fundamentally the same as what I had before, except now:-

has_successful_status is kept separate
Any conclusion can be given, instead of only allowing skipped as we did before

We've still got a custom unmarshal function here, which handles unpacking either form into the same struct so they can share the implementation.

A bit more stuff has appeared to handle the list of statuses but it's not too bad.

# Option 1: same conclusions for all contexts
has_status:
  conclusions: ["success", "skipped"]
  contexts:
    - "status 1"
    - "status 2"

# Option 2: conclusions per context
has_status:
  "status 1": ["success"]
  "status 2": ["success", "skipped"]

I went for the first one. You can express the second in terms of it as far as I can see, and it feels to me like it'll be more common to want to apply the same conclusions to a whole predicate than select them individually. But I don't feel that strongly about this - it can be changed to the other form easily enough.

I've used statuses rather than contexts. That's only because I've not used that word in the context of GitHub CI and I found it bounced off me a bit. But it also doesn't matter too much.

iainlane commented 4 months ago

I was just testing this in a real deployment, instead of using the testsuite... and this doesn't work like I thought it did.

Given this workflow

on:
  pull_request:
    paths:
      - a/**

jobs:
   # some jobs here

There is nothing reported when the job isn't run due to the path filter not being matched. Or at least nothing that I can find.

I think the PR is still useful for conditional jobs, but not conditional workflows, unless there is a way we can see those.

bluekeyes commented 4 months ago

There is nothing reported when the job isn't run due to the path filter not being matched. Or at least nothing that I can find.

Yeah, that sounds right. If a workflow trigger doesn't match, GitHub skips doing anything with the workflow. Posting skipped statuses for the jobs probably requires doing at least some evaluation to determine what jobs exist.

It's pretty difficult to account for things like this, since all statuses start out as missing until the reporting system responds to a webhook and posts the initial value. When Policy Bot evaluates a PR, there's no way that I know of to distinguish between a status that will never appear and one that will appear, but just hasn't appeared yet.

If you're able to share more details about what you want to achieve with your policy (maybe on the original issue), I can try to think about alternate approaches that might work.

iainlane commented 4 months ago

Looks good, thanks for iterating on this with me. Let me know when you have the CLA sorted out (or if you have any questions about it) and then we can merge this.

I got the reply from our legal folks earlier. It's good to go and I've signed it.

Thanks - if you could hit the button for me, please. 👍

I'll get to the other PRs and issues tomorrow now I shouldn't be blocking from this side any more.

stevoland commented 4 months ago

Thanks so much to both of you, excellent stuff!

bluekeyes commented 4 months ago

@iainlane while looking at #807, I realized we may have a bug related to this: the handler for the check_run event, and probably the handler for the status event, exits early when the conclusion of the incoming status is not success. This was an optimization to make fewer requests, but it means we could miss evaluating a policy that requires some other status if there are no other triggering events.

Unfortunately, this could complicate the request optimizations you might be looking at. While we have the "trigger" concept for policies that deal with statuses, it doesn't include the different conclusions, so we may needlessly evaluate policies on pending events when they could only be satisfied by successes.

iainlane commented 4 months ago

@iainlane while looking at #807, I realized we may have a bug related to this: the handler for the check_run event, and probably the handler for the status event, exits early when the conclusion of the incoming status is not success. This was an optimization to make fewer requests, but it means we could miss evaluating a policy that requires some other status if there are no other triggering events.

Unfortunately, this could complicate the request optimizations you might be looking at. While we have the "trigger" concept for policies that deal with statuses, it doesn't include the different conclusions, so we may needlessly evaluate policies on pending events when they could only be satisfied by successes.

Ah, yeah, hmm. Good spot.

Would it be acceptable to look for status being completed instead? So we'd only react when runs are finished, but now if they finished with any conclusion.

Or would we be looking at figuring out to get at the set of interesting conclusions in the handler? Feels tricky.

bluekeyes commented 4 months ago

Yeah, I think restricting has_status to completed conclusions and then looking for that would be reasonable. If someone has a compelling reason to write a policy that matches on pending statuses in the future, we can revisit at that point.

palantir / policy-bot

Add support for allowing skipped checks in `has_successful_status` #789