telia-oss / github-pr-resource

Github pull request resource for Concourse
MIT License
182 stars 6 forks source link

Out-of-order commits on different PRs result in skipped builds #26

Open seanmailander-orange opened 6 years ago

seanmailander-orange commented 6 years ago

We've been evaluating this resource, and have observed skipped CI builds "randomly"

Broadly, I'm looking for a path to more robust handling of potentially out-of-order commit timestamps

Our pipeline is fairly standard, and triggered via webhooks

groups: []
resources:
- name: pull-request
  webhook_token: ((token))
  type: pull-request
  source:
    access_token: ((github-token))
    repository: [some-repo]
  check_every: 24h
resource_types:
- name: pull-request
  type: docker-image
  source:
    repository: itsdalmo/github-pr-resource
...
[snip]
...
jobs:
- name: build
  plan:
  - get: [our-repository]
    trigger: true
    resource: pull-request
    version: every
...
[snip]

After further investigation, I've managed to reproduce reliably with the following workflow in a test repository (commits with timestamps separated by 1 second):

git checkout pr-test-1 && git commit --allow-empty -m "trigger build" && git checkout pr-test-2 && git commit --allow-empty -m "trigger build" && git push origin pr-test-2 && git push origin pr-test-1

Also reproduces the "skipped build", when the order is reversed (so nothing special about a particular branch):

git checkout pr-test-2 && git commit --allow-empty -m "trigger build" && git checkout pr-test-1 && git commit --allow-empty -m "trigger build" && git push origin pr-test-1 && git push origin pr-test-2

What does not reproduce (commits with timestamp to the same second):

git checkout pr-test-1 && git commit --allow-empty -m "trigger build" && git checkout pr-test-2 && git commit --allow-empty -m "trigger build" && git push origin pr-test-1 pr-test-2

I'm curious about the logic on https://github.com/itsdalmo/github-pr-resource/blob/master/check.go#L38, and wondering if the Github GraphQL is potentially mangling the commit timestamps, or perhaps does not guarantee the commits to appear across PRs in the same moment? I.e. perhaps when the GraphQL API is queried from the first webhook-initiated check, PR-2 contains the latest commit, but PR-1 does not (perhaps not yet cache-invalidated/re-indexed/refreshed) Immediately afterwards, when the second webhook-initiated check queries the GraphQL API, the missing commit may be backfilled into PR-1, but the timestamp check on L38 will exclude it....

...I'm really scratching my head as to what is going on. I'll continue investigation, with a forked version of this resource with some debugging to get more visibility into what is actually being retrieved from the API.

seanmailander-orange commented 6 years ago

With some hack-n-slash debugging, managed to capture the following debug statement

2018/06/07 21:19:29 check failed: skipping PR: 1488 953b8f2b9622a34d4bec2e75aab663f414ff30ab 2018-06-07 21:18:57 +0000 UTC < 2018-06-07 21:18:57 +0000 UTC

From the following code

        // Filter out commits that are too old.
        if !p.Tip.CommittedDate.Time.After(request.Version.CommittedDate) {
                // Look for pr-test-1 and pr-test-2, and throw an error if we ever skip them
            if strconv.Itoa(p.Number) == "1488" || strconv.Itoa(p.Number) == "1489" {
                return nil, fmt.Errorf("skipping PR: %v %v %v < %v", strconv.Itoa(p.Number), p.Tip.OID, p.Tip.CommittedDate.Time.String(), request.Version.CommittedDate.String())
            } 
            continue
        }
itsdalmo commented 6 years ago

Thanks for raising the issue and all the debugging you have done! This issue is slightly related to https://github.com/itsdalmo/github-pr-resource/issues/22, because I had to switch to using CommittedDate instead of PushedDate.

With that in mind, I think GraphQL and check are behaving as expected in this case:

As you say, the first push2 triggers and finds commit2 which was committed after commit1, so when push1 triggers another check commit1 will be filtered out. In this case of commit1 => commit2 => push it will discover both commits because the version passed to check will be older than both of your commits.

Obviously this is not an ideal solution, but I've struggled to find an alternative in the issue I linked above. Perhaps the solution is to just skip filtering on timestamps all together, and output versions that have been discovered before, and just let Concourse figure out which ones are new? (I believe this is how the old github-pullrequest-resource does it).

Thoughts?

seanmailander-orange commented 6 years ago

Agreed, the evidence I'm seeing all points to behaves-as-expected given the timestamps and order of operations

For our use case, we would be fine with latest-commit-for-each-PR as the array of versions, without any timestamp filtering...just giving that a test now.

Do you know of a good resource that would explain how Concourse would handle this? The idea of "multiple-independent-versions-in-one-resource" with the version: every qualifier is a bit of a hack, right? Does it just work regardless?

seanmailander-orange commented 6 years ago

Ah, I think I understand now how this could work with out-of-order commits

Add an extra field to each version object:

1.1 Add alreadySeen: a string-encoded array of key-value pairs PR#:committedDate

Change the sort behavior

2.1 For any PRs which have a newer commitDate than the one in alreadySeen, they are sorted to the top, ie "above-the-fold"

2.2 For any PRs which have the same (or older? someone doing destructive git?) committedDate than the one in alreadySeen, they are sorted to the bottom, ie "below-the-fold"

Change the version object

3.1 The above-the-fold versions are all assigned a new alreadySeen, generated from the current array of PR#:committedDate

3.2 The below-the-fold versions are all assigned the same alreadySeen as passed in on the version object

Thoughts

2.1 will ensure Concourse treats these out-of-order commits as 'newer' versions than the already-built versions. This gets around the issue of version: every never going backwards once a newer version has a scheduled build.

3.2 will ensure Concourse does not attempt to rebuild the same commit for a PR that already has scheduled builds

For the example: commit1 => commit2 => push2 => push1

Output at push2: [ commit2, commit1] Commit 2 would be a 'new' version and would be built

Output at push1: [ commit1, commit2] Commit 1 would be a 'new' version and would be built Commit 2 would be unchanged, and would not trigger any jobs

Thoughts?

seanmailander-orange commented 6 years ago

Testing with a release candidate of #28, seeing every commit trigger a build regardless of order of submission. Concourse looks happy, even though the alreadyseen field is pretty gnarly in the UI

image

itsdalmo commented 6 years ago

Thanks for filing the issue, finding a solution and submitting a PR! 👏

That being said, this all feels like a very hacky solution to me since we essentially end up in a situation where every version also includes a copy of all other versions (minus the commit SHA), which is something that is supposed to be handled by Concourse itself.

I'm leaning towards removing committed and always emitting the latest commit from all open PR's, which is what I believe the old resource did. Seems like this is the "correct" (still hacky, but only until resources v2 and spaces go live) solution according to the Concourse documentation:

If your resource is unable to determine which versions are newer then the given version (e.g. if it's a git commit that was push -fed over), then the current version of your resource should be returned (i.e. the new HEAD).

seanmailander-orange commented 6 years ago

No worries, happy to contribute!

If the resource always emits the latest commit for all PRs, what sort order is chosen?

This is not a quote, but my understanding of how resource versions are treated

When Concourse sees a "version" first in the list, for which it has already scheduled jobs, it wont schedule new jobs for any new/previously-unseen "version" lower in the sort order

Does that sound right?

If so, it implies to me that sorting needs to occur on something other than PR Number, commit hash, committed date, or any other metadata that is not semantically equivalent to "has this PR+commit already been seen by Concourse"

bhcleek commented 5 years ago

I'm really glad to see this is a known issue. I am contemplating migrating to this resource from jtarchie/pr-resource, but had this concern as we had this same problem with jtarchie/pr-resource. We ended up solving it by running a fork with a modified check implementation.

The modified check implementation simply puts all PRs except the one that was input to check after the the input version; there's no need to worry about the order, because concourse evaluates all versions that are provided after the input version. It seems to skip revisions that it's already built, too.

peikk0 commented 5 years ago

I'm currently evaluating this resource as well and I think I've experienced an issue related to this one. My test repository has a few old branches, I opened a PR for one of the most recent one, the webhook succeeded in triggering the resource, all was well. Then I opened a second PR for an older branch and then nothing happened, the resource would not see it. I had to rebase/amend and force-push that branch, updating the commit date, to finally make the resource pick it up.

arwineap commented 5 years ago

I'm leaning towards removing committed and always emitting the latest commit from all open PR's, which is what I believe the old resource did.

Just want to clarify that a similar issue existed on the old resource

leshik commented 5 years ago

I'm seeing the same issue but I'm not sure how to debug it, nor if it's related to out of order commits. Developers work in their forks, and PRs in our repo are almost always have just one commit. In case changes to a PR requested, the developer just force-pushes his branch. Maybe the issue arises somewhere between checks when a developer force-pushes.

insider89 commented 5 years ago

Have the same issue. Do you have any plans to fix this or maybe someone finds the workaround?

itsdalmo commented 5 years ago

@insider89 - I think the following is the only workaround:

I'm leaning towards removing committed and always emitting the latest commit from all open PR's, which is what I believe the old resource did.

However, I have not given it much thought since I made that comment; I've sort of been holding out to see how Resources v2 and spaces would affect this before making any changes.

mplzik commented 5 years ago

Hello; we use concourse to build a medium sized project and are seeing skipped PR checks. We did not manage to measure the impact of this, but it's definitely noticeable -- is there any way to workaround this? From what I've seen in #22, would be "using pushDate, if it exist; otherwise resorting to committedDate" an option?

eedwards-sk commented 5 years ago

I'm currently seeing skipped PRs as well, and suspect maybe it's related to this. It's a little unclear on the primary documentation whether or not all open PRs are supposed to be surfaced. That's not what we're seeing and it's very confusing to developers.

troykinsella commented 5 years ago

I'm seeing this issue, which seems related: Ref 'A' is pushed. The developer rebases, which puts ref 'B' above ref 'A' in the git log, and then force-pushes. This resource does not trigger a build of ref 'B'; ref 'A' remains at the head of the resource version list. The workaround I'm using, which is reliable but inconvenient, is to push an empty commit that generates a new ref that is newer than 'A' and 'B'.

The effect of this, though, is confusing to the developer. They can see that the pr job had built their PR # when looking at Concourse builds. In the GitHub PR view, the last ref prior to the force push shows a green check beside it, whereby Concourse successfully submitted check status for it. But down in the check summary at the bottom of the PR view, it appears that it is awaiting a Concourse status check, which never completes. The root of the problem is only noticed when examining commit refs more closely, which is a distraction to a normal workflow.

mplzik commented 5 years ago

@itsdalmo note that this issue manifests quite often at our team, basically forcing us to re-trigger PR checks quite often (on daily to weekly basis) by empty commits/force pushes. There's also some data that e.g. DigitalOcean noticed the issue.

The "Spaces" might not happen in the near future and although Resources v2 sound like a way to properly tackle this issue, it still might take some time to land in prod release.

Would there be any reasonable way to work around this for the time being by e.g. applying DO's fix?

rhoughton-pivot commented 4 years ago

@itsdalmo Is there any update on this issue? Concourse v5.8.0 has come out with the first precursor work on {{spaces}} but we're a long way from that being production ready.

jmccann commented 4 years ago

Would like thoughts on the idea of adding to the source a new optional status_context of a commit status to check on.

{
  "source": {
    "repostiroy": "git://some-uri",
    "owner": "develop",
    "status_context": "concourse-ci/build"
  },
  ...
}

What I'm thinking is when you run test/build on a PR/commit you usually push back a status on the commit to indicate its pending/success/failure. So instead of filtering out commits based on date we filter out commits based on them already having a status/context set on them.

I'm seeing that when a commit doesn't have a matching context name for a status on the commit it returns null in the query results:

                        "status": {
                          "context": null
                        }

Otherwise you get some values:

                        "status": {
                          "context": {
                            "context": "concourse-ci/build",
                            "state": "SUCCESS"
                          }
                        }

We would need to update the graphql query to include returning the context for a commit. Not sure how much more costly that makes the query.

I'll play with this idea ... but let me know any initial thoughts you may have on it.

jmccann commented 4 years ago

I've opened PR #189 with my stab at implementing my above comment. Let me know your thoughts!