openSUSE / open-build-service

Build and distribute Linux packages from sources in an automatic, consistent and reproducible way #obs
https://openbuildservice.org
GNU General Public License v2.0

OBS "pull_request" workflow succeeds only in 3rd attempt #12661

Open mwilck opened 2 years ago

mwilck commented 2 years ago

Issue Description

I use an OBS pull_request workflow triggered by a GitHub webhook for CI. This worked only after re-triggering the webhook twice from the GitHub web UI (i.e. the 3rd invocation of the workflow worked). The original delivery (created automatically by GitHub when the PR was updated) failed with error 400 (service in progress). The 2nd one (manual redelivery) succeeded, but created a bogus branch package. The 3rd one (another redelivery) went as expected.

Expected Result

Workflow succeeds the first time.

How to Reproduce

This happened to me 2 times. I am not sure if it's always reproducible.

  1. I use this workflows.yml on my master branch

  2. I deleted the target project. In my experience this is necessary if anything went wrong or must be fixed wrt the repository setup of the target project

  3. I pushed to https://github.com/openSUSE/suse-module-tools/pull/60, triggering the webhook ce58aa50-e249-11ec-9a5b-9335ee20e1aa at 2022-06-02 09:58:38 (workflow 43277), which failed with error code 400

        <status code="400" origin="backend">
          <summary>service in progress</summary>
        </status>
  4. I deleted the target project again, waited a while, and redelivered the same webhook from the GH UI. This time workflow 43280 succeeded, but it created a bogus branch package containing just a single file _branch_request:

    osc -A obs ls  home:mwilck:openSUSE:suse-module-tools:PR-60 suse-module-tools
    _branch_request
    osc -A obs cat  home:mwilck:openSUSE:suse-module-tools:PR-60 suse-module-tools _branch_request
    {"action":"opened","pull_request":{"head":{"repo":{"full_name":"openSUSE/suse-module-tools"},"sha":"c8a42cd2e45c68c68632f0ffe1f4175f7e65ec51"}}}
  5. I deleted the target project once more, waited, and redelivered the webhook again. Workflow 43283 succeeded and successfully branched and compiled my package
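
The bogus state from step 4 can be recognized mechanically. The following is a minimal sketch, not part of OBS or osc: it assumes the file list comes from the output of `osc ls` and the JSON from `osc cat`, as shown above. The function names `is_bogus_branch_package` and `branch_request_sha` are hypothetical.

```python
import json


def is_bogus_branch_package(file_names):
    """Return True if the package contains only the _branch_request marker.

    file_names is a list of file names, e.g. the lines printed by `osc ls`.
    """
    names = {name.strip() for name in file_names if name.strip()}
    return names == {"_branch_request"}


def branch_request_sha(content):
    """Extract the pull request head commit from a _branch_request file.

    The JSON layout is taken from the single example in this report;
    other fields may exist in practice.
    """
    data = json.loads(content)
    return data["pull_request"]["head"]["sha"]
```

A package that lists only `_branch_request` is a candidate for the delete-and-redeliver procedure; the extracted SHA shows which commit the failed branch was meant to build.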

mwilck commented 2 years ago

I've merged my PR, which unfortunately deleted the OBS project. But the status was good there, anyway.

dmarcoux commented 2 years ago

Are workflows sometimes flickering?

perlpunk commented 4 weeks ago

I just wanted to mention that we regularly see the same problem (except that the HTTP status is usually 504 when it fails). We then also delete the project and retrigger the webhook. That's a bit cumbersome, especially since we don't yet have the actual URL to the OBS project to click on.

hennevogel commented 4 weeks ago

@perlpunk what exactly is happening? There is no guaranteed delivery of the webhooks sent to OBS or of the notifications OBS sends to GitHub. Things can be down, broken or otherwise in bad shape. In that case the solution is: manual retrigger.

So what exactly happens? And where, when etc.? Please provide some details. TIA

perlpunk commented 3 weeks ago

We regularly (on average maybe once a week) have the problem that for a pull request like this https://github.com/os-autoinst/openQA/pull/5878 we don't see the OBS SCM integration finishing. Usually the OBS branched project is created, but almost empty except for a _branch_request file or so. The problem is, we don't see it immediately. A pull request can hang around for a while, and after 2 days, the webhook request is expired from the log. We just see that not all required checks are there, and then our manual work starts.

In the webhook delivery log we usually see a 504 response from OBS, with an empty header and body.

Now, if the delivery is still in the webhook logs, we can go there and retrigger it, but that alone is already annoying, and apparently there can also be 504 failures that still result in a successful OBS build, so we need to find out which hook delivery we want to retrigger. If it's no longer there, apparently a force push (e.g. after a rebase) will work. But first we actually need to delete the broken project from OBS, and for that we need to find out its URL, which we don't have because the OBS status check isn't reported. (But maybe I'm wrong here and we don't have to delete that project first?) That's all a lot of work.
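
To illustrate the "which delivery do we retrigger" step, here is a rough sketch against the GitHub REST API for repository webhook deliveries. It assumes the delivery objects were already fetched from `GET /repos/{owner}/{repo}/hooks/{hook_id}/deliveries` and that redelivered attempts share the `guid` of the original. The helper names are hypothetical. It cannot detect the case where a 504 delivery still led to a successful OBS build.

```python
import urllib.request


def deliveries_to_retry(deliveries):
    """Return ids of the latest attempt for each event that never got a 2xx.

    deliveries: list of dicts with at least "id", "guid", "status_code"
    and "delivered_at" (ISO 8601 string), as returned by the list endpoint.
    """
    by_guid = {}
    for delivery in deliveries:
        by_guid.setdefault(delivery["guid"], []).append(delivery)

    retry_ids = []
    for attempts in by_guid.values():
        # Skip events where any attempt (original or redelivery) succeeded.
        if any(200 <= a["status_code"] < 300 for a in attempts):
            continue
        latest = max(attempts, key=lambda a: a["delivered_at"])
        retry_ids.append(latest["id"])
    return sorted(retry_ids)


def redelivery_request(owner, repo, hook_id, delivery_id, token):
    """Build (but do not send) the POST request that asks GitHub to redeliver."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/hooks/{hook_id}/deliveries/{delivery_id}/attempts"
    )
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, method="POST", headers=headers)
```

Passing the returned request to `urllib.request.urlopen` would trigger the redelivery; whether the broken OBS project has to be deleted first is still the open question above.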

I can understand that things can fail, but I don't see such things happen for other services, and I'm just wondering if there is something to make it easier to retrigger something without having to remember a lot of manual steps. And experience shows that people don't just do it, it's usually only one or two persons who realize there's something missing and then do something about it.

We are currently trying several things. One is a simple GitHub Action that gives us the URL to the OBS project, so we can just click it, delete the project, and then retrigger. The other is adding a workflow that regularly resends failed webhook deliveries.
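
For the URL part, a minimal sketch of how the link could be derived. The project naming is inferred only from the single example earlier in this issue (`home:mwilck:openSUSE:suse-module-tools:PR-60`), and both the base URL `https://build.opensuse.org` and the `/project/show/` path are assumptions about the OBS instance in use. The function names are hypothetical.

```python
from urllib.parse import quote


def obs_pr_project(target_project, repo_full_name, pr_number):
    """Guess the OBS project name created for a pull request.

    Assumed pattern: <target_project>:<owner>:<repo>:PR-<number>.
    """
    owner, repo = repo_full_name.split("/", 1)
    return f"{target_project}:{owner}:{repo}:PR-{pr_number}"


def obs_project_url(project, base="https://build.opensuse.org"):
    """Build a web UI link for the project (base URL is an assumption)."""
    return f"{base}/project/show/{quote(project, safe=':')}"
```

With the values from the original report, `obs_pr_project("home:mwilck", "openSUSE/suse-module-tools", 60)` gives the project name shown in the `osc ls` call above.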

We are tracking this here: https://progress.opensuse.org/issues/165144

If there is no easy solution on the OBS side, ok. I just wanted to note it here and ask if there are hints what to do about it.

perlpunk commented 3 weeks ago

Also, we have several OBS webhooks configured, and anyone who isn't yet familiar with the setup has no idea which of the webhooks they need to select (because GitHub offers no way of documenting them, such as giving them a label).