screwdriver-cd / screwdriver

An open source build platform designed for continuous delivery.
http://screwdriver.cd

Remote-join job doesn't wait on downstream jobs during restart #2981

Open jithine opened 6 months ago

jithine commented 6 months ago

What happened:

When an upstream job of a remote-join job is restarted, the remote-join job runs immediately, reusing build statuses from the previous event.

https://cd.screwdriver.cd/pipelines/12907/events/784132

[Screenshot of the restarted event's workflow graph, 2024-01-11 3:33 PM]

Job first was restarted from the previous event. In that previous event, the remote join in job second worked as expected; however, in the new event, job second ran immediately without waiting for any of the downstream jobs to be triggered or executed.

What you expected to happen:

Job second in the restarted event should wait for the statuses of the newly triggered remote downstream jobs.

How to reproduce it:

jithine commented 6 months ago

FYI @kumada626 @y-oksaku

y-oksaku commented 5 months ago

This behavior occurs irrespective of the remote trigger. When the first build is restarted, the z build does not wait for the y-third build: at the moment x-second triggers z, the latest y-third build is the one from the parent event.


jobs:
  first:
    requires: [ ~commit ]
    steps:
      - echo: echo 'no wait'
  x-second:
    requires: [ first ]
    steps:
      - echo: echo 'no wait'
  y-second:
    requires: [ first ]
    steps:
      - wait: sleep 30
  y-third:
    requires: [ y-second ]
    steps:
      - echo: echo 'no wait'
  z:
    requires: [ x-second, y-third ]
    steps:
      - echo: echo 'no wait'

In your case, at the moment the second build was triggered, the downstream builds did not yet exist.

Upstream builds:

* [first](https://cd.screwdriver.cd/pipelines/12907/builds/917364/steps/sd-teardown-screwdriver-cache-bookend) completed at 23:31:36

* [second](https://cd.screwdriver.cd/pipelines/12907/builds/917365/steps/sd-setup-init) created at 23:31:37

Downstream builds:

* [l2-1](https://cd.screwdriver.cd/pipelines/12908/builds/917366/steps/sd-setup-init) created at 23:31:42

* [l2-2](https://cd.screwdriver.cd/pipelines/12909/builds/917367/steps/sd-setup-init) created at 23:31:46

* [l2-3](https://cd.screwdriver.cd/pipelines/12910/builds/917368/steps/sd-setup-init) created at 23:31:50

Therefore, when evaluating whether to trigger the second build, the downstream build statuses from the parent event were used, and on that basis the join conditions were judged satisfied.
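The snapshot behavior described above can be sketched roughly as follows. This is a minimal illustration with hypothetical names, not the actual Screwdriver trigger code: when deciding whether to start a join build, the latest build for each `requires` job is looked up across the whole group event, so a SUCCESS build from the parent event can satisfy the join before the current event's downstream builds even exist.

```javascript
// Hypothetical sketch of snapshot-style join evaluation.
// Latest build per job, taking the newest event first.
function latestBuild(builds, jobName) {
  return builds
    .filter(b => b.jobName === jobName)
    .sort((a, b) => b.eventId - a.eventId)[0];
}

// The join is considered "ready" if every required job's latest build,
// from ANY event in the group, succeeded -- the behavior under discussion.
function isJoinReady(groupBuilds, requires) {
  return requires.every(jobName => {
    const build = latestBuild(groupBuilds, jobName);
    return Boolean(build && build.status === 'SUCCESS');
  });
}

// Restarted event 2: only x-second has run so far; y-third exists only in
// parent event 1, yet its old SUCCESS still satisfies the join for z.
const groupBuilds = [
  { jobName: 'x-second', eventId: 2, status: 'SUCCESS' },
  { jobName: 'y-third', eventId: 1, status: 'SUCCESS' } // parent event
];
console.log(isJoinReady(groupBuilds, ['x-second', 'y-third'])); // true
```

With only the parent event's stale y-third build present, the join reports ready, which is exactly why z runs without waiting.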

y-oksaku commented 5 months ago

That said, this behavior seems logical.

Triggering a restart only after all 'join' builds have completed would require calculating which builds are scheduled to execute in the future. That logic could become excessively complex, particularly when there are multiple restart events or when users stop or disable some builds while events are running.

In my opinion, the logic that employs the latest build statuses at a particular moment, much like a snapshot, is simple and elegant.
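For context, the "will this job run later in this event?" question that waiting would require is essentially a reachability walk over the event's workflow graph. A rough sketch of that check (hypothetical data shape, not Screwdriver's actual graph model) is below; the walk itself is simple, and the complexity mentioned above comes from pruning nodes whose builds were stopped or disabled mid-event.

```javascript
// Hypothetical reachability check over a workflow graph's edge list:
// breadth-first search from the restarted job toward the join's upstream job.
function pathExists(edges, from, to) {
  const queue = [from];
  const seen = new Set([from]);
  while (queue.length > 0) {
    const node = queue.shift();
    if (node === to) return true;
    for (const e of edges) {
      if (e.src === node && !seen.has(e.dest)) {
        seen.add(e.dest);
        queue.push(e.dest);
      }
    }
  }
  return false;
}

// Workflow graph from the example config above.
const edges = [
  { src: 'first', dest: 'x-second' },
  { src: 'first', dest: 'y-second' },
  { src: 'y-second', dest: 'y-third' },
  { src: 'x-second', dest: 'z' },
  { src: 'y-third', dest: 'z' }
];
console.log(pathExists(edges, 'first', 'y-third')); // true: z would have to wait
```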

y-oksaku commented 5 months ago

Related: https://github.com/screwdriver-cd/screwdriver/issues/2957

jithine commented 5 months ago

@y-oksaku I believe current logic is buggy.

In most cases we've seen, when a user restarts an upstream job, their intention is for all downstream jobs to run anew. But what's happening here is that, among the downstream jobs, a join job behaves differently: it runs using statuses from previous events instead of waiting for its upstream builds in the current event.

The whole aspect of a restarted build looking back at the statuses of its upstream jobs' builds from previous events should apply only if a given upstream job does not get executed in the current event.

I agree with your analysis that this logic would be a bit complex; however, without it we are breaking user expectations.


y-oksaku commented 5 months ago

@jithine Thanks for your opinion.

Indeed, as you've noted, I believe that the behavior you've described would align with user expectations.

But given that the logic is expected to become complex, I think it would be good to make corrections after conducting a detailed examination of the behavior and design, along with other trigger issues.

jithine commented 5 months ago

@y-oksaku Here are our thoughts on restart scenarios

When a build is restarted from the workflow graph, it creates a new event with the same groupEventId as its parent. If there is a join build in the sub-workflow graph for this execution, then computing its execution readiness should depend on the following conditions.

  1. If there is a path to any of its upstream jobs (listed in `requires`) in the current event's workflow graph, then this join build shouldn't start until that path completes and produces a successful build for the upstream job.
  2. If there is no path to an upstream job in the current event's workflow graph, then this join build should retrieve the status of the upstream job's build by looking at builds from the grouped events.
  3. A join build should be started in the same event that led to its execution. This should address https://github.com/screwdriver-cd/screwdriver/issues/2957
  4. For a remote join job where a remote upstream job was restarted due to failure, the join job may continue in the previous event (the upstream event which started the remote job) whose execution had stopped. We need to check whether this is the current behavior; if we are already creating a new event linked via groupEventId, then let's continue that behavior.
  5. For a remote join job where a remote upstream job was restarted (even if it was successful), the join job should continue in a new event linked via groupEventId.
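Conditions 1 and 2 can be sketched as follows. All names here are hypothetical illustrations of the proposal, not Screwdriver's actual trigger code: for each `requires` job of a join, wait for a current-event build if the restarted job can reach that upstream job in the current event's workflow graph; otherwise fall back to the group event's build status.

```javascript
// Hypothetical reachability helper: BFS over a workflow graph's edge list.
function pathExists(edges, from, to) {
  const queue = [from];
  const seen = new Set([from]);
  while (queue.length > 0) {
    const node = queue.shift();
    if (node === to) return true;
    for (const e of edges) {
      if (e.src === node && !seen.has(e.dest)) {
        seen.add(e.dest);
        queue.push(e.dest);
      }
    }
  }
  return false;
}

// Sketch of conditions 1-2: start the join only when every required job is
// satisfied by the appropriate event's builds.
function shouldStartJoin({ edges, restartedJob, requires, currentBuilds, groupBuilds }) {
  return requires.every(jobName => {
    if (pathExists(edges, restartedJob, jobName)) {
      // Condition 1: the upstream job will run in this event, so only a
      // SUCCESS build from the CURRENT event satisfies the join.
      const build = currentBuilds.find(b => b.jobName === jobName);
      return Boolean(build && build.status === 'SUCCESS');
    }
    // Condition 2: unreachable in this event -- fall back to the latest
    // build status from the grouped events.
    const build = groupBuilds.find(b => b.jobName === jobName);
    return Boolean(build && build.status === 'SUCCESS');
  });
}

// Restarted event from the reproduction: y-third is reachable from first
// but has not yet built in this event, so z must wait.
const edges = [
  { src: 'first', dest: 'x-second' },
  { src: 'first', dest: 'y-second' },
  { src: 'y-second', dest: 'y-third' },
  { src: 'x-second', dest: 'z' },
  { src: 'y-third', dest: 'z' }
];
const waiting = shouldStartJoin({
  edges,
  restartedJob: 'first',
  requires: ['x-second', 'y-third'],
  currentBuilds: [{ jobName: 'x-second', status: 'SUCCESS' }],
  groupBuilds: [{ jobName: 'y-third', status: 'SUCCESS' }]
});
console.log(waiting); // false: the stale parent-event y-third no longer counts
```

Under this sketch the stale parent-event build can no longer satisfy the join when the upstream job is going to run in the current event, which matches the expectation in condition 1.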

y-oksaku commented 5 months ago

@jithine Thank you for your detailed examination. I also think this approach seems good. However, there might be some corner cases, so we will proceed with corrections cautiously.