treasure-data / digdag

Workload Automation System
https://www.digdag.io/
Apache License 2.0

Backfill sessions are not sequential after unpause #929

Open jaymed opened 5 years ago

jaymed commented 5 years ago

When you pause/un-pause a workflow the backfill behavior switches to parallel backfill rather than the expected sequential backfill. For example, a workflow with a one minute interval schedule that is paused for 10 minutes will run 10 parallel sessions as soon as it is un-paused. In contrast, a backfill command with the count set to 10 will run those 10 backfill sessions sequentially. I believe the expected behavior is to run things sequentially by giving the un-pause/backfill operation a chance to properly cycle through all the session times that were missed. This seems to be a bug. The impact is that multiple jobs running in parallel fail due to resource dependencies.

Digdag server version: 0.9.31
Database: PostgreSQL
Log/archive storage: S3
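
A minimal workflow that might reproduce the behavior described above is sketched here. The task name and the 90-second sleep are assumptions for illustration and are not taken from the original report; the idea is simply that each session outlives the one-minute interval, so sessions started together after an un-pause would overlap.

```yaml
timezone: UTC

schedule:
  # One-minute interval, as in the scenario described above.
  minutes_interval>: 1

# Hypothetical task: runs longer than the schedule interval.
+long_running_step:
  sh>: sleep 90
```

Pausing this schedule for about 10 minutes and then un-pausing it should, per the report, start the missed sessions at the same time rather than one after another.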

hiroyuki-sato commented 5 years ago

Hello, @jaymed

IIUC, Digdag executes backfilled tasks in parallel when a workflow is un-paused on the Digdag server. It's not a bug. Also, the Digdag backfill command does not guarantee sequential execution. The current implementation just happens to backfill sessions one by one. If your task depends on another task, you can use the require> operator as shown below.

For example:

timezone: UTC

schedule:
  monthly>: 1,09:00:00

+depend_on_all_daily_workflow_in_month:
  loop>: ${moment(last_session_time).daysInMonth()}
  _do:
    require>: daily_workflow
    session_time: ${moment(last_session_time).add(i, 'day')}

jaymed commented 5 years ago

Thank you for your response. Unfortunately, it does not address my issue. I am really just looking for a simple way to prevent two sessions of the same workflow from colliding. There are options to skip a session if the previous one runs over time and collides with it, but there are no options to wait. A wait option would be great to have, since it would allow workflows to catch up after some time.
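
For context, the existing skip behavior referred to here is configured per schedule with skip_on_overtime. The following is a minimal sketch, assuming a one-minute interval like the one in the original report; the +main task is a placeholder.

```yaml
timezone: UTC

schedule:
  minutes_interval>: 1
  # Skip a scheduled session while the previous session is still running.
  skip_on_overtime: true

# Placeholder task for illustration only.
+main:
  echo>: session ${session_time}
```

This avoids collisions by dropping the overlapping session, which is not the same as waiting and running it later.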

hiroyuki-sato commented 5 years ago

Hello, @jaymed

Could you show us an example workflow? I can't picture your problem. If your task depends on another task, the require> operator may be useful. If you want to skip backfill, skip_delayed_by may help.
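
A minimal sketch of skip_delayed_by in a schedule follows. The one-minute interval mirrors the original report, and the 3m threshold is an illustrative value, not a recommendation.

```yaml
schedule:
  minutes_interval>: 1
  # Skip sessions whose start is delayed by more than 3 minutes (illustrative value),
  # so a long pause does not trigger a burst of catch-up sessions.
  skip_delayed_by: 3m
```

Note that this drops the delayed sessions rather than running them one by one.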

jaymed commented 5 years ago

Please consider the following timeline diagram of a sample workflow. This sample workflow does not depend on any other workflow. Each workflow session is required to run to completion before the next session starts. In this diagram, the workflow is paused after session 2 and unpaused sometime after session 7 would have been scheduled. What I would like to see is for sessions 3-7 to be backfilled sequentially. The problem I see now is that these sessions (3-7) are backfilled in parallel when I unpause the workflow. We cannot have multiple sessions belonging to the same workflow running in parallel, and we want to backfill all missed sessions (3-7) one by one without skipping any. Session 8 would also need to wait until sessions 3-7 have finished. How can this be accomplished?

[Image: digdag pause timeline 1]

hiroyuki-sato commented 5 years ago

Hello, @jaymed

Maybe you need something like depends_on_past, which is implemented in Apache Airflow? If session 2 is still running at session 3's start time, session 3 should wait to start until session 2 completes, right?

It's similar to issue #615.

jaymed commented 5 years ago

@hiroyuki-sato Yes, it is exactly like depends_on_past. Airflow is great, and it has many features not yet available in Digdag. One reason we are not using Airflow is that it does not offer the simplicity of Digdag. The streamlined approach offered by Digdag is why we chose it over Airflow. With that said, would it be possible to implement such a feature in Digdag? The similar issue #615 you referenced is still open.

yoyama commented 5 years ago

I would like to propose introducing a wait_until_last_schedule option in schedule, as follows.

https://github.com/treasure-data/digdag/compare/master...yoyama:feature-wait_until_last_schedule

If this option is true and there is an active attempt, the schedule will be delayed. As a result, only one session will run at a time. I am still testing it, but the patch looks like it works well.

Would this resolve the issue?
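
In a workflow definition, the proposed option would presumably be used like this. This is a sketch only: wait_until_last_schedule exists in the linked branch, not in a released Digdag version at the time of this comment, and the one-minute interval is taken from the original report.

```yaml
schedule:
  minutes_interval>: 1
  # Proposed option (linked branch only): delay the next session
  # while an attempt of this workflow is still active.
  wait_until_last_schedule: true
```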

jaymed commented 5 years ago

> I would like to propose introducing wait_until_last_schedule option in schedule as follows.

This is great! Could the new option be named wait_on_overtime? It would match up with the existing skip_on_overtime schedule option.

yoyama commented 5 years ago

Thank you for your proposal on the option name. I don't know which is better. As you mention, wait_on_overtime matches skip_on_overtime, but wait_until_last_schedule may be easier to understand. I'd like to hear other people's comments.