treasure-data / digdag

Workload Automation System
https://www.digdag.io/
Apache License 2.0

[Q] Depend on the past #615

Open kakoni opened 7 years ago

kakoni commented 7 years ago

Apache Airflow has a feature called depends_on_past, where "task instances will depend on the success of the preceding task instance".

I find this extremely useful for my use case, where I've got daily recurring tasks, so the task running on 20170806 depends on the success of 20170805.

Can you do something similar with digdag?

hiroyuki-sato commented 7 years ago

@kakoni Have you ever tried require>? http://docs.digdag.io/operators/require.html#require-depends-on-another-workflow

I hope this is the operator you are looking for.

kakoni commented 7 years ago

Ah, right, I could do something like:

+require:
  require>: ..SELF..
  session_time: ${last_session_time}

Yes, that would work! One more question, though: how do I handle the initial run, which has no previous session to depend on?

komamitsu commented 7 years ago

Hmm... last_session_time is calculated purely from the current timestamp: https://github.com/treasure-data/digdag/blob/master/digdag-standards/src/main/java/io/digdag/standards/scheduler/SecondsIntervalSchedulerFactory.java#L73

Indeed, you can't depend on it.
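
To illustrate why the first run is a problem, here is a minimal sketch in Python. It is not digdag's actual code: it assumes a fixed schedule interval, and the function name only mirrors the variable for readability. The point is that the value is plain arithmetic, so it exists even when no earlier session ever ran.

```python
from datetime import datetime, timedelta, timezone


def last_session_time(session_time: datetime, interval: timedelta) -> datetime:
    """Hypothetical helper: previous schedule slot for a fixed interval."""
    # Pure arithmetic. Nothing here checks whether a session
    # was ever actually executed for the returned time.
    return session_time - interval


# The very first session of a daily schedule still gets a "last" time.
first = datetime(2017, 8, 5, tzinfo=timezone.utc)
previous = last_session_time(first, timedelta(days=1))
```

Under these assumptions, a dependency on the previous slot of the first session would wait for a run that never happened.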

komamitsu commented 7 years ago

Maybe you need to use external persistent data (e.g. a local file) as a workaround, like this:

+start:
  sh>: touch /tmp/${session_time}.lock

+check:
  sh>: if [ -f /tmp/${last_session_time}.lock ]; then exit 1; fi

+run:
  echo>: "Executing ${session_time}"

+end:
  sh>: rm /tmp/${session_time}.lock

There seems to be room to improve the robustness of the above workflow, though.
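
One possible direction for that improvement, sketched in Python under stated assumptions: instead of a lock that is removed at the end (and that stays behind forever if a session crashes), each session writes a success marker when it finishes, and the next session proceeds only if the previous marker exists. All names here (state_dir, first_session_time, the helper functions) are hypothetical, and the first session has to be declared explicitly because nothing precedes it.

```python
from pathlib import Path
from typing import Optional


def marker_path(state_dir: Path, session_time: str) -> Path:
    # One marker file per session, named after its session time.
    return state_dir / f"{session_time}.success"


def mark_success(state_dir: Path, session_time: str) -> None:
    # Call this as the last step of a session that finished cleanly.
    state_dir.mkdir(parents=True, exist_ok=True)
    marker_path(state_dir, session_time).touch()


def previous_succeeded(
    state_dir: Path,
    session_time: str,
    last_session_time: str,
    first_session_time: Optional[str] = None,
) -> bool:
    # The declared first session has no predecessor, so it may always run.
    if first_session_time is not None and session_time == first_session_time:
        return True
    # Any other session requires the previous session's success marker.
    return marker_path(state_dir, last_session_time).exists()
```

A check step could exit with a non-zero status when previous_succeeded returns False, and the final step could call mark_success. Unlike the lock-file version, a crashed or skipped previous session blocks the next one until someone intervenes, which is closer to "depends on success".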

kakoni commented 6 years ago

@hiroyuki-sato Does digdag have an interface for getting the previous instance of a session (i.e., to get its status)?

hiroyuki-sato commented 6 years ago

Hello, @kakoni

Could you tell me more about your question? Are you looking for a CLI command like this? https://github.com/treasure-data/digdag/issues/603

There may be no CLI interface for this yet.

kakoni commented 6 years ago

I was thinking about creating a new operator, or extending require> with a depends_on_past option (perhaps there is a better name, but I'm using this for now).

In order to get that to work, I would need to access the previous instance for the current session. So in pseudo lang;

hiroyuki-sato commented 6 years ago

Hello, @kakoni

I have no idea yet. I'll let you know if I find a good solution. (Since I'm not a core developer, I have to read the source.) Let's hack on digdag! :smile:

jaymed commented 5 years ago

@kakoni Did you ever find a solution to this problem? I'm dealing with the same thing. See #929.

kakoni commented 5 years ago

@jaymed Yes. I really wanted to use digdag for my use cases, but since depending on the past is so essential for my workflows, I had to go with Airflow.

jaymed commented 5 years ago

@kakoni OK, that makes sense. Thanks for getting back to me.

@hiroyuki-sato There's definitely a major need for this feature.

hiroyuki-sato commented 5 years ago

Hello, @kakoni and @jaymed

Thank you for commenting on a new feature.

Compared with the Airflow project (677 contributors), Digdag is still developed by a very small team (58 contributors).

I will consider these requests.

By the way, I'm not familiar with Apache Airflow. Do you know how to write depends_on_tasks for an initial state in Airflow? (That is, a run that can't depend on the last session.) https://github.com/treasure-data/digdag/issues/615#issuecomment-320591081

hiroyuki-sato commented 5 years ago

@muga Please take a look at this issue when you get a chance: https://github.com/treasure-data/digdag/issues/929#issuecomment-454270266

yoyama commented 5 years ago

To solve #615 and #929, I would like to introduce two new scheduler options, wait_until_last_schedule and wait_until_last_schedule_succeed, as follows.

https://github.com/treasure-data/digdag/compare/master...yoyama:feature-wait_until_last_schedule?expand=1

How about these options?

kakoni commented 5 years ago

@hiroyuki-sato

Do you know how to write depends_on_tasks for an initial state in Airflow? (That is, a run that can't depend on the last session.)

There's another configuration option called start_date. If your execution date is the same as start_date, the run doesn't depend on the last session (as this is the initial/first state).
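
The rule described above can be sketched roughly as follows. This is a simplified illustration, not Airflow's actual implementation: the function name, the states mapping, and the fixed interval are all assumptions made for the example.

```python
from datetime import date, timedelta
from typing import Mapping


def may_run(
    execution_date: date,
    start_date: date,
    interval: timedelta,
    states: Mapping[date, str],
) -> bool:
    """Hypothetical check: is the run for execution_date allowed to start?"""
    # The first scheduled run has no predecessor, so it is never blocked.
    if execution_date <= start_date:
        return True
    # Every later run waits until the preceding run has succeeded.
    previous = execution_date - interval
    return states.get(previous) == "success"
```

For the daily case from the top of this thread, the run for 20170806 would be allowed only once the entry for 20170805 is "success", while a run on the start date itself is always allowed.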

hiroyuki-sato commented 5 years ago

Hello, @kakoni

Thank you for your reply!

@yoyama Do wait_until_last_schedule and wait_until_last_schedule_succeed support something like the start_date option in Airflow?

kakoni commented 5 years ago

Here's the logic in Airflow, if you're interested: https://github.com/apache/airflow/blob/master/airflow/ti_deps/deps/prev_dagrun_dep.py#L47

y-ken commented 4 years ago

I also want this feature. It is necessary when backfilling multiple sessions that must proceed one by one. It is also needed for workflows that must run as a single job, such as memory-intensive workflows.