kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.91k stars 900 forks

Incremental runs/"Run only missing" #221

Closed yetudada closed 8 months ago

yetudada commented 4 years ago

Description

We're taking the principle of Change Data Capture a step further and looking at a way for Kedro to recognise code, parameter and data changes, and to re-run only the sections that need to be rebuilt, including the downstream pipelines they affect.

You have called this "run only missing" in #82 and #30, and we're finally getting smart about it.
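To make the "only re-run what is affected" idea concrete, here is a minimal sketch in plain Python. It is not Kedro's API or a proposed implementation: the `nodes_to_rerun` function, the `downstream` mapping and the node names are all hypothetical, and it assumes something else has already worked out which nodes changed.

```python
from collections import deque


def nodes_to_rerun(downstream: dict[str, list[str]], changed: set[str]) -> set[str]:
    """Return the changed nodes plus every node reachable downstream of them."""
    stale = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in stale:
                stale.add(child)
                queue.append(child)
    return stale


# Hypothetical pipeline: each key lists the nodes that consume its output.
pipeline = {
    "clean": ["features"],
    "load_lookup": ["features"],
    "features": ["train"],
    "train": ["report"],
}

# Only "features" changed, so "clean" and "load_lookup" can be skipped.
print(sorted(nodes_to_rerun(pipeline, {"features"})))  # ['features', 'report', 'train']
```

A breadth-first walk is enough here because the pipeline is a DAG; the hard part this sketch leaves out is deciding what counts as "changed" in the first place.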

Context

We're going to help shorten your development time when running your pipeline because you don't have to worry about re-running the entire pipeline anymore.

yetudada commented 4 years ago

I'm going to link some of #225 to this. It had some great ideas that we could use here.

astrojuanlu commented 1 year ago

This was brought up by a user recently (cc @pascalwhoop), but the title of the issue might make it difficult to locate. "Run only missing", "incremental runs", "change detection" could be some possible themes.

It is worth noting that to make this feasible, kedro run would need to be stateful rather than stateless. A plugin could potentially take care of that, much as Kedro Viz does through the session store (https://docs.kedro.org/en/stable/experiment_tracking/#set-up-the-session-store), using some smart hashing and/or comparing the "last modified" date with the session run date.
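As a rough illustration of the hashing idea, the sketch below fingerprints a node from its source code, parameters and input files, then compares that with what was stored after the previous run. Everything here is an assumption for illustration only: the function names, the JSON state file and the choice of SHA-256 are hypothetical, and none of it uses Kedro or the Kedro Viz session store.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(code: str, params: dict, input_files: list[Path]) -> str:
    """Hash a node's source code, parameters and input file contents."""
    digest = hashlib.sha256()
    digest.update(code.encode("utf-8"))
    # sort_keys makes the hash independent of dict insertion order.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    for path in sorted(input_files):
        digest.update(str(path).encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def _load_state(state_file: Path) -> dict:
    """Read the stored fingerprints, or return an empty state on first run."""
    return json.loads(state_file.read_text()) if state_file.exists() else {}


def needs_rerun(node_name: str, current: str, state_file: Path) -> bool:
    """True if the node has no stored fingerprint or it differs from the current one."""
    return _load_state(state_file).get(node_name) != current


def record_run(node_name: str, current: str, state_file: Path) -> None:
    """Persist the fingerprint after a successful run (this is the 'stateful' part)."""
    state = _load_state(state_file)
    state[node_name] = current
    state_file.write_text(json.dumps(state, indent=2))
```

Hashing file contents is more robust than comparing "last modified" dates but costs a full read of every input, which may be impractical for large or remote datasets; a real plugin would probably need a cheaper per-dataset strategy.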

However, this would also move us closer to the "actually-an-orchestrator" territory, which we've been trying to avoid.

I think making kedro run smarter would be a big improvement for lots of users, but before attempting this we should better understand what the alternatives are.

noklam commented 1 year ago

This would be useful for interactive runs too. Stateful runs would also open up a "lineage" problem: if pipeline_1 creates dataset_1 and pipeline_2 depends on dataset_1, is it possible to re-create the whole run history?

These are all interesting and useful features, but they are also very challenging.

astrojuanlu commented 1 year ago

Related: https://openlineage.io/

astrojuanlu commented 10 months ago

After reading more on data pipelines and Change Data Capture (this is the blog post that prompted me to come here: https://debezium.io/blog/2018/07/19/advantages-of-log-based-change-data-capture/), I think calling this "Change Capture" is quite confusing. I will rename the issue for clarity.

astrojuanlu commented 8 months ago

This is essentially a duplicate of #2307.