kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

YAML-based configuration pipeline - Good or Bad #1963

Closed noklam closed 5 months ago

noklam commented 2 years ago

Warning: This is a really long thread.

This issue is a step closer to answering the question. We probably don't have a clear answer yet, but let's try our best to address these questions and document the current state of the issue.

Goal

To understand the "why" and "what" of what users are doing with YAML-based pipelines. Ultimately, does the Kedro team have a clear stance on this? What are the workarounds and suggestions?

Context

This is a long-debated topic, but this issue was inspired by a recent internal discussion.

It is also related to dynamic pipelines and more advanced ConfigLoader usage. In native Kedro, we mainly have two YAML files, parameters.yml and catalog.yml; some users create an extra pipelines.yml to build pipelines at run time. (There is also a config.yml if you want to override parameters at run time with kedro run --config, but it's not very common.)

It's also important to think about the issue along different dimensions; the conclusion can be different depending on these factors.

Some quotes from the discussion (slightly modified) are included below.

Use Cases

Advanced ConfigLoader - class injection from YAML

Without Kedro's parameters, code may look like this.

def my_node(df):
    # do everything here, with params hardcoded inside
    # every node is basically a script
    ...

With parameters.yml, code is more readable and configurable. Parameters are now an argument of the node/function. However, it's not perfect: some boilerplate code is needed for dependency injection.

def my_node(df, params):  # Kedro-compatible functions can only accept simple types, or whatever my IO can return
    p1 = params.get(...)
    p2 = params.get(...)
    model = RandomForest(**p1)  # no use of load_obj - a new node is needed if we want to change to a different model

With libraries like Hydra, some users define the class directly in YAML, so load_obj is no longer needed. This also gives a cleaner function that is easier to reuse across different projects.

def my_node(df, p1, p2, actual_object: BaseEstimator):
    ...
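
To make the pattern concrete, here is a minimal sketch of class injection from YAML, assuming PyYAML and scikit-learn are installed. The `class`/`kwargs` keys and the `_load_obj` helper are hypothetical conventions (mimicking what Kedro's `load_obj` does), not an official Kedro or Hydra schema.

```python
import yaml
from importlib import import_module

# Hypothetical config: the "class"/"kwargs" keys are illustrative, not a Kedro schema.
config = yaml.safe_load(
    """
model:
  class: sklearn.ensemble.RandomForestClassifier
  kwargs:
    n_estimators: 100
    max_depth: 5
"""
)


def _load_obj(path: str):
    # resolve "package.module.ClassName" into the class object
    module_path, _, name = path.rpartition(".")
    return getattr(import_module(module_path), name)


model_cfg = config["model"]
estimator = _load_obj(model_cfg["class"])(**model_cfg["kwargs"])


def my_node(df, estimator):
    # the node receives a ready-made object instead of raw parameter values
    return estimator.fit(df.drop(columns=["target"]), df["target"])
```

Hydra users would achieve something similar with a `_target_` key and `hydra.utils.instantiate`; the trade-off discussed below is the same either way.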

The YAML is now more than constants; it contains actual Python logic. There are two sides to this.

Support:

Against:

Parameters are for things that need to be tweaked between one deployment environment and another. They are not meant to hold arbitrary Python objects or kwargs that can be used to generate arbitrary Python objects. load_obj should not be part of the public API and no one should use it.

- you need to add documentation and validation for what is valid in the parameters file
- node logic just becomes a thin wrapper rather than doing actual data processing
- limited IDE support
- business users just aren’t interested in changing some Python class from one thing to another, and if they do then chances are it will have lots of subtle side effects they don’t understand
- anyone who might want to change the code logic has access to src

Opinions seem very polarised: some users really like to use YAML for more advanced logic, but user research from about a year ago suggests something different.

[Synthesis of user research when using configuration in Kedro #891](https://github.com/kedro-org/kedro/issues/891)

Participants were universally against the idea of moving the Data Catalog into Python as it would fundamentally go against the principles of Kedro.

Dynamic pipeline - a for loop overriding a subset of parameters

In the 0.16.x series, it was possible to read parameters to create the pipeline. These are essentially "pipeline parameters", which are different from the parameters that get passed into a node.

The architecture diagram clearly shows that the pipeline is now created before the session gets created, which means reading the config loader is not possible.

Hey Team, I was wondering if there was an elegant solution to overwrite parameters dynamically? I am instantiating a pipeline 12 times, but they all need to run with a different parameter called date_max, e.g. “07/01/22” for the first one, and the other ones are decrementing one month, e.g. “06/01/22"... etc.. The pipelines are generated of a template dynamically and ideally I would just pass the adjusted parameter.

Alternatives could be:

Relevant issues:

t_-0:
  filters:
    date_max: 2022/07/01
t_-1:
  filters:
    date_max: 2022/06/01
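
As a rough sketch of this workaround, the YAML above could drive namespaced copies of a template pipeline. Dataset and function names such as `model_input` and `filter_by_date` are hypothetical, and the exact parameter-mapping syntax of `pipeline(..., parameters=...)` varies between Kedro versions.

```python
from functools import reduce
from operator import add

import yaml
from kedro.pipeline import node, pipeline

# pipeline-level parameters, mirroring the YAML snippet above
variants = yaml.safe_load(
    """
t_-0:
  filters:
    date_max: 2022/07/01
t_-1:
  filters:
    date_max: 2022/06/01
"""
)


def filter_by_date(df, filters):
    # keep only rows up to this variant's date_max
    return df[df["date"] <= filters["date_max"]]


template = pipeline([node(filter_by_date, ["model_input", "params:filters"], "filtered")])


def create_pipeline(**kwargs):
    # one namespaced copy of the template per variant; each copy reads its own
    # params:<variant>.filters entry from parameters.yml
    copies = [
        pipeline(
            template,
            namespace=name,
            parameters={"params:filters": f"params:{name}.filters"},
        )
        for name in variants
    ]
    return reduce(add, copies)
```

The crux remains that `variants` has to be known when `create_pipeline` runs, i.e. before the session and ConfigLoader exist, which is why people read an extra YAML file directly at this point.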

Experimentation turning nodes on & off - dynamic number of features

Turning certain nodes on and off for feature experimentation

When I'm building a modelling pipeline, I tend to create a lot of features and feature groups, which I want to turn off and on depending on the experiment. Those features are usually independent from each other, so I can calculate them in parallel. For now, the best way I came up with to implement this is to create one node per feature, and then a kwargs node that combines them together. However, I don't really want to spend resources for calculating features that I know I won't be using. What would be the kedro way to achieve this?

This is essentially a parallel pipeline. Imagine a pipeline like the one below, with three feature-generation nodes "A", "B", "C" and an aggregation node "D". A user may want to skip node "A", but it's not possible to do that from parameters.yml, and the user would also have to change the definition of "D". As a workaround, users create dynamic pipelines via YAML.

A--
    \
B-- D
    /
C--

Basically, in your pipelines.py you need to make changes like this:

pipeline([
    node(xx, None, "D1", name="A"),
    node(xx, None, "D2", name="B"),
    node(xx, None, "D3", name="C"),
    node(aggregation, ["D1", "D2", "D3"], "D"),
])

# To skip node "A", it becomes something like this; **kwargs may help a bit,
# but it suffers from the same problem.
pipeline([
    node(xx, None, "D2", name="B"),
    node(xx, None, "D3", name="C"),
    node(aggregation, ["D2", "D3"], "D"),
])
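
A hedged sketch of how that workaround usually looks in Python (names like `model_input` and `FEATURE_FUNCS` are hypothetical): the list of enabled features has to be known at pipeline-creation time, which is exactly why people reach for an extra YAML file read at import time.

```python
import pandas as pd
from kedro.pipeline import Pipeline, node


# hypothetical feature functions for nodes "A", "B", "C" and aggregation "D"
def feature_a(df):
    return df.assign(a=df["x"] * 2)


def feature_b(df):
    return df.assign(b=df["x"] + 1)


def feature_c(df):
    return df.assign(c=df["x"] ** 2)


FEATURE_FUNCS = {"A": feature_a, "B": feature_b, "C": feature_c}


def aggregate(*feature_dfs):
    # combine whichever feature frames were actually produced
    return pd.concat(feature_dfs, axis=1)


def make_feature_pipeline(enabled_features):
    # e.g. enabled_features = ["B", "C"] skips node "A" *and* rewires "D",
    # so the nodes and the aggregation inputs stay consistent
    feature_nodes = [
        node(FEATURE_FUNCS[name], "model_input", f"feature_{name}", name=f"compute_{name}")
        for name in enabled_features
    ]
    agg = node(aggregate, [f"feature_{name}" for name in enabled_features], "features", name="aggregate")
    return Pipeline(feature_nodes + [agg])
```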

Merging a dynamic number of datasets at run time

from typing import Any, Dict, List, Optional

from pandas import DataFrame


def bucketize(
    base_df: DataFrame,
    buckets_config: Dict[str, Any],
    *apply_dfs: Optional[DataFrame],
) -> List[DataFrame]:
    ...
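
A hedged sketch of how such a node could be wired: the dataset names (`segment_a`, `params:buckets`, `bucketed_dfs`) are hypothetical, and the point is that the list of extra inputs is what ends up being generated dynamically, e.g. from YAML.

```python
from kedro.pipeline import node

# hypothetical catalog entries whose number is only known at pipeline-creation time
apply_dataset_names = ["segment_a", "segment_b", "segment_c"]

bucketize_node = node(
    bucketize,
    inputs=["base_df", "params:buckets"] + apply_dataset_names,  # extra frames arrive via *apply_dfs
    outputs="bucketed_dfs",  # a single dataset holding the returned list of frames
)
```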

Note: Congratulations, if you finished reading and understood most of it, you are probably a Kedro expert already! 🎊

astrojuanlu commented 11 months ago

This debate comes up again from time to time, for example on our socials: https://data-folks.masto.host/@1abidaliawan/111455801229101788 (cc @kingabzpro; also cc @NeroOkwa because this came as a response to the "sharing Kedro Viz" blog post).

It's important to realize that, the moment the pipeline definitions are fully declarative, any sort of reasoning around the pipelines becomes massively easier. Things like exporting the pipeline structure without having to execute the code (a major pain point in Kedro Viz), sharing the pipeline with others, authoring pipelines with a visual editor (like Orchest did), etc.

On the other hand, people tend to report lots of pain with YAML-based pipelines. The lack of flexibility means that these YAML files tend to be automatically generated... by a Python script, and debugging certain things becomes more difficult.

Not everybody might agree with this but I think it's useful to consider that there are different classes of YAML pipelines. One antipattern that makes things especially obnoxious in my opinion is including the code directly in the YAML definition (like this blog post illustrates) or falling back to some sort of Bash scripting (like DVC does).

So, to me the useful question is: is it possible to keep the good parts of YAML pipelines but try to avoid the bad ones?