This debate comes up from time to time, most recently on our socials: https://data-folks.masto.host/@1abidaliawan/111455801229101788 (cc @kingabzpro; also cc @NeroOkwa, because this came as a response to the "sharing Kedro Viz" blog post).
It's important to realize that the moment the pipeline definitions are fully declarative, any sort of reasoning about the pipelines becomes massively easier: things like exporting the pipeline structure without having to execute the code (a major pain point in Kedro Viz), sharing the pipeline with others, authoring pipelines with a visual editor (like Orchest did), etc.
On the other hand, people tend to report lots of pain with YAML-based pipelines. The lack of flexibility means that these YAML files tend to be automatically generated... by a Python script, and debugging certain things becomes more difficult.
Not everybody might agree with this, but I think it's useful to consider that there are different classes of YAML pipelines. One antipattern that, in my opinion, makes things especially obnoxious is including code directly in the YAML definition (as this blog post illustrates) or falling back to some sort of Bash scripting (as DVC does).
So, to me the useful question is: is it possible to keep the good parts of YAML pipelines but try to avoid the bad ones?
This issue will be a step closer to answering that question. We probably don't have a clear answer yet, but let's try our best to address all these questions and document the current state of the discussion.
Goal
To understand the "why" and "what" of users' YAML-based pipelines. Ultimately, does the Kedro team have a clear stance on this? What are the workarounds and suggestions?
Context
This is a long-debated topic, but this issue was inspired by a recent internal discussion.
It is also related to dynamic pipelines and more advanced `ConfigLoader` usage. In native Kedro, we mainly have two YAML files, `parameters.yml` and `catalog.yml`; some users create an extra `pipelines.yml` to create pipelines at run time. (Actually, we also have a `config.yml` if you want to override parameters at run time with a YAML config via `kedro run --config`, but it's not very common.)

It's also important to think about the issue along different dimensions; the conclusion may be different depending on these factors.
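For reference, a minimal sketch of that run-time config file. The keys mirror `kedro run` CLI flags; the values below are made-up examples:

```yaml
# config.yml - passed with `kedro run --config=config.yml`
run:
  pipeline: data_science
  env: staging
  tags: features, training
```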
Some quotes from the discussion (slightly modified)
Use Cases
Advanced Config Loader - class injection from YAML
Without Kedro's parameters, code may look like this:
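A minimal sketch, assuming a scikit-learn model; everything is hard-coded inside the node:

```python
from sklearn.ensemble import RandomForestClassifier

def train_model(X_train, y_train):
    # Everything hard-coded: changing the model or its settings means
    # editing source code.
    model = RandomForestClassifier(n_estimators=100, max_depth=5)
    return model.fit(X_train, y_train)
```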
With `parameters.yml`, the code is more readable and configurable: parameters are now an argument of a node/function. However, it's not perfect; some boilerplate code is needed for dependency injection.
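A sketch of that boilerplate, assuming a `parameters.yml` entry that names the class and its keyword arguments, with `kedro.utils.load_obj` doing the lookup:

```python
# Assumed parameters.yml entry:
#
#   model:
#     class: sklearn.ensemble.RandomForestClassifier
#     kwargs:
#       n_estimators: 100
#       max_depth: 5
#
from kedro.utils import load_obj

def train_model(X_train, y_train, model_params: dict):
    # Boilerplate dependency injection: resolve the dotted path to a class,
    # then instantiate it with the configured keyword arguments.
    model_class = load_obj(model_params["class"])
    model = model_class(**model_params["kwargs"])
    return model.fit(X_train, y_train)
```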
With libraries like `hydra`, some may define the class directly in YAML, so `load_obj` is no longer needed. This also gives a cleaner function that is easier to reuse across different projects (see the sketch below).
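A sketch of the `hydra` flavour, assuming the standard `_target_` convention consumed by `hydra.utils.instantiate`:

```python
# Assumed YAML entry:
#
#   model:
#     _target_: sklearn.ensemble.RandomForestClassifier
#     n_estimators: 100
#     max_depth: 5
#
from hydra.utils import instantiate

def train_model(X_train, y_train, model_config: dict):
    # hydra builds the object straight from the config: no load_obj call,
    # and the node knows nothing about which class is being injected.
    model = instantiate(model_config)
    return model.fit(X_train, y_train)
```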
The YAML is now more than constants; it now contains actual Python logic. There are two sides of views:

Support: `parameters.yml` - more reusable code, simplified nodes (requires a specific `ConfigLoader` class and a modification to `hooks.py` at the moment). The `OmegaConf` change should make dependency injection easier; see the resolver sketch below.

Against: `parameters.yml` - over-parameterization -> trying to write Python code in YAML.

Opinions seem to be very polarized: some users really like to use YAML for more advanced logic, but user research from about a year ago suggests something different.
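For illustration, a sketch of what OmegaConf-based injection could look like. The `obj` resolver name and the YAML layout are assumptions, not Kedro defaults:

```python
from collections import OrderedDict
from importlib import import_module

from omegaconf import OmegaConf

def _load_obj(path: str):
    """Resolve a dotted path such as 'collections.OrderedDict' to the object."""
    module_path, _, name = path.rpartition(".")
    return getattr(import_module(module_path), name)

# A custom resolver lets YAML reference Python objects by dotted path,
# e.g.  model_class: ${obj:collections.OrderedDict}
OmegaConf.register_new_resolver("obj", _load_obj)

conf = OmegaConf.create({"model_class": "${obj:collections.OrderedDict}"})
assert conf.model_class is OrderedDict  # resolved lazily on access
```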
Dynamic pipeline - a for loop overriding a subset of parameters
In the `0.16.x` series, it's possible to read parameters to create the pipeline. These are essentially "pipeline parameters", which are different from the parameters that get passed into a `node`.

The architecture diagram clearly shows that the pipeline is now created before the session gets created, which means reading the config loader is not possible.
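Below is a minimal sketch of the for-loop pattern this section describes, using modular pipelines. Every name and date is illustrative, and the list is hard-coded precisely because the config loader is not available at pipeline-creation time:

```python
from kedro.pipeline import Pipeline, node, pipeline

def filter_by_date(data, date):
    # Illustrative node: keep only the rows for one run date.
    return data[data["date"] == date]

def create_pipeline(**kwargs) -> Pipeline:
    template = pipeline(
        [node(filter_by_date, ["raw_data", "params:date"], "filtered")]
    )
    dates = ["2023_01", "2023_02"]  # hard-coded: no config loader here
    copies = [
        pipeline(
            template,
            namespace=f"run_{d}",                            # isolate each copy
            inputs={"raw_data": "raw_data"},                 # share one input dataset
            parameters={"params:date": f"params:date_{d}"},  # per-copy override
        )
        for d in dates
    ]
    return sum(copies, Pipeline([]))
```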
Alternatives could be: `kedro run --params=specific_date`, made into a bash script or something similar to run it N times (a sketch follows).
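For instance, driving the loop from outside Kedro. The parameter name and dates are made up, and the exact `--params` value syntax varies between Kedro versions:

```python
import subprocess

# Run the same pipeline N times, overriding one parameter per run.
for run_date in ["2023-01-01", "2023-02-01", "2023-03-01"]:
    subprocess.run(["kedro", "run", "--params", f"date={run_date}"], check=True)
```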
Relevant issues:
Experimentation turning nodes on & off - dynamic numbers of features
Turning certain nodes on & off for feature experimentation.
This is essentially a parallel pipeline. Imagine a pipeline with 3 feature-generation nodes "A", "B", "C", and an aggregation node "D". A user may want to skip one of the nodes, say "A", but it's not possible to do that from `parameters.yml`, and the user will also have to change the definition of "D". As a workaround, users create dynamic pipelines via YAML. Basically, in your `pipelines.py` you need to make changes like the sketch below.
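A hedged sketch of that workaround; the file path, key name, and feature functions are all made up:

```python
import yaml
from kedro.pipeline import Pipeline, node

# Placeholder feature functions; a real project would import these.
def make_feature_a(df): return df
def make_feature_b(df): return df
def make_feature_c(df): return df

def combine(*features):
    return features  # placeholder for the aggregation logic in "D"

FEATURE_FUNCS = {"A": make_feature_a, "B": make_feature_b, "C": make_feature_c}

def create_pipeline(**kwargs) -> Pipeline:
    # Read a hand-rolled YAML directly: Kedro's ConfigLoader is not
    # available at pipeline-creation time.
    with open("conf/base/pipelines.yml") as f:
        enabled = yaml.safe_load(f)["enabled_features"]  # e.g. ["B", "C"]

    feature_nodes = [
        node(FEATURE_FUNCS[name], "model_input", f"feature_{name}")
        for name in enabled
    ]
    # "D" must be rebuilt from the same list: its inputs change whenever
    # a feature node is switched off, which is the crux of the problem.
    aggregation = node(combine, [f"feature_{name}" for name in enabled], "features")
    return Pipeline(feature_nodes + [aggregation])
```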
Merging a dynamic number of datasets at run-time