This debate comes up from time to time, most recently on our socials: https://data-folks.masto.host/@1abidaliawan/111455801229101788 (cc @kingabzpro; also cc @NeroOkwa, because this came as a response to the "sharing Kedro Viz" blog post).
It's important to realize that the moment the pipeline definitions are fully declarative, any sort of reasoning about the pipelines becomes massively easier: things like exporting the pipeline structure without having to execute the code (a major pain point in Kedro Viz), sharing the pipeline with others, authoring pipelines with a visual editor (like Orchest did), etc.
On the other hand, people tend to report lots of pain with YAML-based pipelines. The lack of flexibility means that these YAML files tend to be automatically generated... by a Python script, and debugging certain things becomes more difficult.
Not everybody might agree with this, but I think it's useful to consider that there are different classes of YAML pipelines. One antipattern that, in my opinion, makes things especially obnoxious is including code directly in the YAML definition (as this blog post illustrates) or falling back to some sort of Bash scripting (as DVC does).
So, to me the useful question is: is it possible to keep the good parts of YAML pipelines but try to avoid the bad ones?
This issue will be a step closer to answering that question. We probably don't have a clear answer yet, but let's try our best to address all these questions and document the current state of the discussion.
Goal
To understand the "why" and "what" of users' YAML-based pipelines. Ultimately, does the Kedro team have a clear stance on this? What are the workarounds and suggestions?
Context
This is a long-debated topic, but this issue was inspired by a recent internal discussion.
It is also related to dynamic pipelines and more advanced `ConfigLoader` usage. In native Kedro, we mainly have two YAML files, `parameters.yml` and `catalog.yml`; some users create an extra `pipelines.yml` to create pipelines at run time. (Actually, we also have a `config.yml` if you want to override parameters at run time with a YAML config via `kedro run --config`, but it's not very common.)

It's also important to think about the issue along different dimensions; the conclusion may be different depending on these factors.
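For reference, a minimal sketch of that run-time config file. The keys mirror `kedro run` CLI flags; the values below are made-up examples:

```yaml
# config.yml - passed with `kedro run --config=config.yml`
run:
  pipeline: data_science
  env: staging
  tags: features, training
```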
Some quotes from the discussion (slightly modified)
Use Cases
Advanced Config Loader - class injection from YAML
Without Kedro's parameters, code may look like this:
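A minimal sketch, assuming a scikit-learn model; everything is hard-coded inside the node:

```python
from sklearn.ensemble import RandomForestClassifier

def train_model(X_train, y_train):
    # Everything hard-coded: changing the model or its settings means
    # editing source code.
    model = RandomForestClassifier(n_estimators=100, max_depth=5)
    return model.fit(X_train, y_train)
```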
With `parameters.yml`, the code is more readable and configurable: parameters are now an argument of a node/function. However, it's not perfect; some boilerplate code is needed for dependency injection.
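A sketch of that boilerplate, assuming a `parameters.yml` entry that names the class and its keyword arguments, with `kedro.utils.load_obj` doing the lookup:

```python
# Assumed parameters.yml entry:
#
#   model:
#     class: sklearn.ensemble.RandomForestClassifier
#     kwargs:
#       n_estimators: 100
#       max_depth: 5
#
from kedro.utils import load_obj

def train_model(X_train, y_train, model_params: dict):
    # Boilerplate dependency injection: resolve the dotted path to a class,
    # then instantiate it with the configured keyword arguments.
    model_class = load_obj(model_params["class"])
    model = model_class(**model_params["kwargs"])
    return model.fit(X_train, y_train)
```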
With libraries like `hydra`, some may define the class directly in YAML, so `load_obj` is no longer needed. This also gives a cleaner function that is easier to reuse across different projects (see the sketch below).
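A sketch of the `hydra` flavour, assuming the standard `_target_` convention consumed by `hydra.utils.instantiate`:

```python
# Assumed YAML entry:
#
#   model:
#     _target_: sklearn.ensemble.RandomForestClassifier
#     n_estimators: 100
#     max_depth: 5
#
from hydra.utils import instantiate

def train_model(X_train, y_train, model_config: dict):
    # hydra builds the object straight from the config: no load_obj call,
    # and the node knows nothing about which class is being injected.
    model = instantiate(model_config)
    return model.fit(X_train, y_train)
```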
The YAML is now more than constants; it now contains actual Python logic. There are two sides of views:

Support: `parameters.yml` - more reusable code, simplified nodes (requires a specific `ConfigLoader` class and a modification to `hooks.py` at the moment). The `OmegaConf` change should make dependency injection easier; see the resolver sketch below.

Against: `parameters.yml` - over-parameterization -> trying to write Python code in YAML.

Opinions seem to be very polarized: some users really like to use YAML for more advanced logic, but user research from about a year ago suggests something different.
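For illustration, a sketch of what OmegaConf-based injection could look like. The `obj` resolver name and the YAML layout are assumptions, not Kedro defaults:

```python
from collections import OrderedDict
from importlib import import_module

from omegaconf import OmegaConf

def _load_obj(path: str):
    """Resolve a dotted path such as 'collections.OrderedDict' to the object."""
    module_path, _, name = path.rpartition(".")
    return getattr(import_module(module_path), name)

# A custom resolver lets YAML reference Python objects by dotted path,
# e.g.  model_class: ${obj:collections.OrderedDict}
OmegaConf.register_new_resolver("obj", _load_obj)

conf = OmegaConf.create({"model_class": "${obj:collections.OrderedDict}"})
assert conf.model_class is OrderedDict  # resolved lazily on access
```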
Dynamic pipeline - a for loop overriding a subset of parameters
In the `0.16.x` series, it's possible to read parameters to create the pipeline. These are essentially "pipeline parameters", which are different from the parameters that get passed into a `node`.

The architecture diagram clearly shows that the pipeline is now created before the session gets created, which means reading the config loader is not possible.
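Below is a minimal sketch of the for-loop pattern this section describes, using modular pipelines. Every name and date is illustrative, and the list is hard-coded precisely because the config loader is not available at pipeline-creation time:

```python
from kedro.pipeline import Pipeline, node, pipeline

def filter_by_date(data, date):
    # Illustrative node: keep only the rows for one run date.
    return data[data["date"] == date]

def create_pipeline(**kwargs) -> Pipeline:
    template = pipeline(
        [node(filter_by_date, ["raw_data", "params:date"], "filtered")]
    )
    dates = ["2023_01", "2023_02"]  # hard-coded: no config loader here
    copies = [
        pipeline(
            template,
            namespace=f"run_{d}",                            # isolate each copy
            inputs={"raw_data": "raw_data"},                 # share one input dataset
            parameters={"params:date": f"params:date_{d}"},  # per-copy override
        )
        for d in dates
    ]
    return sum(copies, Pipeline([]))
```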
Alternatives could be: `kedro run --params=specific_date`, made into a bash script or something similar to run it N times (a sketch follows).
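For instance, driving the loop from outside Kedro. The parameter name and dates are made up, and the exact `--params` value syntax varies between Kedro versions:

```python
import subprocess

# Run the same pipeline N times, overriding one parameter per run.
for run_date in ["2023-01-01", "2023-02-01", "2023-03-01"]:
    subprocess.run(["kedro", "run", "--params", f"date={run_date}"], check=True)
```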
Relevant issues:
Experimentation turning nodes on & off - dynamic numbers of features
Turning certain nodes on & off for feature experimentation.
This is essentially a parallel pipeline. Imagine a pipeline with 3 feature-generation nodes "A", "B", "C", and an aggregation node "D". A user may want to skip one of the nodes, say "A", but it's not possible to do that from `parameters.yml`, and the user will also have to change the definition of "D". As a workaround, users create dynamic pipelines via YAML. Basically, in your `pipelines.py` you need to make changes like the sketch below.
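A hedged sketch of that workaround; the file path, key name, and feature functions are all made up:

```python
import yaml
from kedro.pipeline import Pipeline, node

# Placeholder feature functions; a real project would import these.
def make_feature_a(df): return df
def make_feature_b(df): return df
def make_feature_c(df): return df

def combine(*features):
    return features  # placeholder for the aggregation logic in "D"

FEATURE_FUNCS = {"A": make_feature_a, "B": make_feature_b, "C": make_feature_c}

def create_pipeline(**kwargs) -> Pipeline:
    # Read a hand-rolled YAML directly: Kedro's ConfigLoader is not
    # available at pipeline-creation time.
    with open("conf/base/pipelines.yml") as f:
        enabled = yaml.safe_load(f)["enabled_features"]  # e.g. ["B", "C"]

    feature_nodes = [
        node(FEATURE_FUNCS[name], "model_input", f"feature_{name}")
        for name in enabled
    ]
    # "D" must be rebuilt from the same list: its inputs change whenever
    # a feature node is switched off, which is the crux of the problem.
    aggregation = node(combine, [f"feature_{name}" for name in enabled], "features")
    return Pipeline(feature_nodes + [aggregation])
```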
Merging a dynamic number of datasets at run-time