Open DimedS opened 2 weeks ago
Rendered docs for anyone else reviewing: https://kedro--3948.org.readthedocs.build/en/3948/
Thank you very much for the comprehensive review, @merelcht! I agree that the current flow is not ideal. Let's brainstorm how we can restructure it. My main idea was to split the content into two docs:
I also agree that the cooking example isn't great. I'd be happy to replace it with a more realistic scenario, like a spaceflights extension. Let's wait for some more feedback and then think how to refine the structure!
@merelcht, @ankatiyar, thank you for your reviews! I have addressed most of your comments. I would appreciate any advice on micropackaging and providing a clearer explanation of namespaces. Additionally, I did not modify the cookie examples for modular pipelines; let's confirm any changes regarding that with @astrojuanlu and Jo.
With micro-packging deprecating, do we fade out the "modular pipeline" concept too? My answer is no since kedro pipeline create will still use it
Agreed, we should keep the term
@astrojuanlu, @noklam, thank you for your comments. I have merged modular pipelines with pipeline objects on one page to simplify things for beginner users. I felt that it wasn't essential for them to understand these as two conceptual objects. Instead, they can simply use "kedro pipeline create," understand how to fill the template, and get started. However, I am not happy with the current structure because it is too long, and the "kedro pipeline create" section is at the end. Therefore, @astrojuanlu, I fully support your proposed structure and will create a separate page for modular pipelines with a clear explanation of the folder structure.
I have changed the structure according to @astrojuanlu's proposal and addressed other minor comments. I am now re-requesting reviews.
Also wondering why the images are smaller now? π
I resized them to smaller ones because previously I saw them (on the screen below) in a very weird way:
Gave another pass, left some comments. We're on the right path!
Thank you for your additional comments, @astrojuanlu. I agree with them and have modified the docs accordingly. It would be great if you could take another look at them.
I resized them to smaller ones because previously I saw them (on the screen below) in a very weird way:
Oh I see. Can we control the size on the docs instead? Something like
```{image} ../meta/images/complete-mlflow-tracking-kedro-mlflow.png
:alt: Complete MLflow tracking with kedro-mlflow
:width: 80%
:align: center
(and we revert the size changes of the images)
I resized them to smaller ones because previously I saw them (on the screen below) in a very weird way:
Oh I see. Can we control the size on the docs instead? Something like
```{image} ../meta/images/complete-mlflow-tracking-kedro-mlflow.png :alt: Complete MLflow tracking with kedro-mlflow :width: 80% :align: center
(and we revert the size changes of the images)
Why do you think it's better to download a larger image and resize it every time a user opens a page, instead of resizing it once to produce a smaller image?
My thinking was that it would be good to retain the original high-resolution images in case somebody has a high density screen, or wants to zoom-in for whatever reason, or we can fully use the high-res version if/when we redesign how images are placed (#2643) or the whole docs (https://github.com/kedro-org/kedro-sphinx-theme/issues/5)
No strong objection though
Thank you for your comments @noklam, @stichbury, @merelcht. I have revised the namespaces section by adding examples based on the spaceflights-pandas starter. I hope this provides clarity on how namespaces work and how to use them effectively. Could you please re-review that section?
Hi @yury-fedotov, thanks a lot for your review! ππΌ
I'd like to add clarification on two bits though:
kedro.pipeline.modular_pipeline.pipeline
and kedro.pipeline.pipeline
are exactly the same, as I painfully discovered in #2723 one year ago when I taught namespaces for the first time. I'd even advocate to make it private API, but that's for another day. Regardless, it's already time to remove it from our docs.kedro pipeline create
) but not all of them use namespaces, and it's fine. What's more: somebody could have non-modular pipelines (hence by not following the directory structure recommended by Kedro) and still use namespaces. The messy terminology made these concepts difficult to understand even for Kedro power users and the Kedro team itself (in fact, this PR was prompted by a discussion about kedro-airflow
), so we considered it was urgent to clarify all these topics.I'll let @DimedS act on the rest of your review comments as he sees fit.
@astrojuanlu
To be on the same page in terminology, by modular pipeline I mean a pipeline that is:
pipeline()
wrapper, which overrides abstract inputs and outputs by specific catalog datasets, and this let's say "instance" is what's being registred.most Kedro projects make use of modular pipelines
Yeah but I'd assume most use them in a non-modular fashion, i.e. not leveraging inputs
, outputs
, parameters
and namespace
arguments of pipeline
. Actually because of this what I do is after kedro pipeline create
I go to pipeline.py
and change the pipeline constructor from pipeline
to Pipeline
if I do not use namespaces.
I don't think modular pipelines and namespaces are connected at all
I'm curious to understand why. IMO there are 2 reasons in Kedro to use modular pipelines:
viz
output and overall mental model by namespacing all internals so that other pipelines care only about what's on the edges.In either case, namespace
is what enables you do that. I don't understand why are they not connected.
Can you give an example of when you'd leverage a modular pipeline without a namespace
?
by modular pipeline I mean a pipeline that is:
I don't claim to be in possession of the right meaning for "modular pipeline" but we are talking about 2 different things. Maybe we need some arbitration :) @idanov @merelcht @noklam
by modular pipeline I mean a pipeline that is:
I don't claim to be in possession of the right meaning for "modular pipeline" but we are talking about 2 different things. Maybe we need some arbitration :) @idanov @merelcht @noklam
Sure, I didn't mean my definition is correct, that's just what I have in mind currently. Would be happy to change my mind if it's inaccurate.
But overall, when you say that modular pipelines and namespaces aren't connected, is there an example of a pipeline that's modular but not using namespaces to act as modular?
is there an example of a pipeline that's modular but not using namespaces to act as modular?
According to my current understanding of what modular pipelines are,
That one is easy:
$ kedro pipeline create
...
def create_pipeline(**kwargs) -> Pipeline:
return pipeline(
[
node(
func=preprocess_companies,
inputs="companies",
outputs="preprocessed_companies",
name="preprocess_companies_node",
),
node(
func=preprocess_shuttles,
inputs="shuttles",
outputs="preprocessed_shuttles",
name="preprocess_shuttles_node",
),
node(
func=create_model_input_table,
inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
outputs="model_input_table",
name="create_model_input_table_node",
),
]
)
In pipeline_registry.py
:
def register_pipelines() -> Dict[str, Pipeline]:
base_pipeline = pipeline(
[
node(
func=split_data,
inputs=["model_input_table", "params:model_options"],
outputs=["X_train", "X_test", "y_train", "y_test"],
name="split_data_node",
),
node(
func=train_model,
inputs=["X_train", "y_train"],
outputs="regressor",
name="train_model_node",
),
node(
func=evaluate_model,
inputs=["regressor", "X_test", "y_test"],
outputs=None,
name="evaluate_model_node",
),
]
)
ds_pipeline_1 = pipeline(
pipe=base_pipeline,
inputs="model_input_table",
namespace="active_modelling_pipeline",
)
ds_pipeline_2 = pipeline(
pipe=base_pipeline,
inputs="model_input_table",
namespace="candidate_modelling_pipeline",
)
return {"__default__": ds_pipeline_1 + ds_pipeline_2}
So the pipeline is defined directly in the pipeline_registry.py
, hence not "modular" (unusual but nonetheless allowed), and it makes use of namespaces.
Also note that our docs already say that the way to create & delete modular pipelines is with kedro pipeline create/delete
https://docs.kedro.org/en/0.19.6/development/commands_reference.html#modular-pipelines
@astrojuanlu thanks for providing those examples!
Yeah I think we just define modular pipelines differently. To simplify context a bit, let's assume that kedro pipeline create/delete
do not exist, and we create pipeline folder structure ourselves. It actually takes a few seconds to do, so I think this example is quite realistic.
Pipeline()
constructor as give it a list of nodes?base_pipeline
is what I would call modular. It is reusable and not registred itself, but rather 2 instances of that are created. They are created using namespaces, which allow reusability. I get why you call it not modular, but this is just because it's defined directly in the pipeline_registry.py
, which to me is a weird pattern. If we take all this:base_pipeline = pipeline(
[
node(
func=split_data,
inputs=["model_input_table", "params:model_options"],
outputs=["X_train", "X_test", "y_train", "y_test"],
name="split_data_node",
),
node(
func=train_model,
inputs=["X_train", "y_train"],
outputs="regressor",
name="train_model_node",
),
node(
func=evaluate_model,
inputs=["regressor", "X_test", "y_test"],
outputs=None,
name="evaluate_model_node",
),
]
)
ds_pipeline_1 = pipeline(
pipe=base_pipeline,
inputs="model_input_table",
namespace="active_modelling_pipeline",
)
ds_pipeline_2 = pipeline(
pipe=base_pipeline,
inputs="model_input_table",
namespace="candidate_modelling_pipeline",
)
Move it to manually created pipelines/data_science/pipeline.py
, and then import ds_pipeline_1
and ds_pipeline_2
in registry, and registred. Would this make this pipeline modular?
I'm happy to do a final read through when further requested changes are complete. Please just ping me @DimedS when you're ready!
Many thanks for your review, @yury-fedotov, and your comments, @astrojuanlu. I mostly agree with @astrojuanlu's understanding that modular pipelines are currently a more virtual concept - just a folder structure. For me, clarifying exactly what that means is not as important. I actually wanted to remove the term "modular" entirely to not to confuse users. What's more important to me is to clarify for users how to effectively use Kedro to solve their different tasks.
I like the current docs structure from that perspective because it's more task-oriented now:
pipeline create
command, and describe the general pipeline structure.
Description
Revised the pipeline and modular-pipelines documentation sections, introduced namespaces docs section.
Main Changes:
Pipeline creation structure
section, which describes how to create a pipeline after using thekedro pipeline create
command, what to include innodes.py
andpipeline.py
, and why, as well as other available options.Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a
Signed-off-by
line in the commit message. See our wiki for guidance.If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md
file