kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Revise modular pipelines docs #3948

Open DimedS opened 2 weeks ago

DimedS commented 2 weeks ago

Description

Revised the pipeline and modular-pipelines documentation sections and introduced a new namespaces docs section.

Main Changes:

  1. Reordered topics within these sections: simple usage is now in the pipeline introduction, while pipeline reuse is covered in the second, advanced-level document.
  2. Added a table of contents with links at the top of the pipeline introduction.
  3. Also added a Pipeline creation structure section, which describes how to create a pipeline after using the kedro pipeline create command, what to include in nodes.py and pipeline.py and why, as well as other available options (see the sketch after this list).
  4. Grouped key pipeline methods and operations.
  5. Simplified the pipeline reuse documentation to focus more on practical implementation.
  6. Reduced the size of pictures by 60% as they were too large and not useful.
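
For reference, the kind of layout that new section walks through looks roughly like this. This is only a simplified sketch of the files created by kedro pipeline create, with a placeholder node function and dataset names:

```python
# src/<package_name>/pipelines/<pipeline_name>/nodes.py
# Plain Python functions that become pipeline nodes (placeholder example).
def preprocess(raw_data):
    ...
    return raw_data


# src/<package_name>/pipelines/<pipeline_name>/pipeline.py
# Wires the node functions into a pipeline via create_pipeline().
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import preprocess


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(preprocess, inputs="raw_data", outputs="preprocessed_data", name="preprocess_node"),
        ]
    )
```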

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

merelcht commented 2 weeks ago

Rendered docs for anyone else reviewing: https://kedro--3948.org.readthedocs.build/en/3948/

DimedS commented 2 weeks ago

Thank you very much for the comprehensive review, @merelcht! I agree that the current flow is not ideal. Let's brainstorm how we can restructure it. My main idea was to split the content into two docs:

  1. Basic topics that everyone should know (including how to use the kedro pipeline create command and our recommended practices for writing pipeline code).
  2. Advanced topics on pipeline reuse, including namespacing and overriding.

I also agree that the cooking example isn't great. I'd be happy to replace it with a more realistic scenario, like a spaceflights extension. Let's wait for some more feedback and then think about how to refine the structure!

DimedS commented 1 week ago

@merelcht, @ankatiyar, thank you for your reviews! I have addressed most of your comments. I would appreciate any advice on micropackaging and on providing a clearer explanation of namespaces. Additionally, I did not modify the cooking examples for modular pipelines; let's confirm any changes regarding those with @astrojuanlu and Jo.

astrojuanlu commented 1 week ago

With micro-packaging being deprecated, do we phase out the "modular pipeline" concept too? My answer is no, since kedro pipeline create will still use it.

Agreed, we should keep the term

DimedS commented 1 week ago

@astrojuanlu, @noklam, thank you for your comments. I have merged modular pipelines with pipeline objects on one page to simplify things for beginner users. I felt that it wasn't essential for them to understand these as two conceptual objects. Instead, they can simply use "kedro pipeline create," understand how to fill the template, and get started. However, I am not happy with the current structure because it is too long, and the "kedro pipeline create" section is at the end. Therefore, @astrojuanlu, I fully support your proposed structure and will create a separate page for modular pipelines with a clear explanation of the folder structure.

DimedS commented 1 week ago

I have changed the structure according to @astrojuanlu's proposal and addressed other minor comments. I am now re-requesting reviews.

DimedS commented 1 week ago

Also wondering why the images are smaller now? πŸ˜…

I resized them to smaller ones because previously I saw them (on the screen below) in a very weird way:

[Screenshot 2024-06-24 at 11 33 13]

DimedS commented 1 week ago

Gave another pass, left some comments. We're on the right path!

Thank you for your additional comments, @astrojuanlu. I agree with them and have modified the docs accordingly. It would be great if you could take another look at them.

astrojuanlu commented 1 week ago

I resized them to smaller ones because previously I saw them (on the screen below) in a very weird way:

Oh I see. Can we control the size on the docs instead? Something like

```{image} ../meta/images/complete-mlflow-tracking-kedro-mlflow.png
:alt: Complete MLflow tracking with kedro-mlflow
:width: 80%
:align: center
```

(and we revert the size changes of the images)
DimedS commented 1 week ago

Oh I see. Can we control the size on the docs instead? Something like [...] (and we revert the size changes of the images)

Why do you think it's better to download a larger image and resize it every time a user opens a page, instead of resizing it once to produce a smaller image?

astrojuanlu commented 1 week ago

My thinking was that it would be good to retain the original high-resolution images in case somebody has a high-density screen or wants to zoom in for whatever reason, or so we can fully use the high-res versions if/when we redesign how images are placed (#2643) or the whole docs (https://github.com/kedro-org/kedro-sphinx-theme/issues/5).

No strong objection though

DimedS commented 3 days ago

Thank you for your comments @noklam, @stichbury, @merelcht. I have revised the namespaces section by adding examples based on the spaceflights-pandas starter. I hope this provides clarity on how namespaces work and how to use them effectively. Could you please re-review that section?

astrojuanlu commented 1 day ago

Hi @yury-fedotov, thanks a lot for your review! πŸ™πŸΌ

I'd like to add clarification on two bits though:

I'll let @DimedS act on the rest of your review comments as he sees fit.

yury-fedotov commented 1 day ago

@astrojuanlu

To be on the same page in terminology, by modular pipeline I mean a pipeline that is:

  1. Defined but not registered itself.
  2. Leveraged in the pipeline() wrapper, which overrides its abstract inputs and outputs with specific catalog datasets; this, let's say, "instance" is what gets registered.
  3. For simplicity, let's assume we only do one level of namespacing.

most Kedro projects make use of modular pipelines

Yeah, but I'd assume most use them in a non-modular fashion, i.e. without leveraging the inputs, outputs, parameters and namespace arguments of pipeline(). Actually, because of this, what I do after kedro pipeline create is go to pipeline.py and change the pipeline constructor from pipeline to Pipeline if I don't use namespaces.
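
Roughly this kind of change in the generated pipeline.py (the node function and dataset names below are placeholders):

```python
# pipeline.py, same nodes as before, just constructed with the Pipeline class
# directly instead of the pipeline() wrapper, since nothing is remapped or namespaced
from kedro.pipeline import Pipeline, node

from .nodes import preprocess  # placeholder node function


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(preprocess, inputs="raw_data", outputs="preprocessed_data", name="preprocess_node"),
        ]
    )
```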

I don't think modular pipelines and namespaces are connected at all

I'm curious to understand why. IMO there are 2 reasons in Kedro to use modular pipelines:

  1. Simplify viz output and overall mental model by namespacing all internals so that other pipelines care only about what's on the edges.
  2. Reuse the same pipeline structure more than once.

In either case, a namespace is what enables you to do that. I don't understand why they are not connected.
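
For the first case, what I have in mind is roughly this sketch (the function and dataset names are made up for illustration):

```python
from kedro.pipeline import node, pipeline


def make_features(raw_table):  # placeholder node functions
    ...


def select_features(features):
    ...


base = pipeline(
    [
        node(make_features, inputs="raw_table", outputs="features", name="make_features"),
        node(select_features, inputs="features", outputs="selected_features", name="select_features"),
    ]
)

# Only the declared edges ("raw_table" in, "selected_features" out) keep their catalog
# names; the intermediate "features" dataset and the node names get the
# "feature_engineering." prefix, so other pipelines (and viz) only care about the edges.
feature_pipeline = pipeline(
    base,
    inputs="raw_table",
    outputs="selected_features",
    namespace="feature_engineering",
)
```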

Can you give an example of when you'd leverage a modular pipeline without a namespace?

astrojuanlu commented 1 day ago

by modular pipeline I mean a pipeline that is:

I don't claim to be in possession of the right meaning for "modular pipeline" but we are talking about 2 different things. Maybe we need some arbitration :) @idanov @merelcht @noklam

yury-fedotov commented 23 hours ago

by modular pipeline I mean a pipeline that is:

I don't claim to be in possession of the right meaning for "modular pipeline" but we are talking about 2 different things. Maybe we need some arbitration :) @idanov @merelcht @noklam

Sure, I didn't mean my definition is correct, that's just what I have in mind currently. Would be happy to change my mind if it's inaccurate.

But overall, when you say that modular pipelines and namespaces aren't connected, is there an example of a pipeline that's modular but not using namespaces to act as modular?

astrojuanlu commented 23 hours ago

is there an example of a pipeline that's modular but not using namespaces to act as modular?

According to my current understanding of what modular pipelines are,

Modular with no namespaces

That one is easy:

```
$ kedro pipeline create
```

...

```python
# e.g. pipelines/data_processing/pipeline.py in the spaceflights starter
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )
```

Not modular with namespaces

In pipeline_registry.py:

```python
from typing import Dict

from kedro.pipeline import Pipeline, node, pipeline

# the node functions are plain Python functions; here they're imported from the
# spaceflights starter's data_science nodes module (any importable location works)
from spaceflights.pipelines.data_science.nodes import evaluate_model, split_data, train_model


def register_pipelines() -> Dict[str, Pipeline]:
    base_pipeline = pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:model_options"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
            node(
                func=evaluate_model,
                inputs=["regressor", "X_test", "y_test"],
                outputs=None,
                name="evaluate_model_node",
            ),
        ]
    )
    ds_pipeline_1 = pipeline(
        pipe=base_pipeline,
        inputs="model_input_table",
        namespace="active_modelling_pipeline",
    )
    ds_pipeline_2 = pipeline(
        pipe=base_pipeline,
        inputs="model_input_table",
        namespace="candidate_modelling_pipeline",
    )

    return {"__default__": ds_pipeline_1 + ds_pipeline_2}
```

So the pipeline is defined directly in the pipeline_registry.py, hence not "modular" (unusual but nonetheless allowed), and it makes use of namespaces.

astrojuanlu commented 22 hours ago

Also note that our docs already say that the way to create & delete modular pipelines is with kedro pipeline create/delete https://docs.kedro.org/en/0.19.6/development/commands_reference.html#modular-pipelines

yury-fedotov commented 21 hours ago

@astrojuanlu thanks for providing those examples!

Yeah, I think we just define modular pipelines differently. To simplify the context a bit, let's assume that kedro pipeline create/delete do not exist and we create the pipeline folder structure ourselves. It only takes a few seconds to do, so I think this example is quite realistic.

```python
from kedro.pipeline import node, pipeline

from .nodes import evaluate_model, split_data, train_model  # assuming a sibling nodes.py

base_pipeline = pipeline(
    [
        node(
            func=split_data,
            inputs=["model_input_table", "params:model_options"],
            outputs=["X_train", "X_test", "y_train", "y_test"],
            name="split_data_node",
        ),
        node(
            func=train_model,
            inputs=["X_train", "y_train"],
            outputs="regressor",
            name="train_model_node",
        ),
        node(
            func=evaluate_model,
            inputs=["regressor", "X_test", "y_test"],
            outputs=None,
            name="evaluate_model_node",
        ),
    ]
)

ds_pipeline_1 = pipeline(
    pipe=base_pipeline,
    inputs="model_input_table",
    namespace="active_modelling_pipeline",
)

ds_pipeline_2 = pipeline(
    pipe=base_pipeline,
    inputs="model_input_table",
    namespace="candidate_modelling_pipeline",
)
```

Move it to a manually created pipelines/data_science/pipeline.py, then import ds_pipeline_1 and ds_pipeline_2 in the registry and register them. Would this make this pipeline modular?
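
For clarity, the registry I have in mind would then be roughly this (the package name is illustrative):

```python
# pipeline_registry.py
from typing import Dict

from kedro.pipeline import Pipeline

# illustrative package name; the two wrapped pipelines come from the manually
# created pipelines/data_science/pipeline.py shown above
from my_project.pipelines.data_science.pipeline import ds_pipeline_1, ds_pipeline_2


def register_pipelines() -> Dict[str, Pipeline]:
    return {"__default__": ds_pipeline_1 + ds_pipeline_2}
```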

stichbury commented 11 hours ago

I'm happy to do a final read through when further requested changes are complete. Please just ping me @DimedS when you're ready!

DimedS commented 9 hours ago

Many thanks for your review, @yury-fedotov, and your comments, @astrojuanlu. I mostly agree with @astrojuanlu's understanding that modular pipelines are currently more of a virtual concept, just a folder structure. For me, clarifying exactly what the term means is not as important; I actually wanted to remove the term "modular" entirely so as not to confuse users. What matters more to me is clarifying for users how to use Kedro effectively to solve their different tasks.

I like the current docs structure from that perspective because it's more task-oriented now:

  1. In the intro, we explain that a pipeline is a combination of nodes with auto-execution order resolution.
  2. In the modular pipelines section, we discuss the benefits of storing pipelines in separate folders, explain how to use the pipeline create command, and describe the general pipeline structure.
  3. In the last and most complicated section, we explain how to reuse pipelines with namespaces. This is definitely an advanced-level section, and I hope it is now clearer for beginners.

@yury-fedotov, I will address your other comments today.