kedro-org / kedro-viz

Visualise your Kedro data and machine-learning pipelines and track your experiments.
https://demo.kedro.org
Apache License 2.0

While nesting namespace pipelines, intermediary datasets get exposed to the top level of the `viz` #1814

Open yury-fedotov opened 3 months ago

yury-fedotov commented 3 months ago

Description

I may have found a bug in how Kedro-Viz visualizes nested namespace pipelines. In short, if an outer namespace (I will use `processing` in my example going forward) has a single input and a single free output, I would expect Kedro-Viz to visually collapse everything between that input and output into the namespace. Instead, it also exposes the datasets shared between the inner namespaced pipelines at the top level of the viz.

Context

I encountered this issue in my project and, as discussed with @rashidakanchwala, I am opening an issue so the team can take a more detailed look. While writing this issue, as you'll see below, I also created a very compact example that reproduces the situation.

Steps to Reproduce

Create a new Kedro project with Kedro-Viz installed, and define the following pipeline:

```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline


def _get_generic_pipe() -> Pipeline:
    # A one-node identity pipeline, reused for both inner namespaces.
    return Pipeline([
        node(
            func=lambda x: x,
            inputs="input_df",
            outputs="output_df",
        ),
    ])


def create_pipeline(**kwargs) -> Pipeline:
    # Two inner namespaced pipelines chained through `post_first_pipe`.
    pipe = Pipeline([
        pipeline(
            pipe=_get_generic_pipe(),
            inputs={"input_df": "input_to_processing"},
            outputs={"output_df": "post_first_pipe"},
            namespace="first_processing_step",
        ),
        pipeline(
            pipe=_get_generic_pipe(),
            inputs={"input_df": "post_first_pipe"},
            outputs={"output_df": "output_from_processing"},
            namespace="second_processing_step",
        ),
    ])
    # Wrap both steps in the outer `processing` namespace, leaving only
    # one free input and one free output.
    return pipeline(
        pipe=pipe,
        inputs="input_to_processing",
        outputs="output_from_processing",
        namespace="processing",
    )
```

Then run `kedro viz run` and observe that the `post_first_pipe` dataset, which should be fully encapsulated within the `processing` namespace, is exposed at the top level of the viz.

Expected Result

Since the `post_first_pipe` dataset is fully internal to the `processing` namespace, it should be visually encapsulated there and not exposed at the top level of the viz.
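To make "fully internal" concrete, here is a minimal pure-Python sketch of the rule I have in mind. It is only an illustration of the expected behaviour, not how Kedro-Viz actually implements grouping. `Node`, `in_namespace` and `internal_datasets` are hypothetical helpers made up for this example, and the dataset names are taken as written in the pipeline above (the names Kedro resolves at runtime may carry a namespace prefix). The rule: a dataset is internal to a namespace if it has both a producer and a consumer, and every node touching it lives in that namespace or one nested under it.

```python
from typing import NamedTuple


class Node(NamedTuple):
    """Simplified stand-in for a Kedro node (hypothetical, not kedro's own class)."""
    namespace: str  # dotted namespace, e.g. "processing.first_processing_step"
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def in_namespace(node_namespace: str, namespace: str) -> bool:
    """True if node_namespace equals namespace or is nested under it."""
    return node_namespace == namespace or node_namespace.startswith(namespace + ".")


def internal_datasets(nodes: list[Node], namespace: str) -> set[str]:
    """Datasets that are produced and consumed only by nodes inside `namespace`."""
    producers: dict[str, list[str]] = {}
    consumers: dict[str, list[str]] = {}
    for n in nodes:
        for ds in n.outputs:
            producers.setdefault(ds, []).append(n.namespace)
        for ds in n.inputs:
            consumers.setdefault(ds, []).append(n.namespace)

    internal = set()
    # Free inputs (no producer) and free outputs (no consumer) are never internal.
    for ds in producers.keys() & consumers.keys():
        touching = producers[ds] + consumers[ds]
        if all(in_namespace(ns, namespace) for ns in touching):
            internal.add(ds)
    return internal


# The two nodes from the reproduction pipeline, with their nested namespaces.
nodes = [
    Node("processing.first_processing_step", ("input_to_processing",), ("post_first_pipe",)),
    Node("processing.second_processing_step", ("post_first_pipe",), ("output_from_processing",)),
]

all_datasets = {ds for n in nodes for ds in n.inputs + n.outputs}
hidden = internal_datasets(nodes, "processing")
visible_at_top_level = all_datasets - hidden

print(sorted(hidden))                # ['post_first_pipe']
print(sorted(visible_at_top_level))  # ['input_to_processing', 'output_from_processing']
```

Under this rule, collapsing `processing` would leave only `input_to_processing` and `output_from_processing` visible at the top level, which is the result I expected to see.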

Actual Result

What I actually see in the viz is this:

[Image: Screenshot 2024-03-19 at 6 08 48 PM]

Let me highlight a few things here:
