kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Pipeline trees - instead of kedro viz, just a tree chart #1233

Closed eepgwde closed 2 years ago

eepgwde commented 2 years ago

Description

I'm working on a large kedro system: a dozen or so pipelines, 200 tables, 1000 parameters.

I want to see how a change to a table in the raw or intermediate layer will affect other tables. I used kedro viz, but it isn't working for me at the moment, and it isn't really what I want anyway. I want a command-line dependency checker.

kedro why-not raw_sales

I want to see all tables dependent on it: int_sales, pri_sales_orders.

Similarly for parameters. And I want it in some machine-readable form: JSON or XML.

Package management systems like apt already have dependency checks: systemctl has list-dependencies on a target, and aptitude has why and why-not, for example:

aptitude why openssh-client
i   task-ssh-server Depends openssh-server
i A openssh-server  Depends openssh-client (= 1:7.9p1-10+deb10u2)

Advanced DBMSes have a means of interrogating schema and drilling into them.

Context

Why is this change important to you? How would you use it? How can it benefit other users?

I'm working on a large kedro system: a dozen or so pipelines, 200 tables, 1000 parameters. It's multi-vendor and multi-developer. kedro viz is useful, if you can get it working, but it isn't easy to report with. A text tool would be more reliable.

Possible Implementation

The information is already there, because viz must use it: it is in the catalog, via the pipelines' inputs and outputs.

Possible Alternatives

There may be a text version of the tree that kedro viz renders, somewhere within kedro viz. There may even be a whole manual section that describes how to do what I want!

antonymilne commented 2 years ago

This is actually pretty easy to implement using the following ingredients:

  1. Pipeline has several useful methods for filtering, including to_outputs and from_inputs, which respectively give all the ancestor and descendant nodes for given dataset name(s)
  2. The nodes property returns nodes in the "correct" (i.e. topologically sorted) order
  3. Node exposes the dataset inputs and outputs
  4. You can add custom commands to your project cli.py
  5. If you want to reuse this across projects, you can turn your custom command into a pip installable plugin

Here's a simple example of what the code in your project cli.py could look like:

import itertools
import json

import click

from kedro.framework.cli.utils import CONTEXT_SETTINGS
from kedro.framework.project import pipelines

@click.group(context_settings=CONTEXT_SETTINGS, name=__file__)
def cli():
    pass

@cli.command()
@click.argument("dataset", nargs=1)
@click.option("--pipeline", default="__default__", help="Optional name of the pipeline to use to look at lineage. Defaults to `__default__`.")
def lineage(dataset: str, pipeline: str):
    """Show ancestors and descendants of `dataset` in `pipeline`."""
    # to_outputs keeps only the nodes needed to produce `dataset` (its ancestors);
    # from_inputs keeps only the nodes that consume it, directly or indirectly (its descendants).
    ancestor_nodes = pipelines[pipeline].to_outputs(dataset).nodes
    descendant_nodes = pipelines[pipeline].from_inputs(dataset).nodes

    # Flatten the node-level inputs/outputs into flat lists of dataset names
    # (duplicates are possible when several nodes share a dataset).
    ancestor_datasets = itertools.chain.from_iterable(node.inputs for node in ancestor_nodes)
    descendant_datasets = itertools.chain.from_iterable(node.outputs for node in descendant_nodes)

    click.echo(json.dumps({"ancestors": list(ancestor_datasets), "descendants": list(descendant_datasets)}, indent=2))

Example output on this demo project, running kedro lineage model_input_table:

{
  "ancestors": [
    "reviews",
    "params:typing.reviews.columns_as_floats",
    "companies",
    "shuttles",
    "data_ingestion.int_typed_companies",
    "data_ingestion.int_typed_shuttles",
    "data_ingestion.prm_agg_companies",
    "data_ingestion.int_typed_reviews",
    "prm_spine_table",
    "prm_shuttle_company_reviews",
    "params:feature.derived",
    "prm_shuttle_company_reviews",
    "params:feature.static",
    "prm_spine_table",
    "feature_engineering.feat_static_features",
    "feature_engineering.feat_derived_features"
  ],
  "descendants": [
    "X_train",
    "X_test",
    "y_train",
    "y_test",
    "train_evaluation.linear_regression.regressor",
    "train_evaluation.hyperparams_linear_regression",
    "train_evaluation.random_forest.regressor",
    "train_evaluation.hyperparams_random_forest",
    "train_evaluation.r2_score_linear_regression",
    "train_evaluation.r2_score_random_forest"
  ]
}
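For ingredient 5, a minimal setup.py for packaging the command as a pip-installable plugin might look like this. The distribution name kedro-lineage and module kedro_lineage are illustrative; Kedro discovers project-level commands through the kedro.project_commands entry point group:

```python
# setup.py for a hypothetical "kedro-lineage" plugin; "kedro_lineage" is
# an illustrative module that would contain the click group shown above.
from setuptools import setup

setup(
    name="kedro-lineage",
    version="0.1.0",
    py_modules=["kedro_lineage"],
    install_requires=["kedro", "click"],
    entry_points={
        # Kedro scans this entry point group for project CLI commands.
        "kedro.project_commands": ["lineage = kedro_lineage:cli"],
    },
)
```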
datajoely commented 2 years ago

What I would also say is that the modern modular pipeline pattern with namespaces lets you reduce the complexity of the graph and arbitrarily nest pipelines. Have a look at an example here: http://demo.kedro.org/

merelcht commented 2 years ago

Hi @eepgwde , have you been able to solve your issue with the suggestions from @AntonyMilneQB and @datajoely or do you need more help?

merelcht commented 2 years ago

Closing this for now. @eepgwde feel free to re-open the issue if you need more help with it!