Closed eepgwde closed 2 years ago
This is actually pretty easy to implement using the following ingredients:
- `Pipeline` has several useful methods for filtering, including `to_outputs` and `from_inputs`, which respectively give all the ancestor and descendant nodes for given dataset name(s)
- the `nodes` property returns nodes in the "correct" (i.e. topologically sorted) order
- `Node` exposes the dataset `inputs` and `outputs`
- the project's `cli.py`
Here's a simple example of what the code in your project `cli.py` could look like:
import itertools
import json

import click
from kedro.framework.cli.utils import CONTEXT_SETTINGS
from kedro.framework.project import pipelines


@click.group(context_settings=CONTEXT_SETTINGS, name=__file__)
def cli():
    pass


@cli.command()
@click.argument("dataset", nargs=1)
@click.option("--pipeline", default="__default__", help="Optional name of the pipeline to use to look at lineage. Defaults to `__default__`.")
def lineage(dataset: str, pipeline: str):
    """Show ancestors and descendants of `dataset` in `pipeline`."""
    ancestor_nodes = pipelines[pipeline].to_outputs(dataset).nodes
    descendant_nodes = pipelines[pipeline].from_inputs(dataset).nodes
    ancestor_datasets = itertools.chain.from_iterable(node.inputs for node in ancestor_nodes)
    descendant_datasets = itertools.chain.from_iterable(node.outputs for node in descendant_nodes)
    click.echo(json.dumps({"ancestors": list(ancestor_datasets), "descendants": list(descendant_datasets)}, indent=2))
Example output on this demo project, running `kedro lineage model_input_table`:
{
  "ancestors": [
    "reviews",
    "params:typing.reviews.columns_as_floats",
    "companies",
    "shuttles",
    "data_ingestion.int_typed_companies",
    "data_ingestion.int_typed_shuttles",
    "data_ingestion.prm_agg_companies",
    "data_ingestion.int_typed_reviews",
    "prm_spine_table",
    "prm_shuttle_company_reviews",
    "params:feature.derived",
    "prm_shuttle_company_reviews",
    "params:feature.static",
    "prm_spine_table",
    "feature_engineering.feat_static_features",
    "feature_engineering.feat_derived_features"
  ],
  "descendants": [
    "X_train",
    "X_test",
    "y_train",
    "y_test",
    "train_evaluation.linear_regression.regressor",
    "train_evaluation.hyperparams_linear_regression",
    "train_evaluation.random_forest.regressor",
    "train_evaluation.hyperparams_random_forest",
    "train_evaluation.r2_score_linear_regression",
    "train_evaluation.r2_score_random_forest"
  ]
}
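Note that the lists in this output contain repeats (`prm_spine_table` and `prm_shuttle_company_reviews` each appear twice among the ancestors), presumably because a dataset that feeds several nodes is listed once per node. A minimal sketch of de-duplicating while keeping the first-seen order, assuming plain lists of names; `unique_in_order` is a hypothetical helper, not part of Kedro:

```python
from typing import Iterable, List


def unique_in_order(names: Iterable[str]) -> List[str]:
    """Drop repeated names, keeping the first occurrence of each."""
    # dict keys preserve insertion order (Python 3.7+), so ordering stays stable.
    return list(dict.fromkeys(names))


if __name__ == "__main__":
    sample = ["prm_spine_table", "prm_shuttle_company_reviews", "prm_spine_table"]
    print(unique_in_order(sample))
```

In the `lineage` command, this could wrap the two `itertools.chain` results before they are passed to `json.dumps`.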
What I would also say is that the modern modular pipeline pattern with namespaces allows you to simplify the complexity of the graph and arbitrarily nest pipelines. Have a look at an example here http://demo.kedro.org/
Hi @eepgwde, have you been able to solve your issue with the suggestions from @AntonyMilneQB and @datajoely, or do you need more help?
Closing this for now. @eepgwde feel free to re-open the issue if you need more help with it!
Description
I'm working on a large kedro system. Dozen or so pipelines, 200 tables, 1000 parameters.
I want to see how a change in a raw or intermediate table will affect other tables. I used kedro viz, but it isn't working for me at the moment and it isn't what I really want. I want a command-line dependency checker.
kedro why-not raw_sales
I want to see all tables dependent on it, e.g. int_sales and pri_sales_orders.
Similarly for parameters. And I want it in some machine readable form: JSON or some XML.
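In Kedro, parameters show up among a node's inputs as names prefixed with `params:` (or as the single `parameters` entry), so a flat list of names could be split into tables and parameters and then dumped as JSON. A rough sketch under that assumption; `split_params` is a made-up helper name, not a Kedro function:

```python
import json
from typing import Dict, Iterable, List


def split_params(names: Iterable[str]) -> Dict[str, List[str]]:
    """Separate parameter entries from dataset (table) names."""
    result: Dict[str, List[str]] = {"datasets": [], "parameters": []}
    for name in names:
        # Kedro exposes parameters as "parameters" or "params:<key>".
        is_param = name == "parameters" or name.startswith("params:")
        result["parameters" if is_param else "datasets"].append(name)
    return result


if __name__ == "__main__":
    inputs = ["reviews", "params:typing.reviews.columns_as_floats", "companies"]
    print(json.dumps(split_params(inputs), indent=2))
```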
With package management systems like apt, there are dependency checks: systemctl has list-dependencies on a target, aptitude has why and why-not, for example
aptitude why openssh-client
i   task-ssh-server Depends openssh-server
i A openssh-server  Depends openssh-client (= 1:7.9p1-10+deb10u2)
Advanced DBMSes have a means of interrogating schema and drilling into them.
Context
Why is this change important to you? How would you use it? How can it benefit other users?
I'm working on a large kedro system. Dozen or so pipelines, 200 tables, 1000 parameters. It's multi-vendor, it's multi-developer. kedro viz is useful, if you can get it working, but it isn't easy to report with. A text tool would be more reliable.
Possible Implementation
The information is there because viz must use it. It is in the catalog via the pipelines - the inputs and outputs.
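To illustrate the idea without depending on Kedro itself: if each node is reduced to its input and output names, the tables affected by a change can be found with a plain breadth-first walk. This is only a sketch under that assumption. The `(inputs, outputs)` tuples, the `dependents` function, and the names `params:orders`, `raw_stock` and `int_stock` are hypothetical; the other table names are the ones used in the description.

```python
from collections import deque
from typing import Iterable, List, Sequence, Tuple

# Hypothetical stand-in for a pipeline node: (input names, output names).
NodeIO = Tuple[Sequence[str], Sequence[str]]


def dependents(nodes: Iterable[NodeIO], dataset: str) -> List[str]:
    """Return every dataset reachable downstream of `dataset`, breadth-first."""
    nodes = list(nodes)
    seen = {dataset}  # guards against revisiting names
    found: List[str] = []
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for inputs, outputs in nodes:
            if current not in inputs:
                continue
            for out in outputs:
                if out not in seen:
                    seen.add(out)
                    found.append(out)
                    queue.append(out)
    return found


if __name__ == "__main__":
    demo = [
        (["raw_sales"], ["int_sales"]),
        (["int_sales", "params:orders"], ["pri_sales_orders"]),
        (["raw_stock"], ["int_stock"]),
    ]
    print(dependents(demo, "raw_sales"))
```

Swapping the test `current not in inputs` for outputs (and collecting inputs) would give the ancestors instead.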
Possible Alternatives
There may be a text version of the tree that kedro viz uses internally. There may be a whole manual section that describes how to do what I want!