kedro-org / kedro-viz

Visualise your Kedro data and machine-learning pipelines and track your experiments.
https://demo.kedro.org
Apache License 2.0
676 stars 111 forks

Enable Kedro-Viz functionality through a notebook, without Kedro Framework. #1459

Closed NeroOkwa closed 6 months ago

NeroOkwa commented 1 year ago

Description

Make it possible to use Kedro-Viz (pipeline visualisation and experiment tracking) without Kedro framework by using a notebook.

For example, I could build a pipeline in a notebook with nodes that output metrics, run %run_viz, and Kedro-Viz would open with a view of my pipeline and experiments.

Context

Currently, Kedro-Viz is tightly coupled with the Kedro framework, making it impossible for non-Kedro users to use Kedro-Viz. This was highlighted as a pain point in the experiment tracking user research:

"In this case if I really like experiment tracking I might not consider using it if it isn't a kedro project... I am not sure it is a good direction to go with it being completely integrated, especially if there is a new thing like Mlflow"

Secondly, from the non-technical user research https://github.com/kedro-org/kedro-viz/issues/1280 we discovered a group of 'low-code' users who only use notebooks (e.g. data analysts, junior data scientists, researchers). This is a sizeable group (estimated at 70%) within data teams. Providing notebook access to Kedro-Viz would make it easier for these users to adopt it.

What's happening?

If I wanted to use Kedro-Viz in a notebook without the Kedro framework, this would not be possible. So if I had a setup like this:

my-project
├── my-notebook.ipynb
├── Customer-Churn-Records.csv
├── parameters.yml
├── catalog.yml
└── requirements.txt

Then I’d never be able to see a pipeline visualisation, even if I had:

requirements.txt

kedro==0.18.11
kedro-viz==6.3.3
kedro-datasets[pandas.CSVDataSet]~=1.1

my-notebook.ipynb

from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline, node
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from typing import Dict
import logging
import pandas as pd

### Insert something new to load catalog.yml and parameters.yml

def preprocess_data(data: pd.DataFrame) -> pd.DataFrame:
    data = data.drop(columns=['RowNumber', 'CustomerId', 'Surname'])
    le = LabelEncoder()
    data['Gender'] = le.fit_transform(data['Gender'])
    data = pd.get_dummies(data, columns=['Geography', 'Card Type'])
    return data

def split_data(data: pd.DataFrame, test_size: float, random_state: int) -> Dict:
    X = data.drop(columns='Exited')
    y = data['Exited']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return dict(train=(X_train, y_train), test=(X_test, y_test))

def train_model(train: Dict, random_state: int) -> RandomForestClassifier:
    X_train, y_train = train['train']
    rf_clf = RandomForestClassifier(random_state=random_state)
    rf_clf.fit(X_train, y_train)
    return rf_clf

def evaluate_model(model: RandomForestClassifier, test: Dict) -> None:
    X_test, y_test = test['test']
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)
    log = logging.getLogger(__name__)
    log.info("Model Accuracy: %s", accuracy)
    log.info("Confusion Matrix: \n%s", confusion_mat)
    log.info("Classification Report: \n%s", class_report)

my_pipeline = Pipeline([
    node(preprocess_data, "customers", "preprocessed_customers"),
    node(split_data, ["preprocessed_customers", "params:test_size", "params:random_state"], "split_data"),
    node(train_model, ["split_data", "params:random_state"], "rf_model"),
    node(evaluate_model, ["rf_model", "split_data"], None),
])

%run_viz my_pipeline

It should be possible to see the following in another cell in my Jupyter notebook, with the option to open it up in another tab:

(Screenshot, 2023-07-18: the expected Kedro-Viz pipeline view)
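The placeholder comment in the snippet ("Insert something new to load catalog.yml and parameters.yml") is the crux of the request. As a dependency-free illustration (not kedro's real config API; a real version would use OmegaConfigLoader and DataCatalog.from_config), here is how a flat parameters.yml could be mapped onto the "params:" inputs that the nodes reference:

```python
# Hypothetical, stdlib-only sketch of the missing step: read a flat
# parameters.yml and expose each entry under the "params:" prefix used by
# node inputs such as "params:test_size". Illustration only.
from pathlib import Path

def load_flat_params(path: str) -> dict:
    """Parse a flat `key: value` YAML file (no nesting; demo parser only)."""
    params = {}
    for line in Path(path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        try:
            value = int(value)
        except ValueError:
            try:
                value = float(value)
            except ValueError:
                pass  # leave as string
        params[key.strip()] = value
    return params

# Write a throwaway parameters.yml just so this demo is self-contained.
Path("parameters.yml").write_text("test_size: 0.2\nrandom_state: 42\n")

# Map each parameter onto the "params:<name>" key that node inputs expect.
feed = {f"params:{k}": v for k, v in load_flat_params("parameters.yml").items()}
```

The "params:" prefix is what lets node inputs like "params:test_size" resolve without a full Kedro project; in real kedro this feed would be handed to the DataCatalog alongside the datasets from catalog.yml.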

Outcome

A user will be able to use Kedro-Viz from a notebook, without the need/setup of a Kedro framework.


datajoely commented 1 year ago

I love this!

yetudada commented 1 year ago

I love this!

What do you love about this? 😄

datajoely commented 1 year ago

I think I have two thoughts -

  1. This is a neat way of making Kedro Viz useful to people who don't want the complexity of the IDE and may be a stepping stone to getting people into that space.

  2. The second point is something I know others have mentioned before - it annoys me that we need to load a valid Kedro project, with all of its imports and dependencies, just to visualise the pipeline flow. Kedro Viz (in my mind) should load instantly; you shouldn't have to wait for Spark to spin up (especially because you can't run the pipeline anyway). I've long thought Kedro should be able to create a session lazily, so you can read the pipeline structure for Viz cheaply without incurring the other costs.
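One way to make that lazy reading concrete is to parse the pipeline module's source with Python's ast module instead of importing it, so expensive imports never execute. This is only a sketch of the idea (the module contents and function names are invented), not how Kedro Viz actually works:

```python
# Sketch: extract node(...) structure from source code without importing it,
# so heavy dependencies (Spark, etc.) are never loaded.
import ast

source = '''
from kedro.pipeline import pipeline, node
import pyspark  # expensive import we never want to trigger

p = pipeline([
    node(preprocess, "raw", "clean"),
    node(train, ["clean", "params:seed"], "model"),
])
'''

def node_calls(src: str):
    """Return the literal (inputs, outputs) pair of each node(...) call."""
    found = []
    for call in ast.walk(ast.parse(src)):
        if (isinstance(call, ast.Call)
                and isinstance(call.func, ast.Name)
                and call.func.id == "node"):
            # node(func, inputs, outputs): take the literal 2nd and 3rd args
            found.append(tuple(ast.literal_eval(a) for a in call.args[1:3]))
    return found
```

Because the module is parsed rather than executed, the pyspark import above never runs; the structure is read "for free".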

astrojuanlu commented 1 year ago

Idea: a kedro-openlineage plugin that emits static OpenLineage metadata events, either in ndjson format or to an HTTP endpoint, which are then consumed by Kedro Viz. This is possible with openlineage-python 1.0, released yesterday.

datajoely commented 1 year ago

100000% - also lots of LFAI projects there, we should deffo do this

datajoely commented 1 year ago

This thread on Slack shows a user wanting to merge Viz from 3 different Kedro projects that can't exist side by side since they have conflicting dependencies. Kedro Viz doesn't need to run these projects; it just needs to visualise the pipeline structure: https://linen-slack.kedro.org/t/14142730/hi-everyone-is-it-possible-to-combine-multiple-kedro-project#d84d8f45-eecc-4c1b-b639-4556c1edcd76
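A sketch of what that could look like: merging only the exported structure of several pipelines, without importing any of the projects. The JSON shape below is hypothetical, not the real Kedro-Viz export format:

```python
# Sketch: combine the node/edge structure of pipelines exported from
# projects whose dependencies conflict. Illustrative data format.
project_a = {"nodes": [{"id": "ingest"}, {"id": "clean"}],
             "edges": [{"source": "ingest", "target": "clean"}]}
project_b = {"nodes": [{"id": "clean"}, {"id": "train"}],
             "edges": [{"source": "clean", "target": "train"}]}

def merge_structures(*graphs):
    """Union the graphs, de-duplicating shared nodes and edges by id."""
    nodes, edges = {}, set()
    for g in graphs:
        for n in g["nodes"]:
            nodes[n["id"]] = n  # shared datasets/nodes collapse to one entry
        for e in g["edges"]:
            edges.add((e["source"], e["target"]))
    return {"nodes": list(nodes.values()),
            "edges": [{"source": s, "target": t} for s, t in sorted(edges)]}

merged = merge_structures(project_a, project_b)
```

Since only structure is merged, the conflicting dependencies of each project never have to be installed together.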

noklam commented 6 months ago

I realised I didn't leave a comment here. I created this last year: https://github.com/noklam/kedro-viz-lite. I actually don't remember if I succeeded in the end; the logic is mostly in https://github.com/noklam/kedro-viz-lite/blob/main/kedro_viz_lite/core.py.

This led to my subsequent proposal for kedro viz build and the Kedro Viz GitHub Pages deployment.

My use case for this is exploring pipeline structure, particularly when I need to confirm my pipeline works as expected with namespaces. The alternative is creating a full-blown Kedro project, which is a lot of boilerplate. All I care about is the DAG, and having the DataCatalog and Pipeline should be enough for that. It's also because kedro viz is quite slow to start up, which makes it hard to use when I just want to debug quickly. (--reload sometimes breaks completely if I have an incomplete Kedro project.)
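The point that a catalog and pipeline are enough for a DAG view can be sketched in a few lines: edges fall out of matching each node's declared inputs to another node's outputs. The tuples below are a hand-rolled stand-in for real kedro Node objects:

```python
# Sketch: derive DAG edges purely from each node's (name, inputs, outputs),
# no Kedro project or session required. Node tuples are illustrative.
nodes = [
    ("preprocess_data", ["customers"], ["preprocessed_customers"]),
    ("split_data", ["preprocessed_customers"], ["split_data"]),
    ("train_model", ["split_data"], ["rf_model"]),
]

def dag_edges(nodes):
    """Return (producer, consumer) edges by matching outputs to inputs."""
    producers = {out: name for name, _, outs in nodes for out in outs}
    edges = []
    for name, ins, _ in nodes:
        for dataset in ins:
            if dataset in producers:  # free inputs (raw data) have no edge
                edges.append((producers[dataset], name))
    return edges
```

Anything not produced by a node (like "customers") is a free input coming from the catalog, which is exactly the split between DataCatalog and Pipeline described above.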

If this adds a bit of context, I was writing https://noklam.github.io/blog/posts/understand_namespace/2023-09-26-understand-kedro-namespace-pipeline.html when I thought about this.