Adding previous notes here for reference -
Experiment with AST:
The core part of Kedro-Viz that needs the dependencies of a Kedro project is the loading of pipeline information.
What we do now -
```python
def load_data(
    project_path: Path,
    env: Optional[str] = None,
    include_hooks: bool = False,
    package_name: Optional[str] = None,
    extra_params: Optional[Dict[str, Any]] = None,
) -> Tuple[DataCatalog, Dict[str, Pipeline], BaseSessionStore, Dict]:
    """Load data from a Kedro project."""
    if package_name:
        configure_project(package_name)
    else:
        # bootstrap project when viz is run in dev mode
        bootstrap_project(project_path)

    with KedroSession.create(
        project_path=project_path,
        env=env,
        save_on_close=False,
        extra_params=extra_params,
    ) as session:
        # check for --include-hooks option
        if not include_hooks:
            session._hook_manager = _VizNullPluginManager()  # type: ignore

        context = session.load_context()
        session_store = session._store
        catalog = context.catalog

        # Pipelines is a lazy dict-like object, so we force it to populate here
        # in case user doesn't have an active session down the line when it's first accessed.
        # Useful for users who have `get_current_session` in their `register_pipelines()`.
        pipelines_dict = dict(pipelines)
        stats_dict = _get_dataset_stats(project_path)

    return catalog, pipelines_dict, session_store, stats_dict
```
For pipelines we have -
```python
pipelines = _ProjectPipelines()

# inside _ProjectPipelines:
@staticmethod
def _get_pipelines_registry_callable(pipelines_module: str) -> Any:
    module_obj = importlib.import_module(pipelines_module)
    register_pipelines = getattr(module_obj, "register_pipelines")
    return register_pipelines
```
Dependency issues with the above code -
- `_get_pipelines_registry_callable` imports the pipeline registry module via `importlib.import_module`, which executes all of the project's top-level imports
- `register_pipelines()` also tries to import the pipelines module by using `find_pipelines()`
Possible solutions -
From the TSC discussion on July 10, 2024, thanks to the team for suggesting an approach to mock imports. I tested this approach and it works well.
Using AST + Mock Imports :
Steps:
What does this PR do - Introduce an option `--lite` in Kedro-Viz CLI. When users try to execute the command `kedro viz --lite`, it uses the approach mentioned above.
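As a rough illustration of the AST + mock imports idea, here is a minimal sketch. The helper name `get_mocked_modules` appears later in this thread, but the body below is an assumption, not the actual Kedro-Viz implementation:

```python
import ast
import importlib.util
from pathlib import Path
from typing import Dict
from unittest.mock import MagicMock


def get_mocked_modules(project_path: Path) -> Dict[str, MagicMock]:
    """Collect absolute imports from every .py file and mock the unresolvable ones."""
    mocked_modules: Dict[str, MagicMock] = {}
    for py_file in project_path.rglob("*.py"):
        tree = ast.parse(py_file.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                names = [node.module]
            else:
                continue  # relative imports are handled separately
            for name in names:
                try:
                    missing = importlib.util.find_spec(name) is None
                except ModuleNotFoundError:
                    # parent package itself is missing
                    missing = True
                if missing:
                    mocked_modules[name] = MagicMock()
    return mocked_modules
```

The collected mocks can then be registered with `sys.modules.update(get_mocked_modules(project_path))` before the project modules are imported, so the unresolved imports succeed with `MagicMock` placeholders.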
NOTE: To re-use most of the data_loader code while creating the Kedro Session, Context and Catalog with mocked dependencies, we may need to modify `from_config` of `DataCatalog` in the Kedro framework.

Lines 306 - 311: to use the `DataCatalog` as-is in the lite version, we need the exception handler to return a `MemoryDataset` (TODO: need to see if this can be achieved via a hook on the Viz side)
```python
try:
    datasets[ds_name] = AbstractDataset.from_config(
        ds_name, ds_config, load_versions.get(ds_name), save_version
    )
except DatasetError as exc:
    datasets[ds_name] = MemoryDataset()
```
This will allow us to initialize the `DataCatalog` without any warnings (example - `WARNING Cannot find parameter feature_engineering.feature.derived in the catalog`). If this is not viable, we can create a `DataCatalog()` instance, load parameters using `parameters = conf_loader["parameters"]`, and add them to the catalog somehow -

```python
feed_dict = self._get_feed_dict()
catalog.add_feed_dict(feed_dict)
```
[UPDATE]: An update on the above issue of missing datasets due to issues parsing catalog entries. I had a discussion with Nok and decided to implement a custom `DataCatalog`, i.e. `DataCatalogLite`, which overrides `from_config` and returns a `MemoryDataset` when a `DatasetError` occurs.
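A minimal sketch of that idea, assuming the default `from_config` internals (credential resolution and versioning details are omitted; the actual `DataCatalogLite` in the PR may differ):

```python
from typing import Any, Dict, Optional

from kedro.io import AbstractDataset, DataCatalog, DatasetError, MemoryDataset


class DataCatalogLite(DataCatalog):
    """DataCatalog variant that falls back to MemoryDataset for
    catalog entries that cannot be instantiated."""

    @classmethod
    def from_config(
        cls,
        catalog: Optional[Dict[str, Dict[str, Any]]],
        credentials: Optional[Dict[str, Dict[str, Any]]] = None,
        load_versions: Optional[Dict[str, str]] = None,
        save_version: Optional[str] = None,
    ) -> "DataCatalogLite":
        datasets = {}
        load_versions = load_versions or {}
        for ds_name, ds_config in (catalog or {}).items():
            try:
                datasets[ds_name] = AbstractDataset.from_config(
                    ds_name, ds_config, load_versions.get(ds_name), save_version
                )
            except DatasetError:
                # unresolvable entry (e.g. missing dependency) -> placeholder
                datasets[ds_name] = MemoryDataset()
        return cls(datasets=datasets)
```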
What does this PR do - Introduce an option `--lite` in Kedro-Viz CLI. When users try to execute the command `kedro viz --lite`, it uses the AST parser to load pipelines.
```python
def load_data(
    project_path: Path,
    env: Optional[str] = None,
    include_hooks: bool = False,
    package_name: Optional[str] = None,
    extra_params: Optional[Dict[str, Any]] = None,
    is_lite: bool = False,
) -> Tuple[DataCatalog, Dict[str, Pipeline], BaseSessionStore, Dict]:
    """Load data from a Kedro project"""
    if is_lite:
        # [TODO: Confirm on the context creation]
        context = KedroContext(
            package_name="{{ cookiecutter.python_package }}",
            project_path=project_path,
            config_loader=OmegaConfigLoader(conf_source=str(project_path)),
            hook_manager=_VizNullPluginManager(),
            env=env,
        )
        # [TODO: Confirm on the session store creation]
        session_store = None
        # [TODO: Confirm on the DataCatalog creation]
        catalog = DataCatalog()
        stats_dict = _get_dataset_stats(project_path)
        pipelines_dict = dict(parse_project(project_path))
        return catalog, pipelines_dict, session_store, stats_dict
    else:
        ...
```
Kedro parser -

```python
def parse_project(project_path: Path) -> Dict[str, Pipeline]:
    ...
```

- Walks through all the `.py` files in the project
- For each file, parses the source into an AST and calls the `visit` method of `KedroPipelineLocator`
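A sketch of what this loop might look like, assuming the `KedroPipelineLocator` defined below (how the resulting pipelines are named and the error handling are simplified assumptions):

```python
import ast
from pathlib import Path
from typing import Dict

from kedro.pipeline import Pipeline


def parse_project(project_path: Path) -> Dict[str, Pipeline]:
    pipelines: Dict[str, Pipeline] = {}
    for py_file in project_path.rglob("*.py"):
        tree = ast.parse(py_file.read_text(encoding="utf-8"))
        locator = KedroPipelineLocator()  # defined below
        locator.visit(tree)
        if locator.pipeline is not None:
            # assumption: key by the containing package,
            # e.g. pipelines/<name>/pipeline.py -> <name>
            pipelines[py_file.parent.name] = locator.pipeline
    return pipelines
```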
```python
class KedroPipelineLocator(ast.NodeVisitor):
    """
    Represents a pipeline that is located when parsing
    the Kedro project's `create_pipeline` function
    """

    def __init__(self):
        self.pipeline = None

    def visit_FunctionDef(self, node):
        try:
            if node.name == "create_pipeline":
                # Explore the located pipeline for nodes
                # and other keyword args
                kedro_pipeline_explorer = KedroPipelineExplorer()
                kedro_pipeline_explorer.visit(node)
                ...
```
- `KedroPipelineLocator` extends the `ast.NodeVisitor` class and walks through the `.py` files
- When a `create_pipeline` node is encountered, it creates an instance of `KedroPipelineExplorer` and calls `visit`
- The locator has a `pipeline` field which holds the located pipeline

```python
class KedroPipelineExplorer(ast.NodeVisitor):
    # [TODO: Current explorer only serves for 1 pipeline() function
    #  within a create_pipeline def]
    def __init__(self):
        # keeping these here for future use-case
        # when dealing with multiple `pipeline()` functions
        # within a create_pipeline def
        self.nodes: List[Node] = []
        self.inputs = None
        self.outputs = None
        self.namespace = None
        self.parameters = None
        self.tags = None

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name) and node.func.id == "pipeline":
            ...
```
`KedroPipelineExplorer` explores the `pipeline` call node. For example, consider the following `create_pipeline` -

```python
def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )
```
The corresponding AST -
```python
FunctionDef(
    name='create_pipeline',
    args=arguments(
        posonlyargs=[],
        args=[],
        kwonlyargs=[],
        kw_defaults=[],
        kwarg=arg(arg='kwargs'),
        defaults=[]),
    body=[
        Return(
            value=Call(
                func=Name(id='pipeline', ctx=Load()),
                args=[
                    List(
                        elts=[
                            Call(
                                func=Name(id='node', ctx=Load()),
                                args=[],
                                keywords=[
                                    keyword(
                                        arg='func',
                                        value=Name(id='preprocess_companies', ctx=Load())),
                                    keyword(
                                        arg='inputs',
                                        value=Constant(value='companies')),
                                    keyword(
                                        arg='outputs',
                                        value=Constant(value='preprocessed_companies')),
                                    keyword(
                                        arg='name',
                                        value=Constant(value='preprocess_companies_node'))]),
                            ...
```
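For reference, a dump in this format can be produced with the standard library (`indent` requires Python 3.9+):

```python
import ast

# print the AST of a pipeline module in the indented format shown above
with open("pipeline.py", encoding="utf-8") as f:
    tree = ast.parse(f.read())
print(ast.dump(tree, indent=4))
```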
From this AST, the explorer extracts:

- for each `node` call - the `func`, args and kwargs
- for the `pipeline` call - the `func`, args and kwargs, and the `namespace` (WIP, as a node can also have a namespace)

Challenges -
- `func`, which is of type `Callable` - since we are dealing with strings, it is hard to compile the function without installing any dependency. The workaround in this PR dynamically creates a function with the name stored in `func_name` using `exec`. The `exec` function executes the string as Python code in the context of the `globals()` dictionary. This means that a new function named `func_name` with arbitrary arguments (`*args` and `**kwargs`) and an empty body (`pass`) is created in the global scope.
```python
if keyword.arg == "func":
    func_name = keyword.value.id
    exec(f"def {func_name}(*args, **kwargs): pass", globals())
    node_func = globals()[func_name]
```
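An `exec`-free alternative would be a factory that returns a renamed placeholder. This is only a sketch, assuming downstream code reads nothing but the function's `__name__`:

```python
def make_placeholder(func_name: str):
    """Create a no-op function whose __name__ matches the parsed name."""
    def _placeholder(*args, **kwargs):
        pass

    _placeholder.__name__ = func_name
    return _placeholder


# equivalent to the exec-based workaround above, without touching globals()
node_func = make_placeholder(func_name)
```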
The values of args/kwargs of a node/pipeline can contain complex types like variable references, comprehensions, formatted strings, joined strings, etc., as shown below (demo_project -> modelling -> pipeline.py):
```python
def create_pipeline(model_types: List[str]) -> Pipeline:
    test_train_refs = ["X_train", "X_test", "y_train", "y_test"]

    # Split the model_input data
    split_stage_pipeline = pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:split_options"],
                outputs=test_train_refs,  # variable reference
            )
        ]
    )

    # Instantiate a new modeling pipeline for every model type
    model_pipelines = [
        pipeline(
            pipe=new_train_eval_template(),  # function call
            parameters={"dummy_model_options": f"model_options.{model_type}"},  # formatted strings
            inputs={k: k for k in test_train_refs},  # comprehension + variable reference
            namespace=model_type,
        )
        for model_type in model_types
    ]

    # Combine modeling pipelines into one pipeline object
    all_modeling_pipelines = sum(model_pipelines)

    # Namespace consolidated modeling pipelines
    consolidated_model_pipelines = pipeline(
        pipe=all_modeling_pipelines,
        namespace="train_evaluation",
        inputs=test_train_refs,  # variable reference
    )

    # Combine split and modeling stages into one pipeline
    complete_model_pipeline = split_stage_pipeline + consolidated_model_pipelines
    return complete_model_pipeline
```
The solution for these complex types needs to be explored. This PR contains the helper functions `parse_value`, `evaluate_ast_node` and `extract_variables_from_function` to deal with some of the complex types.
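To give a flavour of what such a helper might do, here is a hypothetical sketch (the name matches the PR's helper, but this body is an assumption): resolve constants, names and simple containers statically, and fall back to a placeholder otherwise.

```python
import ast
from typing import Any, Dict


def evaluate_ast_node(node: ast.AST, variables: Dict[str, Any]) -> Any:
    """Best-effort static evaluation of an AST expression node."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name) and node.id in variables:
        # variable reference resolved against previously collected assignments
        return variables[node.id]
    if isinstance(node, ast.List):
        return [evaluate_ast_node(elt, variables) for elt in node.elts]
    if isinstance(node, ast.JoinedStr):
        # f-string: join literal parts, keep a placeholder for expressions
        return "".join(
            part.value if isinstance(part, ast.Constant) else "<expr>"
            for part in node.values
        )
    # calls, comprehensions, etc. are left unresolved
    return "<unresolved>"
```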
Multiple `create_pipeline` definitions within a `pipeline.py` need to be explored.

Next Steps:
Thanks a lot @ravi-kumar-pilla for the super detailed writeups!! 👏🏼
One question on the approach, just by quickly reading your first comment:

> - Mock the dependencies in case of an import error

Have you considered unconditionally mocking the dependencies? So that `--lite` is always faster than the regular version, whether the user has the dependencies or not.
> One question on the approach, just by quickly reading your first comment:
>
> > - Mock the dependencies in case of an import error
>
> Have you considered unconditionally mocking the dependencies? So that `--lite` is always faster than the regular version, whether the user has the dependencies or not.
Good question @astrojuanlu. Yes, I did consider that, but I was thinking of partial environments where users have some dependencies resolved. We can mock everything if we as a team decide that the `--lite` version should do so. One challenge would be to not mock pipelines that are imported from relative or other projects. I personally feel we can mock only the missing imports. Thank you
> One challenge would be to not mock pipelines that are imported from relative or other projects.
Yeah, good point...
My concern here is that, if I understand correctly, users with fully working environments won't see any differences between `--lite` and regular, am I right?
> My concern here is that, if I understand correctly, users with fully working environments won't see any differences between `--lite` and regular, am I right?
Yes, that is correct. Yesterday Rashida and I had a similar discussion and wondered if we should make this the default Viz implementation (i.e., making `kedro viz` do what `kedro viz --lite` does). One drawback with this would be that, if the user is aware that they have all the imports, the AST parsing might take additional time for huge projects, which may not be needed in normal operation. Thank you
Maybe we should implement both modes then and try to get users to test both.
In any case, I understand the difficulty of the "litest" mode, so it can be left for a future PR. On the other hand, we shouldn't change the default mode just yet, and should make `--lite` opt-in for now until we're confident that it helps.
Please re-request me when this is ready to be reviewed.
I managed to fix the build issues and ran a quick test on it:
Moving some conversation here for records:
Since demo_project (any Kedro project on which we run `kedro viz --lite`) is not available in the Python environment, all absolute imports of the package like `demo_project.pipelines.reporting.nodes` will fail with `No module named 'demo_project'` when using `importlib.util.find_spec(module_name)`. To resolve this, I tried doing `bootstrap_project(project_path)`. This helped in getting demo_project recognized as a module.
The important thing here is making sure the package is in `sys.path`, which is what `bootstrap_project` does. `bootstrap_project` will, however, load the project via `configure_project`. I suspect this is the reason why `sklearn` is imported.
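A sketch of doing just the `sys.path` part without triggering `configure_project` (assuming the default `<project>/src` layout; `project_path` here stands for the project root):

```python
import sys
from pathlib import Path

# make the project package importable without importing its settings/pipelines
project_path = Path.cwd()  # or the path passed to kedro viz
src_dir = project_path / "src"
if str(src_dir) not in sys.path:
    sys.path.insert(0, str(src_dir))
```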
Is this intended? Are you only mocking in case there are missing dependencies now? (There was some discussion in the TD before, so I am not sure what's the latest design.) I think this is fine but just want to confirm if it's the expected behavior.
Overall I think this is very close already! I found a way to break it.

In `nodes.py` I add:

```python
from ...pkl import abc  # doesn't exist
```

Then I get an error:

```
ModuleNotFoundError: No module named 'demo_project.pkl'
```
This is not a common case so maybe it's not necessary to fix it within the PR. I think the strategy here is:

- first check if `demo_project.pkl` exists
- if it exists, mock `demo_project.pkl.abc`; if not, mock both `demo_project.pkl` and `demo_project.pkl.abc`
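A sketch of that check (module names are from the example above; the helper name is hypothetical):

```python
import importlib.util
import sys
from unittest.mock import MagicMock


def mock_unresolved_relative_import(parent: str, target: str) -> None:
    """Mock `from <parent> import <target>` when it cannot be resolved."""
    try:
        parent_exists = importlib.util.find_spec(parent) is not None
    except ModuleNotFoundError:
        parent_exists = False
    if not parent_exists:
        # e.g. demo_project.pkl itself is missing
        sys.modules[parent] = MagicMock()
    # mock the imported target either way, e.g. demo_project.pkl.abc
    sys.modules[f"{parent}.{target}"] = MagicMock()


# usage for the example above:
# mock_unresolved_relative_import("demo_project.pkl", "abc")
```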
> Is this intended? Are you only mocking in case there are missing dependencies now? (There was some discussion in the TD before, so I am not sure what's the latest design.) I think this is fine but just want to confirm if it's the expected behavior.

Yes, we are mocking only missing external dependencies
> Overall I think this is very close already! I found a way to break it.
>
> In `nodes.py` I add:
>
> ```python
> from ...pkl import abc  # doesn't exist
> ```
>
> Then I get an error:
>
> ```
> ModuleNotFoundError: No module named 'demo_project.pkl'
> ```
>
> This is not a common case so maybe it's not necessary to fix it within the PR. I think the strategy here is:
>
> - first check if `demo_project.pkl` exists
> - if it exists, mock `demo_project.pkl.abc`; if not, mock both `demo_project.pkl` and `demo_project.pkl.abc`
Yes, I tried mocking relative imports by converting them to absolute and did try different variations. You can find some here. So there are a few ways we can handle relative imports -
The issue with the approach of importing the module (if you remember from our conversation) is what I mentioned -

> Now if the `demo_project.pipelines.reporting.nodes` file has an import statement like `import seaborn as sn` and `seaborn` is not available, the module `demo_project.pipelines.reporting.nodes` also gets mocked as it is not importable (which was happening this morning).

So if we try to check for relative imports, there is a chance we mock the relative import just because there is an unresolved external dependency in the relative module. There are workarounds for this (like keeping track of the files which have mocked absolute imports, etc.).
What can we do -
Thank you
`_mock_missing_dependencies` and `_create_mock_imports` sound like exactly the same thing. It may be easier to understand if it's separated into two steps: one to create a collection of modules to be mocked, then simply loop through that collection and create the mocks. It probably makes this a bit easier to test as well.
Yes, I will rename and add comments for better understanding. The collection approach is what I initially did, but I felt that storing the missing dependencies and looping over them again was a two-pass process and not necessary. In the new approach, I mock the modules during the initial AST parsing.
Also, I will be adding performance metrics as I tested with different use cases -
Thank you
```
[08/23/24 16:33:59] WARNING  Kedro-Viz has mocked the following          data_loader.py:173
                             dependencies for lite-mode.
                             ['sklearn.base', 'matplotlib.pyplot',
                             'matplotlib', 'seaborn', 'PIL',
                             'sklearn.model_selection', 'sklearn',
                             'sklearn.metrics']
                             In order to get a complete experience of Viz,
                             please install the missing Kedro project
                             dependencies
```
💯
Description
Related to #1742
Kedro-Viz has lots of heavy dependencies. At the same time, it needs to import the pipeline code to be able to function, even when doing an initial export with `--save-file`. This means that sometimes using Kedro-Viz is difficult or impossible if Viz dependencies clash with the project dependencies, which can happen often.
One example of that has been the push for Pydantic v2 support https://github.com/kedro-org/kedro-viz/issues/1603.
Another example, @inigohidalgo says "due to the heavy deps from viz i usually have my dev venv but I create another one just for viz where i just install viz over whatever project I have installed, overriding the project's dependencies with viz's" and asks "do you know if anybody has tested using kedro viz as an "app", so installing it through pipx or smth similar? is that even possible with how viz works?". https://linen-slack.kedro.org/t/16380121/question-regarding-kedro-viz-why-is-there-a-restriction-on-p#38213e99-ba9d-4b60-9001-c0add0e2555b
The acceptance criterion for this is simple - as a user, I shouldn't need a full Spark installation to view Kedro-Viz for a project which uses Spark to process data.
Development notes
Added an option `--lite` in Kedro-Viz CLI. When users execute the command `kedro viz --lite`, it takes the approach mentioned below -

Using AST + Mock Imports:
Steps:
Testing:
I have tested basic Kedro projects with `spark.driver.host` configured to localhost. The idea was to test Spark initialization both via hooks and outside of hooks.

Observations:
On macOS Sonoma (2.4 GHz 8-Core Intel i9, 64GB) - these observations might differ as my system was a bit slow while doing the tests, but this should give a basic idea of performance. All the tests were run using `time <command>`. To summarize, `kedro viz --lite` was faster (~10-15 sec); though `get_mocked_modules()` took ~1-2 sec, initializing `DataCatalogLite` instead of `DataCatalog` saved time.

Limitations:
The metadata panel for a data node shows the data node type as `MemoryDataset` if the dataset is not resolved.
Next Steps:
`--lite` flag. Once this PR is merged and we have the above tasks complete, I will demo this feature in the Coffee chat (Sep 1st or 2nd week).

QA notes
Steps to test -
```
conda create -n viz-parser-test python=3.11
conda activate viz-parser-test
pip install kedro
kedro new --starter=spaceflights-pandas
cd spaceflights-pandas
kedro viz
kedro viz --lite
```
Credits:
Checklist
- `RELEASE.md` file