kedro-org / kedro-viz

Visualise your Kedro data and machine-learning pipelines and track your experiments.
https://demo.kedro.org
Apache License 2.0

Build DAG without importing the code #1742

Open astrojuanlu opened 5 months ago

astrojuanlu commented 5 months ago

Description

kedro-viz has lots of heavy dependencies. At the same time, it needs to import the pipeline code to be able to function, even when doing an initial export with --save-file. This means that sometimes using Kedro Viz is difficult or impossible if Viz dependencies clash with the project dependencies, which can happen often.

One outstanding example of that has been the push for Pydantic v2 support #1603.

Another example, @inigohidalgo says "due to the heavy deps from viz i usually have my dev venv but I create another one just for viz where i just install viz over whatever project I have installed, overriding the project's dependencies with viz's" and asks "do you know if anybody has tested using kedro viz as an "app", so installing it through pipx or smth similar? is that even possible with how viz works?". https://linen-slack.kedro.org/t/16380121/question-regarding-kedro-viz-why-is-there-a-restriction-on-p#38213e99-ba9d-4b60-9001-c0add0e2555b

Possible Implementation

One way to do it is to tell Kedro users to write their pipelines in YAML https://github.com/kedro-org/kedro/issues/650, https://github.com/kedro-org/kedro/issues/1963

Possible Alternatives

Another way would be to do some sort of AST scanning of the Python code, assuming that in some cases this would fail or not be accurate.

Yet another way would be to extract the minimal amount of code that does the --save-file and decouple it from the web application that serves it with --load-file.

There are possibly other alternatives.


astrojuanlu commented 5 months ago

Tangentially related: https://openlineage.io/ (as a means to export Kedro pipelines)

noklam commented 5 months ago

I've seen Openlineage in a few issues, but is it related to this? From what I understand it's more about understanding the lineage between systems, how data flows from different databases/table to downstream application etc.

datajoely commented 5 months ago

I think some of the concepts in this ticket are relevant too https://github.com/kedro-org/kedro-viz/issues/1459

datajoely commented 5 months ago

The acceptance criteria for this is simple - As a user I shouldn't need a full Spark installation to view Kedro-Viz for a project which uses Spark to process data.

ravi-kumar-pilla commented 3 months ago

The acceptance criteria for this is simple - As a user I shouldn't need a full Spark installation to view Kedro-Viz for a project which uses Spark to process data.

Hi @datajoely ,

I started looking at the issue and I am pretty new to the Spark environment. I tried testing the Kedro starter project spaceflights-pyspark-viz which uses kedro-datasets -> spark.SparkDataset .

For this project, the minimum steps required to get kedro viz up were -

  1. Install kedro
  2. Install kedro-viz
  3. Install starter project dependencies ( If these are not present, kedro-viz fails to create a kedro session since we eagerly check for imports in KedroSession.create() -> validate_settings() )
  4. Run command kedro viz run

I know the starter project might not give me the full picture of the issue. It would be great if we could connect, or if you could point me to a Kedro project that uses a full Spark installation to process data.

Thank you

ravi-kumar-pilla commented 3 months ago

Hi @astrojuanlu,

Building the DAG without importing the code needs a significant refactor, as we depend heavily on the Kedro session to load data. I would like to tackle this in 3 steps -

  1. Reduce the minimum requirements for kedro viz, while still dependent on kedro - https://github.com/kedro-org/kedro-viz/issues/1783
  2. Create a kedro source file for visualization (which depends on kedro session and outputs a yaml/json file). This can be some kind of cli command which outputs the file (as pointed out in the alternative solution above --save-file and --load-file)
  3. Build DAG by parsing the kedro source file ( This will just have the flowchart view without the metadata or experiment tracking ) - https://github.com/kedro-org/kedro-viz/issues/1742, https://github.com/kedro-org/kedro-viz/issues/1459
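Steps 2 and 3 could be sketched roughly as follows. This is only an illustration: the JSON layout (a top-level "nodes" list with "name", "inputs" and "outputs" keys) is a hypothetical export format, not the schema Kedro-Viz actually writes, and build_edges is a made-up helper name.

```python
import json

# Hypothetical export format: NOT the schema that Kedro-Viz writes today.
EXPORTED = """
{
  "nodes": [
    {"name": "preprocess_companies_node",
     "inputs": ["companies"],
     "outputs": ["preprocessed_companies"]},
    {"name": "create_model_input_table_node",
     "inputs": ["preprocessed_companies", "reviews"],
     "outputs": ["model_input_table"]}
  ]
}
"""


def build_edges(exported: str) -> list[tuple[str, str]]:
    """Return (source, target) edges between node names and dataset names."""
    edges = []
    for node in json.loads(exported)["nodes"]:
        for dataset in node["inputs"]:
            edges.append((dataset, node["name"]))  # dataset feeds the node
        for dataset in node["outputs"]:
            edges.append((node["name"], dataset))  # node produces the dataset
    return edges


edges = build_edges(EXPORTED)
```

Reading such a file needs only the standard library, so the viewer side would not have to import Kedro or the project code.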

I have a few questions regarding the Kedro session -

  1. While running kedro viz run, since most of the time we do not intend to change any parameters but just get some data about the Kedro project, does the Kedro session still need to be created by the plugins?
  2. I found in the Kedro docs that plugins may request information regarding the Kedro project by creating a session. Is there a way to get the project details (pipelines, nodes etc.) without actually creating a Kedro session?

Thank you

datajoely commented 3 months ago

Great work Ravi - to articulate my point a bit better:

astrojuanlu commented 3 months ago

Quick answers:

ravi-kumar-pilla commented 3 months ago
  • Not sure what a source file is @ravi-kumar-pilla , could you clarify? Is it something like a kedro export that is then read by Kedro-Viz?

@astrojuanlu , Yes. At this moment, we need to know the information regarding pipelines which is only possible by having all the kedro project dependencies resolved. i.e.,

We use _ProjectPipelines class -> find_pipelines() which has importlib.import_module(pipeline_module_name). The import fails if any Kedro project dependency is not resolved. If there is any way to extract the pipeline information, it would be great.

I am trying to use the ast module to extract the information without resolving dependencies [WIP]. Happy to hear any alternatives.
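A minimal standard-library sketch of the difference between importing and parsing is shown below. The module name demo_pipeline and the package not_installed_dependency are made up for illustration, and this is not how find_pipelines() is implemented; it only shows that import_module executes the import statements while ast.parse just reads the text.

```python
import ast
import importlib
import sys
import tempfile
from pathlib import Path

# "not_installed_dependency" is a made-up package name standing in for a
# project dependency (Spark, for example) that is missing from the environment.
SOURCE = (
    "import not_installed_dependency\n"
    "\n"
    "def create_pipeline(**kwargs):\n"
    "    return []\n"
)

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "demo_pipeline.py"
    module_path.write_text(SOURCE)
    sys.path.insert(0, tmp)
    try:
        importlib.import_module("demo_pipeline")
        import_failed = False
    except ModuleNotFoundError:
        import_failed = True  # importing runs the module's import statements
    finally:
        sys.path.remove(tmp)

    # Parsing only reads the text, so the missing dependency does not matter.
    tree = ast.parse(module_path.read_text())

function_names = [n.name for n in tree.body if isinstance(n, ast.FunctionDef)]
```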

Thank you

datajoely commented 3 months ago

I think this is the right approach - I know @imdoroshenko has had success with the libcst library too

datajoely commented 3 months ago

One further point - I think this sessionless pipeline construction should live in kedro core longer term rather than just in Viz, lots of uses for other purposes.

noklam commented 3 months ago

"Is there a way to get the project details like (pipelines, nodes etc) without actually creating a kedro session ?" paging @noklam

@astrojuanlu https://github.com/noklam/kedro-viz-lite, glad you asked. I'd love to see kedro-viz become more lightweight. I attempted to make it work in a notebook before (I forget whether I succeeded, but it still required a session). Interesting, I just saw that #1459 exists.

My opinion: hooks should be disabled by default, unless there is a reason hooks are necessary to run kedro-viz? I suggested this to be the default, but it ended up being implemented as an additional flag and off by default.

The parsing approach is interesting and love to learn more, though I don't think working with ast library directly is the correct approach. I see this is still in Backlog, did we start working on this already?

Not sure what a source file is @ravi-kumar-pilla , could you clarify? Is it something like a kedro export that is then read by Kedro-Viz?

This is basically kedro viz --to-json, you need to install the dependencies to "export" the pipeline, but you don't need the library requirements to run kedro viz. This works already.

datajoely commented 3 months ago

Yeah perhaps AST isn't needed - the actual pipeline objects are valid Python without the context, catalog etc yet initialized. So yes all you need is the results of find_pipelines() to bootstrap viz and maybe the complex stuff can asynchronously load when ready.

I'd love to imagine a future where the kedro viz run --autoreload functionality is instant, this would help us get there.

rashidakanchwala commented 3 months ago

My opinion: hooks should be disabled by default, unless there is a reason hooks are necessary to run kedro-viz? I suggested this to be the default, but it ended up being implemented as an additional flag and off by default.

I am also of the same opinion. Can the default be no hooks, with an additional flag to turn hooks on? I understand this will be a breaking change, but the number of users who use dynamic pipelines is probably small, and for them enabling this would simply mean adding --include-hooks

ravi-kumar-pilla commented 3 months ago

My opinion: hooks should be disabled by default, unless there is a reason hooks are necessary to run kedro-viz? I suggested this to be the default, but it ended up being implemented as an additional flag and off by default.

I am also of the same opinion. Can the default be no hooks, with an additional flag to turn hooks on? I understand this will be a breaking change, but the number of users who use dynamic pipelines is probably small, and for them enabling this would simply mean adding --include-hooks

Sure, we can ignore hooks by default if it only affects a few users. Let me create a ticket. Thanks!

noklam commented 3 months ago

Approach 1 - exporting the pipeline documenting some discussion I had before:

Approach 2 - Problem with ast:

Approach 3 - Parser Approach:

I am quite confident that approach 3 will work, but the effort won't be small (maybe 2 weeks for a prototype?). I have a small PoC with the parser, but there is limited time that I can commit outside of work for this. I'd love to work on this if it gets prioritised, but LSP is my first priority after review :P.

p.s. (what I am saying is assign a 13 point estimate and put me on the ticket in the next two/three months. 😆)

astrojuanlu commented 3 months ago

I think this is the right approach - I know @imdoroshenko has had success with the libcst library too

I don't think we need a Concrete Syntax Tree for this, since we don't need to retain comments or formatting. An Abstract Syntax Tree should in theory suffice, or am I missing something?

Problem with ast: If the module is not "importable", then you won't have a ast

I'm confused. Doesn't ast.parse take a string? One doesn't need to import the module itself.

For example:

In [4]: import test_parser.pipelines.data_processing
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[4], line 1
----> 1 import test_parser.pipelines.data_processing

File ~/Projects/QuantumBlackLabs/tmp/test-parser/src/test_parser/pipelines/data_processing/__init__.py:3
      1 """Complete Data Processing pipeline for the spaceflights tutorial"""
----> 3 from .pipeline import create_pipeline  # NOQA

File ~/Projects/QuantumBlackLabs/tmp/test-parser/src/test_parser/pipelines/data_processing/pipeline.py:1
----> 1 from kedro.pipeline import Pipeline, node, pipeline
      3 from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles
      6 def create_pipeline(**kwargs) -> Pipeline:

ModuleNotFoundError: No module named 'kedro'

In [5]:
Do you really want to exit ([y]/n)? ^D
~/Projects/QuantumBlackLabs/tmp/test-parser
❯ python -m ast src/test_parser/pipelines/data_processing/pipeline.py
Module(
   body=[
      ImportFrom(
         module='kedro.pipeline',
         names=[
            alias(name='Pipeline'),
            alias(name='node'),
            alias(name='pipeline')],
         level=0),
      ImportFrom(
         module='nodes',
         names=[
            alias(name='create_model_input_table'),
            alias(name='preprocess_companies'),
            alias(name='preprocess_shuttles')],
         level=1),
      FunctionDef(
         name='create_pipeline',
         args=arguments(
            posonlyargs=[],
            args=[],
            kwonlyargs=[],
            kw_defaults=[],
            kwarg=arg(arg='kwargs'),
...

๏…น ๏ผ ~/Projects/QuantumBlackLabs/tmp/test-parser ยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยท ๎œผ test-parser 08:17:4
โฏ ipython

In [1]: import ast

In [2]: with open("src/test_parser/pipelines/data_processing/pipeline.py") as fh:
   ...:     tree = ast.parse(fh.read())
   ...: 

In [9]: pipeline_func_nodes = []
   ...: 
   ...: class PipelineLocator(ast.NodeVisitor):
   ...:     def visit_FunctionDef(self, node):
   ...:         if node.name == "create_pipeline":
   ...:             pipeline_func_nodes.append(node)
   ...:         self.generic_visit(node)
   ...: 

In [10]: PipelineLocator().visit(tree)

In [11]: pipeline_func_nodes
Out[11]: [<ast.FunctionDef at 0x103a7fe20>]

In [12]: print(ast.dump(pipeline_func_nodes[0], indent=2))
FunctionDef(
  name='create_pipeline',
  args=arguments(
    posonlyargs=[],
    args=[],
    kwonlyargs=[],
    kw_defaults=[],
    kwarg=arg(arg='kwargs'),
    defaults=[]),
  body=[
    Return(
      value=Call(
        func=Name(id='pipeline', ctx=Load()),
        args=[
          List(
            elts=[
              Call(
                func=Name(id='node', ctx=Load()),
                args=[],
                keywords=[
                  keyword(
                    arg='func',
                    value=Name(id='preprocess_companies', ctx=Load())),
                  keyword(
                    arg='inputs',
                    value=Constant(value='companies')),
                  keyword(
                    arg='outputs',
                    value=Constant(value='preprocessed_companies')),
                  keyword(
                    arg='name',
                    value=Constant(value='preprocess_companies_node'))]),
              Call(
                func=Name(id='node', ctx=Load()),
                args=[],
                keywords=[
                  keyword(
                    arg='func',
                    value=Name(id='preprocess_shuttles', ctx=Load())),
                  keyword(
                    arg='inputs',
                    value=Constant(value='shuttles')),
                  keyword(
                    arg='outputs',
                    value=Constant(value='preprocessed_shuttles')),
                  keyword(
                    arg='name',
                    value=Constant(value='preprocess_shuttles_node'))]),
              Call(
                func=Name(id='node', ctx=Load()),
                args=[],
                keywords=[
                  keyword(
                    arg='func',
                    value=Name(id='create_model_input_table', ctx=Load())),
                  keyword(
                    arg='inputs',
                    value=List(
                      elts=[
                        Constant(value='preprocessed_shuttles'),
                        Constant(value='preprocessed_companies'),
                        Constant(value='reviews')],
                      ctx=Load())),
                  keyword(
                    arg='outputs',
                    value=Constant(value='model_input_table')),
                  keyword(
                    arg='name',
                    value=Constant(value='create_model_input_table_node'))])],
            ctx=Load())],
        keywords=[]))],
  decorator_list=[],
  returns=Name(id='Pipeline', ctx=Load()))

This of course is only the beginning, one then needs to keep visiting the node to "unwind" the pipeline definition. What happens in the create_pipeline function can be quite funky too, think of namespaced pipelines for example (incorrectly called "modular pipelines").

Long story short, a POC would be something that works for "canonical" pipeline definitions like

def create_pipeline():
    # No other variables
    return pipeline([
        node(...)  # All arguments are inline constants
    ])

How to get from this 80/20 thing to something that is more robust for real-world pipeline definitions is a big mystery.
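For the "canonical" case only, the unwinding could look roughly like the sketch below. It assumes that every node(...) call uses keyword arguments with inline constants; extract_nodes is a hypothetical helper, and anything that is not a literal (variables, namespaced pipelines, helper functions) is simply recorded as None rather than resolved.

```python
import ast

# The "canonical" shape: one return statement, every argument passed as a
# keyword, and all dataset names written as inline constants.
SOURCE = '''
def create_pipeline(**kwargs):
    return pipeline([
        node(func=preprocess_companies, inputs="companies",
             outputs="preprocessed_companies", name="preprocess_companies_node"),
        node(func=create_model_input_table,
             inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
             outputs="model_input_table", name="create_model_input_table_node"),
    ])
'''


def extract_nodes(source: str) -> list[dict]:
    """Collect the keyword arguments of every node(...) call in the source."""
    found = []
    for call in ast.walk(ast.parse(source)):
        is_node_call = (
            isinstance(call, ast.Call)
            and isinstance(call.func, ast.Name)
            and call.func.id == "node"
        )
        if not is_node_call:
            continue
        info = {}
        for kw in call.keywords:
            if kw.arg == "func" and isinstance(kw.value, ast.Name):
                info["func"] = kw.value.id  # keep the function name only
            elif kw.arg in {"inputs", "outputs", "name"}:
                try:
                    info[kw.arg] = ast.literal_eval(kw.value)
                except ValueError:
                    info[kw.arg] = None  # not an inline constant: give up
        found.append(info)
    return found


nodes = extract_nodes(SOURCE)
```

The sketch never executes the pipeline code, which is why it cannot follow anything computed at runtime.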

That's why my initial proposal stated AST as an alternative solution.

Possible Implementation

One way to do it is to tell Kedro users to write their pipelines in YAML kedro-org/kedro#650, kedro-org/kedro#1963

Possible Alternatives

Another way would be to do some sort of AST scanning of the Python code, assuming that in some cases this would fail or not be accurate.


Approach 3 - Parser Approach: I am quite confident the approach 3 will work, but the effort won't be small(maybe 2 weeks for a Prototype?)

I am not sure what custom parsing capabilities you're referring to but I think we should stay away from the business of parsing Python code. 2 weeks for a Prototype sounds like something that can get out of hand pretty quickly.

noklam commented 3 months ago

I'm confused. Doesn't ast.parse take a string? One doesn't need to import the module itself.

@astrojuanlu you are right about this. If we don't care about comment/docstring etc we can go with ast, if you need to preserve other things then maybe CST or something else.
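As a quick standard-library check of what ast keeps and drops (a sketch, with a made-up source string): comments are discarded by the parser, while docstrings stay in the tree as ordinary string constants and can be read with ast.get_docstring.

```python
import ast

SOURCE = '''
def create_pipeline(**kwargs):
    """Build the data processing pipeline."""
    # This comment is not part of the AST.
    return []
'''

tree = ast.parse(SOURCE)
func = tree.body[0]

docstring = ast.get_docstring(func)  # docstrings are kept as string constants
regenerated = ast.unparse(tree)      # requires Python 3.9+; comments are gone
```

So a concrete syntax tree would only be needed if the tool had to rewrite files while preserving comments and formatting.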

astrojuanlu commented 2 weeks ago

An internal user asked about this

Is there a way to run kedro-viz without installing the Kedro project's dependencies? I found it very useful to use kedro-viz to navigate a Kedro project's pipelines when learning a new project on Day 1. But sometimes the project has requirements that I don't have access to (temporarily or forever), which causes the installation to fail and, consequently, errors when running kedro viz. Therefore, I was wondering if there's a workaround or "light" way to run kedro viz in these scenarios 🙂

noklam commented 2 weeks ago

Is this work already started?

ravi-kumar-pilla commented 2 weeks ago

Is this work already started?

Hi @noklam, I wanted to start some research in this sprint with ast (https://github.com/kedro-org/kedro-viz/issues/1742#issuecomment-2036314590) but did not get time to explore. I would also like to get your thoughts on this. Let's connect next week and discuss when you are free.

Thank you