Closed astrojuanlu closed 1 month ago
Tangentially related: https://openlineage.io/ (as a means to export Kedro pipelines)
I've seen Openlineage in a few issues, but is it related to this? From what I understand it's more about understanding the lineage between systems, how data flows from different databases/table to downstream application etc.
I think some of the concepts in this ticket are relevant too https://github.com/kedro-org/kedro-viz/issues/1459
The acceptance criteria for this is simple - As a user I shouldn't need a full Spark installation to view Kedro-Viz for a project which uses Spark to process data.
Hi @datajoely ,
I started looking at the issue and I am pretty new to the Spark environment. I tried testing the Kedro starter project spaceflights-pyspark-viz which uses kedro-datasets -> spark.SparkDataset .
For this project, the minimum steps required to get kedro viz up were -
- KedroSession.create() -> validate_settings()
- kedro viz run
I know starter project might not give me the full picture of the issue. It would be great if we can connect or you can point me to any kedro project which uses full Spark installation to process data.
Thank you
Hi @astrojuanlu,
Regarding this ticket: building the DAG without importing the code needs a significant refactor, as we depend heavily on the Kedro session to load data. I would like to take this in 3 steps -
I have a few questions regarding the Kedro session -
- ... kedro viz run: since most of the time we do not intend to change any parameters but just get some data about the Kedro project, does the Kedro session still need to be created by the plugins?
- "plugins may request information regarding kedro project by creating a session". Is there a way to get the project details (pipelines, nodes, etc.) without actually creating a Kedro session?
Thank you
Great work Ravi - to articulate my point a bit better:
... hooks.py which initialises the JVM and creates a Spark session singleton to use going forward.
Quick answers:
- Not sure what a source file is @ravi-kumar-pilla , could you clarify? Is it something like a kedro export that is then read by Kedro-Viz?
@astrojuanlu , Yes. At this moment, we need to know the information regarding pipelines, which is only possible by having all the Kedro project dependencies resolved, i.e.,
We use the _ProjectPipelines class -> find_pipelines(), which calls importlib.import_module(pipeline_module_name). The import fails if any Kedro project dependency is not resolved. If there is any way to extract the pipeline information, it would be great.
I am trying to use the ast module to extract the information without resolving dependencies [WIP]. Happy to hear any alternatives.
Thank you
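To illustrate the failure mode described above, here is a minimal, self-contained sketch. It does not use Kedro at all: the module name demo_pipeline, the missing package name and the helper try_import are made up for illustration. It only shows that importlib.import_module executes the whole module, so a single unresolved top-level import is enough to make the pipeline unreachable.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical pipeline module that depends on a package that is not installed.
PIPELINE_SOURCE = '''
import some_heavy_dependency_that_is_missing  # stands in for e.g. pyspark

def create_pipeline(**kwargs):
    return ["node_a", "node_b"]
'''


def try_import(module_name: str, search_path: str) -> str:
    """Import a module by name and report whether the import succeeded."""
    sys.path.insert(0, search_path)
    importlib.invalidate_caches()
    try:
        importlib.import_module(module_name)
        return "imported"
    except ModuleNotFoundError as exc:
        # exc.name holds the name of the module that could not be found
        return f"failed: missing module {exc.name!r}"
    finally:
        sys.path.remove(search_path)
        sys.modules.pop(module_name, None)


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_pipeline.py").write_text(PIPELINE_SOURCE)
    result = try_import("demo_pipeline", tmp)

print(result)
```

The create_pipeline function itself never touches the missing package, yet it cannot be reached through a regular import.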
I think this is the right approach - I know @imdoroshenko has had success with the libcst library too
One further point - I think this sessionless pipeline construction should live in kedro core longer term rather than just in Viz, lots of uses for other purposes.
"Is there a way to get the project details like (pipelines, nodes etc) without actually creating a kedro session ?" paging @noklam
@astrojuanlu https://github.com/noklam/kedro-viz-lite, glad you asked. I'd love to see kedro-viz become more lightweight. I attempted to make it work in a notebook before (I forget whether I ended up succeeding, but it still required a session). Interesting, I just saw that #1459 exists.
- pipelines (and nodes) are easy to get, you don't need a session thanks to from kedro.framework.project import pipelines (I know you have an opinion about this @astrojuanlu :P)
- catalog and OmegaConfConfigLoader need session and settings. Sure, you can construct them manually, but there is not much point in it: we already have an option to skip hooks, so session isn't an overhead here if I understand correctly.
My opinion: hooks should be disabled by default, unless there is a reason hooks are necessary to run kedro-viz? I suggested this to be the default, but it ended up being implemented as an additional flag and off by default.
The parsing approach is interesting and I'd love to learn more, though I don't think working with the ast library directly is the correct approach. I see this is still in Backlog, did we start working on this already?
Not sure what a source file is @ravi-kumar-pilla , could you clarify? Is it something like a kedro export that is then read by Kedro-Viz?
This is basically kedro viz --to-json: you need to install the dependencies to "export" the pipeline, but you don't need the library requirements to run kedro viz. This works already.
Yeah, perhaps AST isn't needed - the actual pipeline objects are valid Python without the context, catalog etc. being initialized yet. So yes, all you need is the results of find_pipelines() to bootstrap viz, and maybe the complex stuff can load asynchronously when ready.
I'd love to imagine a future where the kedro viz run --autoreload functionality is instant; this would help us get there.
My opinion: hooks should be disabled by default, unless there is a reason hooks are necessary to run kedro-viz? I suggested this to be the default, but it ended up being implemented as an additional flag and off by default.
I am also of the same opinion. Can the default be no hooks, with an additional flag to turn hooks on? I understand this will be a breaking change, but the number of users who use dynamic pipelines is probably small, and for them enabling this would simply mean adding --include-hooks
Sure we can ignore hooks by default if it only affects fewer users. Let me create a ticket. Thanks !
Documenting some discussion I had before:
Approach 1 - exporting the pipeline:
- ... pipelines object is available it's easy for kedro-viz to visualise the DAG
Approach 2 - Problem with ast:
- If the module is not "importable", then you won't have an ast
- importlib only imports the module as a whole and you cannot inject logic in between. (Think of repeatedly hitting Ctrl + Enter in a notebook whether or not there are errors.)
Approach 3 - Parser Approach:
- ... import_module_functions that acts like import_module except it ignores all other import statements and assignments.
- ... find_pipelines or something similar to import with the import_module_functions.
I am quite confident that approach 3 will work, but the effort won't be small (maybe 2 weeks for a prototype?). I have a small PoC with the parser, but there is limited time that I can commit outside of work for this. I'd love to work on this if this gets prioritised, but LSP is my first priority after review :P.
p.s. (what I am saying is assign a 13 point estimate and put me on the ticket in the next two/three months.)
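For readers trying to picture Approach 3, here is a rough sketch of what a helper like import_module_functions could look like. This is an assumption about the idea, not the PoC mentioned above: it takes source text instead of a module name, keeps only top-level function definitions, and drops imports, assignments and every other top-level statement before executing anything. The sample source and module names are made up.

```python
import ast
import types


def import_module_functions(source: str, module_name: str = "sketch") -> types.ModuleType:
    """Build a module that contains only the top-level functions of `source`.

    Hypothetical helper: top-level imports, assignments and any other
    statements are dropped, so nothing outside the function bodies runs.
    """
    tree = ast.parse(source)
    tree.body = [
        stmt
        for stmt in tree.body
        if isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    module = types.ModuleType(module_name)
    exec(compile(tree, filename=f"<{module_name}>", mode="exec"), module.__dict__)
    return module


# Made-up module source: the import would fail if executed normally.
SOURCE = '''
import pyspark_is_not_installed_here
SETTING = pyspark_is_not_installed_here.something()

def create_pipeline(**kwargs):
    return [("preprocess", "companies", "preprocessed_companies")]
'''

module = import_module_functions(SOURCE)
print(module.create_pipeline())
print(hasattr(module, "SETTING"))
```

A real create_pipeline usually calls names such as pipeline and node that come from the dropped imports, so this sketch alone is not enough: the caller would still have to decide which imports to keep or to replace with stand-ins.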
I think this is the right approach - I know @imdoroshenko has had success with the libcst library too
I don't think we need a Concrete Syntax Tree for this, since we don't need to retain comments or formatting. An Abstract Syntax Tree should in theory suffice, or am I missing something?
Problem with ast: If the module is not "importable", then you won't have a ast
I'm confused. Doesn't ast.parse take a string? One doesn't need to import the module itself.
For example:
In [4]: import test_parser.pipelines.data_processing
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[4], line 1
----> 1 import test_parser.pipelines.data_processing
File ~/Projects/QuantumBlackLabs/tmp/test-parser/src/test_parser/pipelines/data_processing/__init__.py:3
1 """Complete Data Processing pipeline for the spaceflights tutorial"""
----> 3 from .pipeline import create_pipeline # NOQA
File ~/Projects/QuantumBlackLabs/tmp/test-parser/src/test_parser/pipelines/data_processing/pipeline.py:1
----> 1 from kedro.pipeline import Pipeline, node, pipeline
3 from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles
6 def create_pipeline(**kwargs) -> Pipeline:
ModuleNotFoundError: No module named 'kedro'
In [5]:
Do you really want to exit ([y]/n)? ^D
❯ python -m ast src/test_parser/pipelines/data_processing/pipeline.py
Module(
   body=[
      ImportFrom(
         module='kedro.pipeline',
         names=[
            alias(name='Pipeline'),
            alias(name='node'),
            alias(name='pipeline')],
         level=0),
      ImportFrom(
         module='nodes',
         names=[
            alias(name='create_model_input_table'),
            alias(name='preprocess_companies'),
            alias(name='preprocess_shuttles')],
         level=1),
      FunctionDef(
         name='create_pipeline',
         args=arguments(
            posonlyargs=[],
            args=[],
            kwonlyargs=[],
            kw_defaults=[],
            kwarg=arg(arg='kwargs'),
...
❯ ipython
In [1]: import ast
In [2]: with open("src/test_parser/pipelines/data_processing/pipeline.py") as fh:
   ...:     tree = ast.parse(fh.read())
   ...:
In [9]: pipeline_func_nodes = []
   ...:
   ...: class PipelineLocator(ast.NodeVisitor):
   ...:     def visit_FunctionDef(self, node):
   ...:         if node.name == "create_pipeline":
   ...:             pipeline_func_nodes.append(node)
   ...:         self.generic_visit(node)
   ...:
In [10]: PipelineLocator().visit(tree)
In [11]: pipeline_func_nodes
Out[11]: [<ast.FunctionDef at 0x103a7fe20>]
In [12]: print(ast.dump(pipeline_func_nodes[0], indent=2))
FunctionDef(
  name='create_pipeline',
  args=arguments(
    posonlyargs=[],
    args=[],
    kwonlyargs=[],
    kw_defaults=[],
    kwarg=arg(arg='kwargs'),
    defaults=[]),
  body=[
    Return(
      value=Call(
        func=Name(id='pipeline', ctx=Load()),
        args=[
          List(
            elts=[
              Call(
                func=Name(id='node', ctx=Load()),
                args=[],
                keywords=[
                  keyword(
                    arg='func',
                    value=Name(id='preprocess_companies', ctx=Load())),
                  keyword(
                    arg='inputs',
                    value=Constant(value='companies')),
                  keyword(
                    arg='outputs',
                    value=Constant(value='preprocessed_companies')),
                  keyword(
                    arg='name',
                    value=Constant(value='preprocess_companies_node'))]),
              Call(
                func=Name(id='node', ctx=Load()),
                args=[],
                keywords=[
                  keyword(
                    arg='func',
                    value=Name(id='preprocess_shuttles', ctx=Load())),
                  keyword(
                    arg='inputs',
                    value=Constant(value='shuttles')),
                  keyword(
                    arg='outputs',
                    value=Constant(value='preprocessed_shuttles')),
                  keyword(
                    arg='name',
                    value=Constant(value='preprocess_shuttles_node'))]),
              Call(
                func=Name(id='node', ctx=Load()),
                args=[],
                keywords=[
                  keyword(
                    arg='func',
                    value=Name(id='create_model_input_table', ctx=Load())),
                  keyword(
                    arg='inputs',
                    value=List(
                      elts=[
                        Constant(value='preprocessed_shuttles'),
                        Constant(value='preprocessed_companies'),
                        Constant(value='reviews')],
                      ctx=Load())),
                  keyword(
                    arg='outputs',
                    value=Constant(value='model_input_table')),
                  keyword(
                    arg='name',
                    value=Constant(value='create_model_input_table_node'))])],
            ctx=Load())],
        keywords=[]))],
  decorator_list=[],
  returns=Name(id='Pipeline', ctx=Load()))
This of course is only the beginning: one then needs to keep visiting the nodes to "unwind" the pipeline definition. What happens in the create_pipeline function can be quite funky too, think of namespaced pipelines for example (incorrectly called "modular pipelines").
Long story short, a POC would be something that works for "canonical" pipeline definitions like
def create_pipeline():
    # No other variables
    return pipeline([
        node(...)  # Everything is an inline constant
    ])
How to get from this 80/20 thing to something that is more robust for real-world pipeline definitions is a big mystery.
That's why my initial proposal stated AST as an alternative solution.
Possible Implementation
One way to do it is to tell Kedro users to write their pipelines in YAML kedro-org/kedro#650, kedro-org/kedro#1963
Possible Alternatives
Another way would be to do some sort of AST scanning of the Python code, assuming that in some cases this would fail or not be accurate.
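As a minimal sketch of such AST scanning for the "canonical" case only, the snippet below walks create_pipeline and collects the keyword arguments of every node(...) call. The names NodeCallCollector and extract_nodes are made up for illustration, and the sample source is a shortened variant of the spaceflights pipeline shown in the dump above. It assumes keyword arguments with inline constants; positional arguments, loops, variables and namespaced pipelines are deliberately not handled.

```python
import ast

PIPELINE_SOURCE = '''
from kedro.pipeline import Pipeline, node, pipeline
from .nodes import create_model_input_table, preprocess_companies

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )
'''


class NodeCallCollector(ast.NodeVisitor):
    """Collect the keyword arguments of every `node(...)` call."""

    def __init__(self):
        self.nodes = []

    def visit_Call(self, call):
        if isinstance(call.func, ast.Name) and call.func.id == "node":
            info = {}
            for kw in call.keywords:
                if kw.arg is None:
                    continue  # **kwargs unpacking cannot be resolved statically
                if isinstance(kw.value, ast.Name):
                    info[kw.arg] = kw.value.id  # e.g. the function name
                else:
                    try:
                        info[kw.arg] = ast.literal_eval(kw.value)
                    except ValueError:
                        info[kw.arg] = None  # not an inline constant
            self.nodes.append(info)
        self.generic_visit(call)


def extract_nodes(source):
    """Return one dict per node(...) call found inside create_pipeline."""
    collector = NodeCallCollector()
    for stmt in ast.walk(ast.parse(source)):
        if isinstance(stmt, ast.FunctionDef) and stmt.name == "create_pipeline":
            collector.visit(stmt)
    return collector.nodes


nodes = extract_nodes(PIPELINE_SOURCE)
for entry in nodes:
    print(entry["name"], entry["inputs"], "->", entry["outputs"])
```

Nothing is imported or executed here, which is why the missing kedro package and the relative import do not matter. Anything beyond inline constants ends up as None or is skipped, which is exactly the inaccuracy mentioned above.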
Approach 3 - Parser Approach: I am quite confident the approach 3 will work, but the effort won't be small(maybe 2 weeks for a Prototype?)
I am not sure what custom parsing capabilities you're referring to but I think we should stay away from the business of parsing Python code. 2 weeks for a Prototype sounds like something that can get out of hand pretty quickly.
I'm confused. Doesn't ast.parse take a string? One doesn't need to import the module itself.
@astrojuanlu you are right about this. If we don't care about comments, docstrings etc. we can go with ast; if you need to preserve other things then maybe CST or something else.
An internal user asked about this
Is there a way to run kedro-viz without installing the kedro project dependencies? I found it very useful to use kedro-viz to navigate a kedro project pipeline when learning a new project on Day 1. But sometimes the project might have certain requirements that I don't have access to (temporarily or forever), which causes the installation to fail. This consequently causes an error when running kedro viz. Therefore, I was wondering if there's a work-around or "light" way to run kedro viz in these scenarios :slightly_smiling_face:
Is this work already started?
Hi @noklam, I wanted to start some research in this sprint with ast (https://github.com/kedro-org/kedro-viz/issues/1742#issuecomment-2036314590) but did not get time to explore. I would also like to get your thoughts on this. Let's connect next week and discuss when you are free.
Thank you
Just copying the comment I left in the discussion.
🧵 The problem statement is how to get rid of the unwanted imports. The solutions proposed are (from my understanding):
- Pure static AST - fails to address import/runtime/loop etc. (Original proposal)
- Mocking imports so it somehow ignores these import errors (Joel's)
- Refactor Kedro core to construct the pipeline as AST, and kedro-viz reads the AST instead (Deepyaman's suggestion)
- NodeTransformer (Ivan's comment - modify the AST to ignore the imports)
^ I think what's clear from the discussion is that a pure static approach has proven to be difficult and error-prone with edge cases. We cannot get rid of actually executing the code; instead we should think about "how to execute the part of the code that we are interested in" (... ast.NodeTransformer)
There is also a comment about what to mock: we don't want to mock imports that bring in pipelines from other modules, or constants that construct pipelines dynamically. (Question is: how do we know which ones are important? Is there a way to identify them?)
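One possible shape of the mocking idea, as a sketch under stated assumptions rather than a description of what was eventually merged: use ast to list the absolute imports in a source file, check which top-level packages cannot be found with importlib.util.find_spec, and temporarily register unittest.mock.MagicMock objects for them in sys.modules while the code is executed. The helper names missing_modules and exec_with_mocked_imports, and the package names in the sample source, are made up.

```python
import ast
import importlib.util
import sys
from unittest.mock import MagicMock, patch


def missing_modules(source: str) -> set:
    """Return dotted names (with all parent prefixes) of unresolvable imports.

    Simplification: only the top-level package is checked, and relative
    imports are ignored.
    """
    missing = set()
    for stmt in ast.walk(ast.parse(source)):
        if isinstance(stmt, ast.Import):
            names = [alias.name for alias in stmt.names]
        elif isinstance(stmt, ast.ImportFrom) and stmt.level == 0 and stmt.module:
            names = [stmt.module]
        else:
            continue
        for name in names:
            parts = name.split(".")
            root = parts[0]
            if root in sys.modules or importlib.util.find_spec(root) is not None:
                continue  # importable, leave it alone
            # Register "a", "a.b", "a.b.c" so dotted imports resolve too.
            missing.update(".".join(parts[: i + 1]) for i in range(len(parts)))
    return missing


def exec_with_mocked_imports(source: str) -> dict:
    """Execute `source` with every unresolvable import replaced by a MagicMock."""
    mocks = {name: MagicMock(name=name) for name in missing_modules(source)}
    namespace = {}
    with patch.dict(sys.modules, mocks):  # sys.modules is restored afterwards
        exec(compile(source, "<sketch>", "exec"), namespace)
    return namespace


# Made-up module source with two dependencies that are not installed.
SOURCE = '''
import not_installed_spark.sql as sql
from not_installed_ml.models import Trainer

def create_pipeline(**kwargs):
    return ["preprocess_node", "train_node"]
'''

namespace = exec_with_mocked_imports(SOURCE)
print(namespace["create_pipeline"]())
```

This mocks everything that is missing, so it does not answer the question above of which imports are important: a mocked import that was supposed to provide a pipeline or a constant would silently yield a MagicMock instead of failing.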
@sbrugman also brought up a good point: kedro-viz in CI/CD would benefit a lot from lightweight dependencies, without the full project dependencies.
Closed in #1966 !
Description
kedro-viz has lots of heavy dependencies. At the same time, it needs to import the pipeline code to be able to function, even when doing an initial export with --save-file. This means that sometimes using Kedro Viz is difficult or impossible if Viz dependencies clash with the project dependencies, which can happen often.
One outstanding example of that has been the push for Pydantic v2 support #1603.
Another example, @inigohidalgo says "due to the heavy deps from viz i usually have my dev venv but I create another one just for viz where i just install viz over whatever project I have installed, overriding the project's dependencies with viz's" and asks "do you know if anybody has tested using kedro viz as an "app", so installing it through pipx or smth similar? is that even possible with how viz works?". https://linen-slack.kedro.org/t/16380121/question-regarding-kedro-viz-why-is-there-a-restriction-on-p#38213e99-ba9d-4b60-9001-c0add0e2555b
Possible Implementation
One way to do it is to tell Kedro users to write their pipelines in YAML https://github.com/kedro-org/kedro/issues/650, https://github.com/kedro-org/kedro/issues/1963
Possible Alternatives
Another way would be to do some sort of AST scanning of the Python code, assuming that in some cases this would fail or not be accurate.
Yet another way would be to extract the minimal amount of code that does the --save-file and decouple it from the web application that serves it with --load-file.
There are possibly other alternatives.