Adding previous notes here for reference -
Experiment with AST:
The core part of Kedro-Viz that needs the dependencies of a Kedro project is the loading of pipeline information.
What we do now -
```python
def load_data(
    project_path: Path,
    env: Optional[str] = None,
    include_hooks: bool = False,
    package_name: Optional[str] = None,
    extra_params: Optional[Dict[str, Any]] = None,
) -> Tuple[DataCatalog, Dict[str, Pipeline], BaseSessionStore, Dict]:
    """Load data from a Kedro project."""
    if package_name:
        configure_project(package_name)
    else:
        # bootstrap project when viz is run in dev mode
        bootstrap_project(project_path)

    with KedroSession.create(
        project_path=project_path,
        env=env,
        save_on_close=False,
        extra_params=extra_params,
    ) as session:
        # check for --include-hooks option
        if not include_hooks:
            session._hook_manager = _VizNullPluginManager()  # type: ignore

        context = session.load_context()
        session_store = session._store
        catalog = context.catalog

        # Pipelines is a lazy dict-like object, so we force it to populate here
        # in case user doesn't have an active session down the line when it's first accessed.
        # Useful for users who have `get_current_session` in their `register_pipelines()`.
        pipelines_dict = dict(pipelines)
        stats_dict = _get_dataset_stats(project_path)

    return catalog, pipelines_dict, session_store, stats_dict
```
For pipelines we have -
```python
pipelines = _ProjectPipelines()

# inside _ProjectPipelines:
@staticmethod
def _get_pipelines_registry_callable(pipelines_module: str) -> Any:
    module_obj = importlib.import_module(pipelines_module)
    register_pipelines = getattr(module_obj, "register_pipelines")
    return register_pipelines
```
Dependency issues with the above code -
- `_get_pipelines_registry_callable` imports the pipeline registry module via `importlib.import_module`, which executes all of the project's top-level imports
- `register_pipelines()` also tries to import the pipelines module by using `find_pipelines()`
Possible solutions -
From the TSC discussion on July 10, 2024, thanks to the team for suggesting an approach to mock imports. I tested this approach and it works well.
Using AST + Mock Imports :
Steps:
What does this PR do - Introduce an option `--lite` in Kedro-Viz CLI. When users try to execute the command `kedro viz --lite`, it uses the approach mentioned above.
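As a rough illustration of the AST + mock imports idea, here is a minimal sketch. The helper name `get_mocked_modules` appears later in this thread, but the body below is an assumption, not the actual Kedro-Viz implementation:

```python
import ast
import importlib.util
from pathlib import Path
from typing import Dict
from unittest.mock import MagicMock


def get_mocked_modules(project_path: Path) -> Dict[str, MagicMock]:
    """Collect absolute imports from every .py file and mock the unresolvable ones."""
    mocked_modules: Dict[str, MagicMock] = {}
    for py_file in project_path.rglob("*.py"):
        tree = ast.parse(py_file.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                names = [node.module]
            else:
                continue  # relative imports are handled separately
            for name in names:
                try:
                    missing = importlib.util.find_spec(name) is None
                except ModuleNotFoundError:
                    # parent package itself is missing
                    missing = True
                if missing:
                    mocked_modules[name] = MagicMock()
    return mocked_modules
```

The collected mocks can then be registered with `sys.modules.update(get_mocked_modules(project_path))` before the project modules are imported, so the unresolved imports succeed with `MagicMock` placeholders.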
NOTE: To re-use most of the data_loader code while creating the Kedro Session, Context and Catalog with mocked dependencies, we may need to modify `from_config` of `DataCatalog` in the Kedro framework.

Lines 306 - 311: to use the `DataCatalog` as-is in the lite version, we need the exception handler to return a `MemoryDataset` (TODO: need to see if this can be achieved via a hook on the Viz side)
```python
try:
    datasets[ds_name] = AbstractDataset.from_config(
        ds_name, ds_config, load_versions.get(ds_name), save_version
    )
except DatasetError as exc:
    datasets[ds_name] = MemoryDataset()
```
This will allow us to initialize the `DataCatalog` without any warnings (example - `WARNING Cannot find parameter feature_engineering.feature.derived in the catalog`). If this is not viable, we can create a `DataCatalog()` instance, load parameters using `parameters = conf_loader["parameters"]`, and add them to the catalog somehow -

```python
feed_dict = self._get_feed_dict()
catalog.add_feed_dict(feed_dict)
```
[UPDATE]: An update on the above issue of missing datasets due to issues parsing catalog entries. I had a discussion with Nok and decided to implement a custom `DataCatalog`, i.e. `DataCatalogLite`, which overrides `from_config` and returns a `MemoryDataset` when a `DatasetError` occurs.
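A minimal sketch of that idea, assuming the default `from_config` internals (credential resolution and versioning details are omitted; the actual `DataCatalogLite` in the PR may differ):

```python
from typing import Any, Dict, Optional

from kedro.io import AbstractDataset, DataCatalog, DatasetError, MemoryDataset


class DataCatalogLite(DataCatalog):
    """DataCatalog variant that falls back to MemoryDataset for
    catalog entries that cannot be instantiated."""

    @classmethod
    def from_config(
        cls,
        catalog: Optional[Dict[str, Dict[str, Any]]],
        credentials: Optional[Dict[str, Dict[str, Any]]] = None,
        load_versions: Optional[Dict[str, str]] = None,
        save_version: Optional[str] = None,
    ) -> "DataCatalogLite":
        datasets = {}
        load_versions = load_versions or {}
        for ds_name, ds_config in (catalog or {}).items():
            try:
                datasets[ds_name] = AbstractDataset.from_config(
                    ds_name, ds_config, load_versions.get(ds_name), save_version
                )
            except DatasetError:
                # unresolvable entry (e.g. missing dependency) -> placeholder
                datasets[ds_name] = MemoryDataset()
        return cls(datasets=datasets)
```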
What does this PR do - Introduce an option `--lite` in Kedro-Viz CLI. When users try to execute the command `kedro viz --lite`, it uses the AST parser to load pipelines.
```python
def load_data(
    project_path: Path,
    env: Optional[str] = None,
    include_hooks: bool = False,
    package_name: Optional[str] = None,
    extra_params: Optional[Dict[str, Any]] = None,
    is_lite: bool = False,
) -> Tuple[DataCatalog, Dict[str, Pipeline], BaseSessionStore, Dict]:
    """Load data from a Kedro project"""
    if is_lite:
        # [TODO: Confirm on the context creation]
        context = KedroContext(
            package_name="{{ cookiecutter.python_package }}",
            project_path=project_path,
            config_loader=OmegaConfigLoader(conf_source=str(project_path)),
            hook_manager=_VizNullPluginManager(),
            env=env,
        )
        # [TODO: Confirm on the session store creation]
        session_store = None
        # [TODO: Confirm on the DataCatalog creation]
        catalog = DataCatalog()
        stats_dict = _get_dataset_stats(project_path)
        pipelines_dict = dict(parse_project(project_path))
        return catalog, pipelines_dict, session_store, stats_dict
    else:
        ...
```
Kedro parser -

```python
def parse_project(project_path: Path) -> Dict[str, Pipeline]:
    ...
```

- Walks through all the `.py` files in the project
- For each file, parses the source into an AST and calls the `visit` method of `KedroPipelineLocator`
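A sketch of what this loop might look like, assuming the `KedroPipelineLocator` defined below (how the resulting pipelines are named and the error handling are simplified assumptions):

```python
import ast
from pathlib import Path
from typing import Dict

from kedro.pipeline import Pipeline


def parse_project(project_path: Path) -> Dict[str, Pipeline]:
    pipelines: Dict[str, Pipeline] = {}
    for py_file in project_path.rglob("*.py"):
        tree = ast.parse(py_file.read_text(encoding="utf-8"))
        locator = KedroPipelineLocator()  # defined below
        locator.visit(tree)
        if locator.pipeline is not None:
            # assumption: key by the containing package,
            # e.g. pipelines/<name>/pipeline.py -> <name>
            pipelines[py_file.parent.name] = locator.pipeline
    return pipelines
```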
```python
class KedroPipelineLocator(ast.NodeVisitor):
    """
    Represents a pipeline that is located when parsing
    the Kedro project's `create_pipeline` function
    """

    def __init__(self):
        self.pipeline = None

    def visit_FunctionDef(self, node):
        try:
            if node.name == "create_pipeline":
                # Explore the located pipeline for nodes
                # and other keyword args
                kedro_pipeline_explorer = KedroPipelineExplorer()
                kedro_pipeline_explorer.visit(node)
                ...
```
- `KedroPipelineLocator` extends the `ast.NodeVisitor` class and walks through the `.py` files
- When a `create_pipeline` node is encountered, it creates an instance of `KedroPipelineExplorer` and calls `visit`
- The locator has a `pipeline` field which holds the located pipeline

```python
class KedroPipelineExplorer(ast.NodeVisitor):
    # [TODO: Current explorer only serves for 1 pipeline() function
    #  within a create_pipeline def]
    def __init__(self):
        # keeping these here for future use-case
        # when dealing with multiple `pipeline()` functions
        # within a create_pipeline def
        self.nodes: List[Node] = []
        self.inputs = None
        self.outputs = None
        self.namespace = None
        self.parameters = None
        self.tags = None

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name) and node.func.id == "pipeline":
            ...
```
`KedroPipelineExplorer` explores the `pipeline` call node. For example, consider the following `create_pipeline` -

```python
def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )
```
The corresponding AST -
```python
FunctionDef(
    name='create_pipeline',
    args=arguments(
        posonlyargs=[],
        args=[],
        kwonlyargs=[],
        kw_defaults=[],
        kwarg=arg(arg='kwargs'),
        defaults=[]),
    body=[
        Return(
            value=Call(
                func=Name(id='pipeline', ctx=Load()),
                args=[
                    List(
                        elts=[
                            Call(
                                func=Name(id='node', ctx=Load()),
                                args=[],
                                keywords=[
                                    keyword(
                                        arg='func',
                                        value=Name(id='preprocess_companies', ctx=Load())),
                                    keyword(
                                        arg='inputs',
                                        value=Constant(value='companies')),
                                    keyword(
                                        arg='outputs',
                                        value=Constant(value='preprocessed_companies')),
                                    keyword(
                                        arg='name',
                                        value=Constant(value='preprocess_companies_node'))]),
                            ...
```
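For reference, a dump in this format can be produced with the standard library (`indent` requires Python 3.9+):

```python
import ast

# print the AST of a pipeline module in the indented format shown above
with open("pipeline.py", encoding="utf-8") as f:
    tree = ast.parse(f.read())
print(ast.dump(tree, indent=4))
```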
From this AST, the explorer extracts:

- for each `node` call - the `func`, args and kwargs
- for the `pipeline` call - the `func`, args and kwargs, and the `namespace` (WIP, as a node can also have a namespace)

Challenges -
- `func`, which is of type `Callable` - since we are dealing with strings, it is hard to compile the function without installing any dependency. The workaround in this PR dynamically creates a function with the name stored in `func_name` using `exec`. The `exec` function executes the string as Python code in the context of the `globals()` dictionary. This means that a new function named `func_name` with arbitrary arguments (`*args` and `**kwargs`) and an empty body (`pass`) is created in the global scope.
```python
if keyword.arg == "func":
    func_name = keyword.value.id
    exec(f"def {func_name}(*args, **kwargs): pass", globals())
    node_func = globals()[func_name]
```
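An `exec`-free alternative would be a factory that returns a renamed placeholder. This is only a sketch, assuming downstream code reads nothing but the function's `__name__`:

```python
def make_placeholder(func_name: str):
    """Create a no-op function whose __name__ matches the parsed name."""
    def _placeholder(*args, **kwargs):
        pass

    _placeholder.__name__ = func_name
    return _placeholder


# equivalent to the exec-based workaround above, without touching globals()
node_func = make_placeholder(func_name)
```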
The values of args/kwargs of a node/pipeline can contain complex types like variable references, comprehensions, formatted strings, joined strings, etc., as shown below (demo_project -> modelling -> pipeline.py):
```python
def create_pipeline(model_types: List[str]) -> Pipeline:
    test_train_refs = ["X_train", "X_test", "y_train", "y_test"]

    # Split the model_input data
    split_stage_pipeline = pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:split_options"],
                outputs=test_train_refs,  # variable reference
            )
        ]
    )

    # Instantiate a new modeling pipeline for every model type
    model_pipelines = [
        pipeline(
            pipe=new_train_eval_template(),  # function call
            parameters={"dummy_model_options": f"model_options.{model_type}"},  # formatted strings
            inputs={k: k for k in test_train_refs},  # comprehension + variable reference
            namespace=model_type,
        )
        for model_type in model_types
    ]

    # Combine modeling pipelines into one pipeline object
    all_modeling_pipelines = sum(model_pipelines)

    # Namespace consolidated modeling pipelines
    consolidated_model_pipelines = pipeline(
        pipe=all_modeling_pipelines,
        namespace="train_evaluation",
        inputs=test_train_refs,  # variable reference
    )

    # Combine split and modeling stages into one pipeline
    complete_model_pipeline = split_stage_pipeline + consolidated_model_pipelines
    return complete_model_pipeline
```
The solution for these complex types needs to be explored. This PR contains the helper functions `parse_value`, `evaluate_ast_node` and `extract_variables_from_function` to deal with some of the complex types.
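To give a flavour of what such a helper might do, here is a hypothetical sketch (the name matches the PR's helper, but this body is an assumption): resolve constants, names and simple containers statically, and fall back to a placeholder otherwise.

```python
import ast
from typing import Any, Dict


def evaluate_ast_node(node: ast.AST, variables: Dict[str, Any]) -> Any:
    """Best-effort static evaluation of an AST expression node."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name) and node.id in variables:
        # variable reference resolved against previously collected assignments
        return variables[node.id]
    if isinstance(node, ast.List):
        return [evaluate_ast_node(elt, variables) for elt in node.elts]
    if isinstance(node, ast.JoinedStr):
        # f-string: join literal parts, keep a placeholder for expressions
        return "".join(
            part.value if isinstance(part, ast.Constant) else "<expr>"
            for part in node.values
        )
    # calls, comprehensions, etc. are left unresolved
    return "<unresolved>"
```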
Multiple `create_pipeline` definitions within a `pipeline.py` need to be explored.

Next Steps:
Thanks a lot @ravi-kumar-pilla for the super detailed writeups!! 👏🏼
One question on the approach, just by quickly reading your first comment:

> - Mock the dependencies in case of an import error

Have you considered unconditionally mocking the dependencies? So that `--lite` is always faster than the regular version, whether the user has the dependencies or not.
> One question on the approach, just by quickly reading your first comment:
>
> > - Mock the dependencies in case of an import error
>
> Have you considered unconditionally mocking the dependencies? So that `--lite` is always faster than the regular version, whether the user has the dependencies or not.
Good question @astrojuanlu. Yes, I did consider that, but I was thinking of partial environments where users have some dependencies resolved. We can mock everything if we as a team decide that the `--lite` version should do so. One challenge would be to not mock pipelines that are imported from relative or other projects. I personally feel we can mock only the missing imports. Thank you
> One challenge would be to not mock pipelines that are imported from relative or other projects.
Yeah, good point...
My concern here is that, if I understand correctly, users with fully working environments won't see any differences between `--lite` and regular, am I right?
> My concern here is that, if I understand correctly, users with fully working environments won't see any differences between `--lite` and regular, am I right?
Yes, that is correct. Yesterday Rashida and I had a similar discussion and wondered if we should make this the default Viz implementation (i.e., making `kedro viz` do what `kedro viz --lite` does). One drawback with this would be that, if the user is aware that they have all the imports, the AST parsing might take additional time for huge projects, which may not be needed in normal operation. Thank you
Maybe we should implement both modes then and try to get users to test both.
In any case, I understand the difficulty of the "litest" mode, so it can be left for a future PR. On the other hand, we shouldn't change the default mode just yet, and should make `--lite` opt-in for now until we're confident that it helps.
Please re-request me when this is ready to be reviewed.
I managed to fix the build issues and ran a quick test on it:
Moving some conversation here for records:
Since demo_project (any Kedro project on which we run `kedro viz --lite`) is not available in the Python environment, all absolute imports of the package like `demo_project.pipelines.reporting.nodes` will fail with `No module named 'demo_project'` when using `importlib.util.find_spec(module_name)`. To resolve this, I tried doing `bootstrap_project(project_path)`. This helped in getting demo_project recognized as a module.
The important thing here is making sure the package is in `sys.path`, which is what `bootstrap_project` does. `bootstrap_project` will, however, load the project via `configure_project`. I suspect this is the reason why `sklearn` is imported.
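A sketch of doing just the `sys.path` part without triggering `configure_project` (assuming the default `<project>/src` layout; `project_path` here stands for the project root):

```python
import sys
from pathlib import Path

# make the project package importable without importing its settings/pipelines
project_path = Path.cwd()  # or the path passed to kedro viz
src_dir = project_path / "src"
if str(src_dir) not in sys.path:
    sys.path.insert(0, str(src_dir))
```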
Is this intended? Are you only mocking in case there are missing dependencies now? (There was some discussion in the TD before, so I am not sure what's the latest design.) I think this is fine but just want to confirm if it's the expected behavior.
Overall I think this is very close already! I found a way to break it.

In `nodes.py` I add:

```python
from ...pkl import abc  # doesn't exist
```

Then I get an error:

```
ModuleNotFoundError: No module named 'demo_project.pkl'
```
This is not a common case so maybe it's not necessary to fix it within the PR. I think the strategy here is:

- first check if `demo_project.pkl` exists
- if it exists, mock `demo_project.pkl.abc`; if not, mock both `demo_project.pkl` and `demo_project.pkl.abc`
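A sketch of that check (module names are from the example above; the helper name is hypothetical):

```python
import importlib.util
import sys
from unittest.mock import MagicMock


def mock_unresolved_relative_import(parent: str, target: str) -> None:
    """Mock `from <parent> import <target>` when it cannot be resolved."""
    try:
        parent_exists = importlib.util.find_spec(parent) is not None
    except ModuleNotFoundError:
        parent_exists = False
    if not parent_exists:
        # e.g. demo_project.pkl itself is missing
        sys.modules[parent] = MagicMock()
    # mock the imported target either way, e.g. demo_project.pkl.abc
    sys.modules[f"{parent}.{target}"] = MagicMock()


# usage for the example above:
# mock_unresolved_relative_import("demo_project.pkl", "abc")
```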
> Is this intended? Are you only mocking in case there are missing dependencies now? (There was some discussion in the TD before, so I am not sure what's the latest design.) I think this is fine but just want to confirm if it's the expected behavior.

Yes, we are mocking only missing external dependencies
> Overall I think this is very close already! I found a way to break it.
>
> In `nodes.py` I add:
>
> ```python
> from ...pkl import abc  # doesn't exist
> ```
>
> Then I get an error:
>
> ```
> ModuleNotFoundError: No module named 'demo_project.pkl'
> ```
>
> This is not a common case so maybe it's not necessary to fix it within the PR. I think the strategy here is:
>
> - first check if `demo_project.pkl` exists
> - if it exists, mock `demo_project.pkl.abc`; if not, mock both `demo_project.pkl` and `demo_project.pkl.abc`
Yes, I tried mocking relative imports by converting them to absolute and did try different variations. You can find some here. So there are a few ways we can handle relative imports -
The issue with the approach of importing the module (if you remember from our conversation) is what I mentioned -

> Now if the `demo_project.pipelines.reporting.nodes` file has an import statement like `import seaborn as sn` and `seaborn` is not available, the module `demo_project.pipelines.reporting.nodes` also gets mocked as it is not importable (which was happening this morning).

So if we try to check for relative imports, there is a chance we mock the relative import just because there is an unresolved external dependency in the relative module. There are workarounds for this (like keeping track of the files which have mocked absolute imports, etc.).
What can we do -
Thank you
`_mock_missing_dependencies` and `_create_mock_imports` sound like exactly the same thing. It may be easier to understand if it's separated into two steps: one to create a collection of modules to be mocked, then simply loop through that collection and create the mocks. It probably makes this a bit easier to test as well.
Yes, I will rename and add comments for better understanding. The collection approach is what I initially did, but I felt that storing the missing dependencies and looping over them again was a two-pass process and not necessary. In the new approach, I mock the modules during the initial AST parsing.
Also, I will be adding performance metrics as I tested with different use cases -
Thank you
```
[08/23/24 16:33:59] WARNING  Kedro-Viz has mocked the following          data_loader.py:173
                             dependencies for lite-mode.
                             ['sklearn.base', 'matplotlib.pyplot',
                             'matplotlib', 'seaborn', 'PIL',
                             'sklearn.model_selection', 'sklearn',
                             'sklearn.metrics']
                             In order to get a complete experience of Viz,
                             please install the missing Kedro project
                             dependencies
```
💯
Description
Related to #1742
Kedro-Viz has lots of heavy dependencies. At the same time, it needs to import the pipeline code to be able to function, even when doing an initial export with `--save-file`. This means that sometimes using Kedro-Viz is difficult or impossible if Viz dependencies clash with the project dependencies, which can happen often.
One example of that has been the push for Pydantic v2 support https://github.com/kedro-org/kedro-viz/issues/1603.
Another example, @inigohidalgo says "due to the heavy deps from viz i usually have my dev venv but I create another one just for viz where i just install viz over whatever project I have installed, overriding the project's dependencies with viz's" and asks "do you know if anybody has tested using kedro viz as an "app", so installing it through pipx or smth similar? is that even possible with how viz works?". https://linen-slack.kedro.org/t/16380121/question-regarding-kedro-viz-why-is-there-a-restriction-on-p#38213e99-ba9d-4b60-9001-c0add0e2555b
The acceptance criterion for this is simple - as a user, I shouldn't need a full Spark installation to view Kedro-Viz for a project which uses Spark to process data.
Development notes
Added an option `--lite` in Kedro-Viz CLI. When users execute the command `kedro viz --lite`, it takes the approach mentioned below -

Using AST + Mock Imports:
Steps:
Testing:
I have tested basic Kedro projects with `spark.driver.host` configured to localhost. The idea was to test Spark initialization both via hooks and outside of hooks.

Observations:
On macOS Sonoma (2.4 GHz 8-Core Intel i9, 64GB) - these observations might differ as my system was a bit slow while doing the tests, but this should give a basic idea of performance. All the tests were run using `time <command>`. To summarize, `kedro viz --lite` was faster (~10-15 sec); though `get_mocked_modules()` took ~1-2 sec, initializing `DataCatalogLite` instead of `DataCatalog` saved time.

Limitations:
The metadata panel for a data node shows the data node type as `MemoryDataset` if the dataset is not resolved.
Next Steps:
`--lite` flag. Once this PR is merged and we have the above tasks complete, I will demo this feature in the Coffee chat (Sep 1st or 2nd week).

QA notes
Steps to test -
```
conda create -n viz-parser-test python=3.11
conda activate viz-parser-test
pip install kedro
kedro new --starter=spaceflights-pandas
cd spaceflights-pandas
kedro viz
kedro viz --lite
```
Credits:
Checklist
- `RELEASE.md` file