kedro-org / kedro-viz

Visualise your Kedro data and machine-learning pipelines and track your experiments.
https://demo.kedro.org
Apache License 2.0

Refactor data access manager for modular pipelines change #1939

Closed · ravi-kumar-pilla closed this pull request 14 hours ago

ravi-kumar-pilla commented 3 weeks ago

Description

Partially resolves #1899

Development notes

QA notes

Checklist

merelcht commented 2 weeks ago

Can you explain why it is now necessary to pass the modular_pipelines_repo_obj of type ModularPipelinesRepository in some of the methods here? With the refactor it seems like the ModularPipelinesRepository has a different function than the other repository classes/objects.

ravi-kumar-pilla commented 2 weeks ago

> Can you explain why it is now necessary to pass the modular_pipelines_repo_obj of type ModularPipelinesRepository in some of the methods here? With the refactor it seems like the ModularPipelinesRepository has a different function than the other repository classes/objects.

Hi @merelcht, every node in the flowchart model of Kedro-Viz has a two-way mapping: the modular pipeline repository maintains a map, node_mod_pipeline_map, and each node has a modular_pipelines field. Earlier we used the logic below to gather the modular_pipelines. With the refactor, we construct the modular pipelines before creating the nodes and build the node_mod_pipeline_map dict, which already contains the modular pipelines each node belongs to. So we pass the modular_pipelines_repo_obj while creating the nodes.

    # Old approach: derive modular_pipelines from the node's namespace
    # at validation time, inside the GraphNode model.
    @field_validator("modular_pipelines")
    @classmethod
    def set_modular_pipelines(cls, _, info: ValidationInfo):
        return cls._expand_namespaces(info.data["kedro_obj"].namespace)
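A rough sketch of the refactored flow described above (the class and method names here are illustrative stand-ins, not the actual Kedro-Viz API; only node_mod_pipeline_map and modular_pipelines come from the discussion):

```python
# Minimal sketch, assuming a repository that pre-computes the
# node -> modular-pipeline mapping before any GraphNode is created.
from dataclasses import dataclass, field


@dataclass
class ModularPipelinesRepository:
    # node_id -> set of modular pipeline ids the node belongs to,
    # built ahead of node creation.
    node_mod_pipeline_map: dict = field(default_factory=dict)

    def get_node_modular_pipelines(self, node_id: str) -> set:
        return self.node_mod_pipeline_map.get(node_id, set())


@dataclass
class GraphNode:
    id: str
    modular_pipelines: set


def create_node(node_id: str, repo: ModularPipelinesRepository) -> GraphNode:
    # Instead of re-deriving modular pipelines from the node's
    # namespace, look them up in the repository passed in.
    return GraphNode(
        id=node_id,
        modular_pipelines=repo.get_node_modular_pipelines(node_id),
    )
```

This is why the repository object travels down the call stack: the mapping already exists by the time each node is built, so node creation becomes a lookup rather than a namespace expansion.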
merelcht commented 1 week ago

Thanks @ravi-kumar-pilla this is clearer now. My main concern was whether the flow of data still adheres to the architecture (https://miro.com/app/board/uXjVOktl9TY=/ and https://user-images.githubusercontent.com/2032984/118201082-98dcfb80-b44e-11eb-8e63-d7cc7254d1d1.png) and as far as I can tell it does.

From what I understand: before the refactoring, the modular pipelines were populated as part of a call to the REST API, which would in turn call the expand_tree() method in services/modular_pipelines.py.

Now the modular pipelines are instead populated in the same way as the "regular" pipelines, the catalog, and the dataset stats: from inside populate_data(), which runs on the server side when the server is started. In fact this happens at the same time as the regular pipelines are populated. The ModularPipelinesRepository object is then passed as an argument to subsequent methods in the call stack because it is used to determine the nodes, inputs, and outputs.

ravi-kumar-pilla commented 1 week ago

> Thanks @ravi-kumar-pilla this is clearer now. My main concern was whether the flow of data still adheres to the architecture (https://miro.com/app/board/uXjVOktl9TY=/ and https://user-images.githubusercontent.com/2032984/118201082-98dcfb80-b44e-11eb-8e63-d7cc7254d1d1.png) and as far as I can tell it does.

Yes @merelcht, the data flow has not changed.

> From what I understand: before the refactoring the modular pipelines were populated as part of a call to the REST API. This would in turn call the expand_tree() method in services/modular_pipelines.py.

Yes

> Now instead the modular pipelines are populated in the same way as the "regular" pipelines, the catalog and dataset stats from inside populate_data() that happens on the server side when the server is started. In fact this happens at the same time as the regular pipelines are populated. The reason why the ModularPipelinesRepository object is then used as an argument to subsequent methods in the call stack, is because it's used to determine the nodes + inputs + outputs.

Yes, the ModularPipelinesRepository object is now passed to the create_node methods to assign the modular_pipelines property on the GraphNode.
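The startup-time population described above can be sketched roughly as follows (the class is an illustrative stand-in for the real data access manager, and the method bodies are hypothetical; only populate_data and the "populated together at server start" behaviour come from the discussion):

```python
# Hypothetical sketch of server-side population at startup.
from dataclasses import dataclass, field


@dataclass
class DataAccessManager:
    """Illustrative stand-in for the Kedro-Viz data access manager."""

    catalog: dict = field(default_factory=dict)
    pipelines: dict = field(default_factory=dict)
    modular_pipelines: dict = field(default_factory=dict)
    population_order: list = field(default_factory=list)

    def add_catalog(self, catalog: dict) -> None:
        self.catalog = catalog
        self.population_order.append("catalog")

    def add_pipelines(self, pipelines: dict) -> None:
        # Modular pipelines are now built here, alongside the regular
        # pipelines, rather than lazily on the first REST API call.
        self.pipelines = pipelines
        self.modular_pipelines = {name: {"tree": name} for name in pipelines}
        self.population_order.append("pipelines")


def populate_data(manager: DataAccessManager, catalog: dict, pipelines: dict) -> None:
    # Runs once, server side, when the server is started.
    manager.add_catalog(catalog)
    manager.add_pipelines(pipelines)
```

The point of the sketch is the ordering: everything the flowchart needs, including the modular pipeline tree, exists before the first API request arrives.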

ravi-kumar-pilla commented 4 days ago

> I think this is nearly there. My main suggestion is to make the asserts in the tests more explicit so instead of something like:
>
>     assert list(pipelines) == create_list(pipelines)
>
> you do:
>
>     assert ["car_pipeline", "plane_pipeline"] == create_list(pipelines)
>
> This explicitness makes the tests easier to read.

Hi @merelcht ,

I understand the edge-case tests are not straightforward to read in the assert statements, as you pointed out. I added extra comments in the new commit to improve readability.

Since these tests draw the expected JSON from expected_tree_for_edge_cases, extracting the details from the JSON would mean a lot of hardcoded strings in the test file.

I find this easier to manage in case the modular pipeline tree changes. Please have a look at the changes in my new commit and let me know if you feel there is a better way to handle this.

Thank you

merelcht commented 1 day ago

> I understand the edge-case tests are not straightforward to read in the assert statements, as you pointed out. I added extra comments in the new commit to improve readability.
>
> Since these tests draw the expected JSON from expected_tree_for_edge_cases, extracting the details from the JSON would mean a lot of hardcoded strings in the test file.
>
> I find this easier to manage in case the modular pipeline tree changes. Please have a look at the changes in my new commit and let me know if you feel there is a better way to handle this.

I think this is a step in the right direction. Personally, I think hardcoding to enhance test readability is better: it forces you to check that the code is still correct when you make changes, because you'll have to manually update your tests. However, since I'm not working on this codebase full time, I leave the final decision on how much further to improve the tests up to you and the rest of the Viz team.

ravi-kumar-pilla commented 1 day ago

> I think this is a step in the right direction. Personally, I think hardcoding to enhance test readability is better and forces you to check that the code is still correct when you make changes, because you'll have to manually update your tests. However, since I'm not working on this codebase full time I leave the final decision on how much further to improve the tests up to you and the rest of the Viz team.

Thank you @merelcht for taking the time to review the PR. Rashida and I discussed this, and she suggested having one integration test for the edge cases that checks the entire API response instead of individual properties. I have made the changes in this commit.

@rashidakanchwala , I removed the individual tests here as we already test for example_pipelines in test_add_pipelines. Please have a look. Thank you
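The single integration test described above could look roughly like this (the endpoint, helper names, and the expected_tree_for_edge_cases payload are illustrative assumptions, not the actual Kedro-Viz test code):

```python
# Hypothetical sketch of an edge-case integration test that compares
# the entire API response against one expected JSON document, instead
# of asserting on individual properties. All names are illustrative.
import json

# Stand-in for the expected_tree_for_edge_cases fixture: the full
# expected response kept as a single JSON document.
expected_tree_for_edge_cases = {
    "nodes": [{"id": "n1", "modular_pipelines": ["data_science"]}],
    "modular_pipelines": {"data_science": {"children": ["n1"]}},
}


def get_main_api_response() -> dict:
    # Stand-in for calling the Kedro-Viz main endpoint with a test
    # client; here we just round-trip the expected payload through
    # JSON to simulate a serialized response.
    return json.loads(json.dumps(expected_tree_for_edge_cases))


def test_edge_cases_full_response():
    # One assertion over the whole response: any drift in the modular
    # pipeline tree shows up as a diff against the expected JSON.
    assert get_main_api_response() == expected_tree_for_edge_cases
```

The trade-off matches the discussion: one whole-response comparison keeps the expected data in a single place, at the cost of a larger diff when the tree legitimately changes.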