Create massive pipeline to test with flowchart on Kedro-viz

rashidakanchwala commented 2 years ago

Description

Create a massive kedro-viz pipeline to stress-test flowchart features.

Context

The fluidity of flowchart interactions depends on the size of the pipeline. We currently don't have any massive pipelines, so we cannot stress-test many Kedro-Viz features. We know that a lot of data science projects have huge pipelines, so this issue is to make sure we build Kedro-Viz to handle them too.

Possible Implementation

Maybe we can just create a big JSON file with multiple large pipelines.

Checklist

[x] Include labels so that we can categorise your feature request

rashidakanchwala commented 2 years ago

@jmholzer recently did this https://github.com/kedro-org/kedro/pull/1795#issuecomment-1232014807 where he tested the runner with 1000 nodes. I am wondering if we can create a JSON file from the 1000-node pipeline and use it for the above.
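
A minimal sketch of the JSON idea, using only the standard library. The `nodes`/`edges` layout, the field names, and the `build_synthetic_graph` helper are assumptions made for illustration. This is not the schema Kedro-Viz actually consumes, which would have to be matched before such a file could be loaded.

```python
import json
import random


def build_synthetic_graph(n_tasks=1000, max_fan_in=3, seed=0):
    """Build a random DAG in which each task reads from earlier datasets.

    Hypothetical layout: every task writes exactly one dataset, and a task
    only reads datasets created before it, so the graph cannot contain a cycle.
    """
    rng = random.Random(seed)
    nodes = [{"id": "data_0", "type": "data"}]  # single raw input dataset
    edges = []
    for i in range(1, n_tasks + 1):
        task_id, out_id = f"task_{i}", f"data_{i}"
        nodes.append({"id": task_id, "type": "task"})
        nodes.append({"id": out_id, "type": "data"})
        # Pick up to max_fan_in distinct datasets produced before this task.
        fan_in = min(max_fan_in, i)
        for src in rng.sample(range(i), fan_in):
            edges.append({"source": f"data_{src}", "target": task_id})
        edges.append({"source": task_id, "target": out_id})
    return {"nodes": nodes, "edges": edges}


graph = build_synthetic_graph()
payload = json.dumps(graph)  # string that could be written to a .json file
print(len(graph["nodes"]), "nodes,", len(graph["edges"]), "edges")
```

A fixed seed keeps the generated graph identical between runs, which matters when comparing flowchart behaviour before and after a change.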

tynandebold commented 2 years ago

Great idea. Let's try to build this into the demo project so we don't have to maintain two data sources.

Thoughts from backlog grooming.

Default pipeline is our current view
In the pipeline dropdown we have an item that, when selected, loops through and generates a massive pipeline.

tynandebold commented 1 year ago

Another idea: find a team that has a massive pipeline and get it from them.

astrojuanlu commented 1 year ago

I know a few of them 😄

tynandebold commented 1 year ago

Please let us know where we can get one!

rashidakanchwala commented 9 months ago

We will use the insurex (QB vertical team) sanitized pipeline for this.

ravi-kumar-pilla commented 6 months ago

Hi Team,

Update:

I reached out to Shubham from CommercialX and got one of their pipelines. He also shared a Box link that goes over the setup. I have set it up locally and kedro viz run seems to load normally, though I had to comment out the Spark session initialization step.

Observations:

If the Spark session is instantiated without using hooks, ignoring hooks by default will have no effect.
Since it is a huge pipeline, an option to align nodes horizontally or vertically would be of great help.
It is not possible to quickly filter the DAG by dataset type (for example, to see only SparkDatasets). Our filter panel is currently limited; we should add more filter options.
The load time of the Kedro-Viz DAG is not bad (for this pipeline at least), but it might take longer with Spark sessions. (Each step needs further investigation.)

I would like some help from the framework team (@SajidAlamQB, @ankatiyar, if anyone has some time) to speed up the local Spark setup and successfully execute kedro run.

Thank you

ravi-kumar-pilla commented 6 months ago

CommercialX Kedro Viz Testing -

Observations:

Populating the pipelines dict (pipelines) takes roughly 50% of the time to start the server
Kedro Catalog creation takes up considerable time as well

Size of the data -

RUN 1 -

Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 2.6968612670898438
Time taken to create a kedro session:: 0.44796109199523926
[04/24/24 19:43:54] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here: https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader warnings.py:109
warnings.warn(
Time taken to create a kedro context:: 0.12806415557861328
Time taken to create a kedro session store:: 9.5367431640625e-07
Time taken to create a kedro catalog:: 15.315791845321655
[04/24/24 19:44:31] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/__init__.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched. warnings.py:109
warnings.warn(
Time taken to create pipeline dictionary:: 23.553779125213623
Time taken to create stats dictionary:: 7.510185241699219e-05
Time taken to load kedro project data:: 42.1427047252655
Time taken to populate pipelines:: 9.5367431640625e-07
[04/24/24 19:44:33] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
[04/24/24 19:44:34] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.3385379314422607
Time taken to start uvicorn server:: 43.49144387245178
Kedro Viz started successfully.

RUN 2 -

Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 1.7348659038543701
Time taken to create a kedro session:: 0.2879657745361328
[04/24/24 19:59:22] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here: https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader warnings.py:109
warnings.warn(
Time taken to create a kedro context:: 0.12883210182189941
Time taken to create a kedro session store:: 0.0
Time taken to create a kedro catalog:: 13.26403284072876
[04/24/24 19:59:54] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/__init__.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched. warnings.py:109
warnings.warn(
Time taken to create pipeline dictionary:: 21.121844053268433
Time taken to create stats dictionary:: 6.508827209472656e-05
Time taken to load kedro project data:: 36.5377631187439
Time taken to populate pipelines:: 1.1920928955078125e-06
[04/24/24 19:59:57] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.4388270378112793
Time taken to start uvicorn server:: 37.98678135871887
Kedro Viz started successfully.

Immediate RUN 3 -

Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 1.6473729610443115
Time taken to create a kedro session:: 0.2387540340423584
[04/24/24 20:01:57] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here: https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader warnings.py:109
warnings.warn(
Time taken to create a kedro context:: 0.12455415725708008
Time taken to create a kedro session store:: 9.5367431640625e-07
Time taken to create a kedro catalog:: 9.044120073318481
[04/24/24 20:02:15] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/__init__.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched. warnings.py:109
warnings.warn(
Time taken to create pipeline dictionary:: 9.573238134384155
Time taken to create stats dictionary:: 4.982948303222656e-05
Time taken to load kedro project data:: 20.628222227096558
Time taken to populate pipelines:: 9.5367431640625e-07
[04/24/24 20:02:16] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
[04/24/24 20:02:17] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.3532860279083252
Time taken to start uvicorn server:: 21.99152898788452
Kedro Viz started successfully.
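
The shares quoted in the observations can be checked directly from the timing lines. The sketch below parses four entries copied from RUN 1 and expresses each as a fraction of the start-up time. It assumes the "start uvicorn server" figure is the cumulative time up to server start, which the numbers suggest (42.14 + 1.34 ≈ 43.49) but the logs do not state. The helper names `parse_timings` and `shares` are made up for this example.

```python
import re

# Timing lines copied from RUN 1 above.
RUN_1 = """
Time taken to create a kedro catalog:: 15.315791845321655
Time taken to create pipeline dictionary:: 23.553779125213623
Time taken to load kedro project data:: 42.1427047252655
Time taken to start uvicorn server:: 43.49144387245178
"""

# Matches "Time taken to <step>:: <seconds>", including values such as 9.5e-07.
TIMING = re.compile(r"Time taken to (.+?):: ([0-9.eE+-]+)")


def parse_timings(log_text):
    """Return {step description: seconds} for every timing entry found."""
    return {step: float(value) for step, value in TIMING.findall(log_text)}


def shares(timings, total_key="start uvicorn server"):
    """Express each step as a fraction of the assumed total start-up time."""
    total = timings[total_key]
    return {
        step: seconds / total
        for step, seconds in timings.items()
        if step != total_key
    }


for step, share in shares(parse_timings(RUN_1)).items():
    print(f"{step}: {share:.0%}")
```

Because the pattern does not depend on line breaks, the same function also works on log output where several entries share one line.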

astrojuanlu commented 6 months ago

Populating the pipelines dict (pipelines) takes roughly 50% of the time to start the server

Kedro Catalog creation takes up considerable time as well

Good to know. What are the next steps?

The logs are a bit difficult to read. Maybe it would help to see a flamegraph, like this https://github.com/kedro-org/kedro/issues/3033#issue-1895014637
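
As a lightweight step toward that, the standard-library profiler gives the same cumulative-time breakdown in tabular form. This is only a sketch: `load_catalog`, `load_pipelines`, and `start_up` are hypothetical stand-ins that sleep, not Kedro or Kedro-Viz functions.

```python
import cProfile
import io
import pstats
import time


def load_catalog():
    """Hypothetical stand-in for an expensive start-up step."""
    time.sleep(0.02)


def load_pipelines():
    """Hypothetical stand-in for a second expensive start-up step."""
    time.sleep(0.03)


def start_up():
    load_catalog()
    load_pipelines()


profiler = cProfile.Profile()
profiler.enable()
start_up()
profiler.disable()

# Sort by cumulative time so the slowest call paths come first.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

`profiler.dump_stats(path)` writes the same data to a file, which is the usual input for external viewers that render profiles graphically.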

astrojuanlu commented 6 months ago

Also notice that, while testing with internal projects is useful, for us to confidently move forward with this we will probably have to generate some open source synthetic projects to test. See https://github.com/kedro-org/kedro/discussions/3790 for past discussion about this

ravi-kumar-pilla commented 6 months ago

Hi @astrojuanlu, thank you for the suggestions. I tested with the tools you mentioned and also prepared some rough notes on the next steps here.

To summarize, as a first step, loading the Kedro data asynchronously (async loading test branch) would help improve the Kedro-Viz load time for larger pipelines. If there are any new findings on the internal implementation of Kedro, I would be happy to discuss them in the next Tech design.
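
One possible reading of "async loading", sketched with the standard library and not taken from the test branch: run the blocking loaders in worker threads and wait for both. `load_catalog` and `load_pipelines` are hypothetical stand-ins that sleep. `asyncio.to_thread` needs Python 3.9 or later.

```python
import asyncio
import time


def load_catalog():
    """Hypothetical stand-in for catalog creation (blocking)."""
    time.sleep(0.2)
    return {"datasets": 3}


def load_pipelines():
    """Hypothetical stand-in for building the pipelines dict (blocking)."""
    time.sleep(0.2)
    return {"__default__": ["node_a", "node_b"]}


async def load_project_data():
    # Run both blocking loaders in worker threads and wait for both results.
    catalog, pipelines = await asyncio.gather(
        asyncio.to_thread(load_catalog),
        asyncio.to_thread(load_pipelines),
    )
    return catalog, pipelines


start = time.perf_counter()
catalog, pipelines = asyncio.run(load_project_data())
elapsed = time.perf_counter() - start
print(f"loaded in {elapsed:.2f}s")
```

Threads only overlap while the loaders wait on I/O or otherwise release the GIL. If catalog and pipeline resolution turn out to be CPU-bound Python, this pattern would not shorten the total time by itself, so profiling should come first.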

Thank you

astrojuanlu commented 6 months ago

Thanks @ravi-kumar-pilla. To summarize from the internal document:

Insights

It takes a long time to initialise the Kedro modules and reach the actual kedro viz run command (already sort of known, https://github.com/kedro-org/kedro/issues/1476)
The expensive operation before starting the viz server is loading the data from the Kedro session (possibly related to https://github.com/kedro-org/kedro/issues/2829 ?)
Most of the time taken to load the data is from catalog and pipelines_dict resolution, which worsens as the pipeline count increases

Next steps

Stress test with https://github.com/noklam/kedro-example/tree/master/stress-test-pipeline and summarize the results
Check the internals of _get_catalog() and pipelines for further optimization

And if I may add, I think

we need https://github.com/kedro-org/kedro/discussions/3790 to do this properly (beyond @noklam's pipeline linked above), and
the Framework team needs to be involved.

astrojuanlu commented 6 months ago

Adding a bit more context after a quick discussion:

These performance bottlenecks affect all projects, not only large ones, because startup times for Kedro are exceedingly long, and the data is also seemingly loaded in sequence. cc @yetudada
We will likely need not 1, but several "massive pipelines" to do a comprehensive performance analysis, where "massive" means
- 1 pipeline with increasingly large number of nodes (essentially https://github.com/kedro-org/kedro/discussions/3790)
- N pipelines of 1 node
- 1 pipeline and 1 node with increasingly large number of datasets
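
The three shapes above can be sketched as parameterised generators. `NodeSpec` and the helper names are hypothetical and framework-free so the example runs without Kedro installed. Each spec carries the same information as the `inputs`, `outputs`, and `name` arguments of Kedro's `node(...)`, so turning the specs into a real project should be a small step, though that step is not shown here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical, framework-free description of one pipeline node."""

    name: str
    inputs: List[str]
    outputs: List[str]


def chain(n_nodes: int) -> Dict[str, List[NodeSpec]]:
    """One pipeline whose nodes form a linear chain of length n_nodes."""
    nodes = [
        NodeSpec(f"node_{i}", [f"data_{i}"], [f"data_{i + 1}"])
        for i in range(n_nodes)
    ]
    return {"__default__": nodes}


def many_pipelines(n_pipelines: int) -> Dict[str, List[NodeSpec]]:
    """n_pipelines pipelines, each holding a single node."""
    return {
        f"pipeline_{p}": [NodeSpec(f"node_{p}", [f"in_{p}"], [f"out_{p}"])]
        for p in range(n_pipelines)
    }


def wide_node(n_datasets: int) -> Dict[str, List[NodeSpec]]:
    """One pipeline with one node that reads n_datasets inputs."""
    inputs = [f"in_{d}" for d in range(n_datasets)]
    return {"__default__": [NodeSpec("wide_node", inputs, ["out"])]}


def count_datasets(project: Dict[str, List[NodeSpec]]) -> int:
    """Count the distinct dataset names used across all pipelines."""
    names = set()
    for nodes in project.values():
        for spec in nodes:
            names.update(spec.inputs, spec.outputs)
    return len(names)


for size in (10, 100, 1000):
    print(
        size,
        len(chain(size)["__default__"]),
        len(many_pipelines(size)),
        count_datasets(wide_node(size)),
    )
```

Sweeping `size` over several orders of magnitude for each shape separately would show which dimension (nodes, pipelines, or datasets) drives start-up time.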
