kedro-org / kedro-viz

Visualise your Kedro data and machine-learning pipelines and track your experiments.
https://demo.kedro.org
Apache License 2.0

Create massive pipeline to test with flowchart on Kedro-viz #1064

Open rashidakanchwala opened 1 year ago

rashidakanchwala commented 1 year ago

Description

Create a massive kedro-viz pipeline to stress-test flowchart features.

Context

How fluid flowchart interactions feel depends on the size of the pipeline. We currently have no massive pipelines, so we cannot stress-test many Kedro-Viz features. We know that a lot of data science projects have huge pipelines, and this issue is to make sure we build Kedro-Viz to handle them as well.

Possible Implementation

Maybe we can just create a big JSON file with multiple large pipelines.
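As a rough illustration of the "big JSON file" idea, here is a minimal sketch that generates a synthetic layered DAG and serializes it to JSON. The field names (`nodes`, `edges`, `id`, `type`, `source`, `target`) and the function `make_synthetic_pipeline` are assumptions made for this example; they are a simplification and not the exact schema that Kedro-Viz consumes.

```python
import json
import random


def make_synthetic_pipeline(n_tasks, seed=0):
    """Build a DAG of alternating task and data nodes as a plain dict.

    Field names here are a simplified assumption, not the real Kedro-Viz schema.
    """
    rng = random.Random(seed)
    nodes = [{"id": "data_0", "name": "data_0", "type": "data"}]
    edges = []
    data_ids = ["data_0"]
    for i in range(n_tasks):
        task_id = f"task_{i}"
        out_id = f"data_{i + 1}"
        nodes.append({"id": task_id, "name": task_id, "type": "task"})
        nodes.append({"id": out_id, "name": out_id, "type": "data"})
        # Each task reads one or two datasets created earlier, so no cycle can form.
        n_inputs = min(len(data_ids), rng.randint(1, 2))
        for src in rng.sample(data_ids, n_inputs):
            edges.append({"source": src, "target": task_id})
        # Each task writes exactly one new dataset.
        edges.append({"source": task_id, "target": out_id})
        data_ids.append(out_id)
    return {"nodes": nodes, "edges": edges}


pipeline = make_synthetic_pipeline(1000)
payload = json.dumps(pipeline)
print(len(pipeline["nodes"]), "nodes,", len(pipeline["edges"]), "edges")
```

Raising `n_tasks` gives a larger graph; a real test fixture would still have to be mapped onto whatever structure the flowchart actually expects.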


rashidakanchwala commented 1 year ago

@jmholzer recently did this in https://github.com/kedro-org/kedro/pull/1795#issuecomment-1232014807, where he tested the runner with 1000 nodes. I wonder whether we can generate a JSON file from that 1000-node pipeline and use it for the above.

tynandebold commented 1 year ago

Great idea. Let's try to build this into the demo project so we don't have to maintain two data sources.

Thoughts from backlog grooming.

tynandebold commented 9 months ago

Another idea: find a team that has a massive pipeline and get it from them.

astrojuanlu commented 9 months ago

I know a few of them 😄

tynandebold commented 9 months ago

Please let us know where we can get one!

rashidakanchwala commented 5 months ago

We will use the insurex (QB vertical team) sanitized pipeline for this.

ravi-kumar-pilla commented 2 months ago

Hi Team,

Update:

I reached out to Shubham from CommercialX and got one of their pipelines. He also shared a Box link that walks through the setup. I have set it up locally and kedro viz run seems to load normally, though I had to comment out the Spark session initialization step.

Observations:

  1. If the Spark session is instantiated without using hooks, ignoring hooks by default will have no effect.
  2. Since it is a huge pipeline, an option to lay out the nodes horizontally or vertically would be a great help.
  3. It is not possible to quickly filter the DAG by dataset type (for example, to see only SparkDatasets). At the moment our filter panel is limited. We should add more filter options.
  4. The load time of the Kedro-Viz DAG is not bad (for this pipeline at least), but it might take longer because of Spark sessions. (Each step needs further investigation.)
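To make observation 3 concrete, here is a minimal sketch of filtering nodes by dataset type. The `Node` class, its fields and the type strings are hypothetical and only illustrate the idea; they are not the Kedro-Viz data model or its filter panel logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical flowchart node; not the Kedro-Viz model."""

    id: str
    type: str  # "task" or "data"
    dataset_type: Optional[str] = None  # only meaningful for data nodes


def filter_by_dataset_type(nodes: List[Node], wanted: str) -> List[Node]:
    """Keep only the data nodes whose dataset type matches `wanted`."""
    return [n for n in nodes if n.type == "data" and n.dataset_type == wanted]


# Illustrative sample data; the type strings are placeholders.
nodes = [
    Node("raw_sales", "data", "spark.SparkDataset"),
    Node("clean_sales", "task"),
    Node("sales_report", "data", "pandas.CSVDataset"),
    Node("features", "data", "spark.SparkDataset"),
]

spark_nodes = filter_by_dataset_type(nodes, "spark.SparkDataset")
print([n.id for n in spark_nodes])
```

A real filter would also need to decide what happens to the task nodes that connect the remaining datasets, which this sketch deliberately leaves out.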

I would like some help from the framework team (@SajidAlamQB, @ankatiyar, if anyone has some time) to speed up the local Spark setup and get kedro run to execute successfully.

Thank you

ravi-kumar-pilla commented 2 months ago

CommercialX Kedro Viz Testing -

Observations:

  1. Populating the pipelines dictionary (dict(pipelines)) takes about 50% of the time needed to start the server.
  2. Creating the Kedro catalog takes considerable time as well.

Size of the data -

[Image: size of the data]

RUN 1 -

Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 2.6968612670898438
Time taken to create a kedro session:: 0.44796109199523926
[04/24/24 19:43:54] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here: https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader warnings.py:109
warnings.warn(
Time taken to create a kedro context:: 0.12806415557861328
Time taken to create a kedro session store:: 9.5367431640625e-07
Time taken to create a kedro catalog:: 15.315791845321655
[04/24/24 19:44:31] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/__init__.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched. warnings.py:109
warnings.warn(
Time taken to create pipeline dictionary:: 23.553779125213623
Time taken to create stats dictionary:: 7.510185241699219e-05
Time taken to load kedro project data:: 42.1427047252655
Time taken to populate pipelines:: 9.5367431640625e-07
[04/24/24 19:44:33] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
[04/24/24 19:44:34] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.3385379314422607
Time taken to start uvicorn server:: 43.49144387245178
Kedro Viz started successfully.

RUN 2 -

Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 1.7348659038543701
Time taken to create a kedro session:: 0.2879657745361328
[04/24/24 19:59:22] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here: https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader warnings.py:109
warnings.warn(
Time taken to create a kedro context:: 0.12883210182189941
Time taken to create a kedro session store:: 0.0
Time taken to create a kedro catalog:: 13.26403284072876
[04/24/24 19:59:54] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/__init__.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched. warnings.py:109
warnings.warn(
Time taken to create pipeline dictionary:: 21.121844053268433
Time taken to create stats dictionary:: 6.508827209472656e-05
Time taken to load kedro project data:: 36.5377631187439
Time taken to populate pipelines:: 1.1920928955078125e-06
[04/24/24 19:59:57] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.4388270378112793
Time taken to start uvicorn server:: 37.98678135871887
Kedro Viz started successfully.

Immediate RUN 3 -

Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 1.6473729610443115
Time taken to create a kedro session:: 0.2387540340423584
[04/24/24 20:01:57] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here: https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader warnings.py:109
warnings.warn(
Time taken to create a kedro context:: 0.12455415725708008
Time taken to create a kedro session store:: 9.5367431640625e-07
Time taken to create a kedro catalog:: 9.044120073318481
[04/24/24 20:02:15] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/__init__.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched. warnings.py:109
warnings.warn(
Time taken to create pipeline dictionary:: 9.573238134384155
Time taken to create stats dictionary:: 4.982948303222656e-05
Time taken to load kedro project data:: 20.628222227096558
Time taken to populate pipelines:: 9.5367431640625e-07
[04/24/24 20:02:16] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
[04/24/24 20:02:17] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.3532860279083252
Time taken to start uvicorn server:: 21.99152898788452
Kedro Viz started successfully.

astrojuanlu commented 2 months ago
  • Populating the pipelines dictionary (dict(pipelines)) takes about 50% of the time needed to start the server.
  • Creating the Kedro catalog takes considerable time as well.

Good to know. What are the next steps?

The logs are a bit difficult to read. Maybe it would help to see a flamegraph, like this https://github.com/kedro-org/kedro/issues/3033#issue-1895014637
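A flamegraph is built from profiler output, so a first step could look like the minimal sketch below, which uses the standard-library `cProfile` and `pstats` modules. The functions `load_catalog`, `load_pipelines` and `load_project_data` are hypothetical stand-ins that only sleep; they are not Kedro or Kedro-Viz functions.

```python
import cProfile
import io
import pstats
import time


def load_catalog():
    """Hypothetical stand-in for catalog creation."""
    time.sleep(0.02)


def load_pipelines():
    """Hypothetical stand-in for building the pipelines dictionary."""
    time.sleep(0.03)


def load_project_data():
    load_catalog()
    load_pipelines()


profiler = cProfile.Profile()
profiler.enable()
load_project_data()
profiler.disable()

# Sort by cumulative time so the slowest call chains appear first.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)
report = buffer.getvalue()
print(report)
```

The same profiler object can be written to disk with `profiler.dump_stats(...)`, and that file can then be opened in a profile viewer to get a graphical view.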

astrojuanlu commented 2 months ago

Also note that, while testing with internal projects is useful, we will probably have to generate some open-source synthetic projects to test with before we can move forward confidently. See https://github.com/kedro-org/kedro/discussions/3790 for past discussion about this.

ravi-kumar-pilla commented 2 months ago

Hi @astrojuanlu, thank you for the suggestions. I tested with the tools you mentioned and also prepared some rough notes on the next steps here.

To summarize, as a first step, loading the Kedro data asynchronously (async loading test branch) should help improve the Kedro-Viz load time for larger pipelines. If there are any new findings on the internal implementation of Kedro, I would be happy to discuss them in the next Tech Design.
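One possible reading of "loading asynchronously" is to run the independent loading steps concurrently instead of one after the other. The sketch below shows that pattern with `asyncio.gather` and `asyncio.to_thread` (Python 3.9+). It is not the code from the test branch, and `load_catalog` and `load_pipelines` are hypothetical stand-ins that simulate slow work with `time.sleep`.

```python
import asyncio
import time


def load_catalog():
    """Hypothetical stand-in for a slow, blocking catalog load."""
    time.sleep(0.2)
    return {"datasets": 3}


def load_pipelines():
    """Hypothetical stand-in for a slow, blocking pipeline load."""
    time.sleep(0.3)
    return {"__default__": ["node_a", "node_b"]}


async def load_project_data():
    # Run both blocking loaders in worker threads and wait for both results.
    catalog, pipelines = await asyncio.gather(
        asyncio.to_thread(load_catalog),
        asyncio.to_thread(load_pipelines),
    )
    return catalog, pipelines


start = time.perf_counter()
catalog, pipelines = asyncio.run(load_project_data())
elapsed = time.perf_counter() - start
print(f"loaded in {elapsed:.2f}s")
```

Threads only shorten the wall-clock time when the work spends most of its time waiting (I/O, subprocesses, code that releases the GIL). If catalog and pipeline creation are CPU-bound Python code, or if one depends on the other, this pattern would bring little benefit.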

Thank you

astrojuanlu commented 1 month ago

Thanks @ravi-kumar-pilla. To summarize from the internal document:

Insights

Next steps

  1. Stress test with https://github.com/noklam/kedro-example/tree/master/stress-test-pipeline and summarize the results
  2. Inspect the internals of _get_catalog() and pipelines to find further optimizations

And if I may add, I think

astrojuanlu commented 1 month ago

Adding a bit more context after a quick discussion: