Open rashidakanchwala opened 2 years ago
@jmholzer recently did this https://github.com/kedro-org/kedro/pull/1795#issuecomment-1232014807 where he tested the runner with 1000 nodes. I am wondering if we can create a json from the pipeline with 1000 nodes and use it for the above.
Great idea. Let's try to build this into the demo project so we don't have to maintain two data sources.
Thoughts from backlog grooming.
Another idea: find a team that has a massive pipeline and get it from them.
I know a few of them 😄
Please let us know where we can get one!
We will use the insurex (QB vertical team) sanitized pipeline for this.
Hi Team,
Update:
I reached out to Shubham from CommercialX and got one of their pipelines. He also shared a Box link to go over the setup. I have set it up locally, and kedro viz run seems to load fairly normally, though I had to comment out the Spark session initialization step.
Observations:
I would like to get some help from the framework team (@SajidAlamQB, @ankatiyar, if anyone has some time) to speed up the local Spark setup and successfully execute kedro run.
Thank you
CommercialX Kedro Viz Testing -
Observations:
dict(pipelines) takes 50% of the time to start the server
Size of the data -
RUN 1 -
Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 2.6968612670898438
Time taken to create a kedro session:: 0.44796109199523926
[04/24/24 19:43:54] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in warnings.py:109
Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here:
https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader
warnings.warn(
Time taken to create a kedro context:: 0.12806415557861328
Time taken to create a kedro session store:: 9.5367431640625e-07
Time taken to create a kedro catalog:: 15.315791845321655
[04/24/24 19:44:31] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/init.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is warnings.py:109
required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context
already launched.
warnings.warn(
Time taken to create pipeline dictionary:: 23.553779125213623
Time taken to create stats dictionary:: 7.510185241699219e-05
Time taken to load kedro project data:: 42.1427047252655
Time taken to populate pipelines:: 9.5367431640625e-07
[04/24/24 19:44:33] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
[04/24/24 19:44:34] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.3385379314422607
Time taken to start uvicorn server:: 43.49144387245178
Kedro Viz started successfully.
RUN 2 -
Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 1.7348659038543701
Time taken to create a kedro session:: 0.2879657745361328
[04/24/24 19:59:22] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in warnings.py:109
Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here:
https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader
warnings.warn(
Time taken to create a kedro context:: 0.12883210182189941
Time taken to create a kedro session store:: 0.0
Time taken to create a kedro catalog:: 13.26403284072876
[04/24/24 19:59:54] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/init.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is warnings.py:109
required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context
already launched.
warnings.warn(
Time taken to create pipeline dictionary:: 21.121844053268433
Time taken to create stats dictionary:: 6.508827209472656e-05
Time taken to load kedro project data:: 36.5377631187439
Time taken to populate pipelines:: 1.1920928955078125e-06
[04/24/24 19:59:57] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.4388270378112793
Time taken to start uvicorn server:: 37.98678135871887
Kedro Viz started successfully.
Immediate RUN 3 -
Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 1.6473729610443115
Time taken to create a kedro session:: 0.2387540340423584
[04/24/24 20:01:57] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in warnings.py:109
Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here:
https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader
warnings.warn(
Time taken to create a kedro context:: 0.12455415725708008
Time taken to create a kedro session store:: 9.5367431640625e-07
Time taken to create a kedro catalog:: 9.044120073318481
[04/24/24 20:02:15] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/init.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is warnings.py:109
required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context
already launched.
warnings.warn(
Time taken to create pipeline dictionary:: 9.573238134384155
Time taken to create stats dictionary:: 4.982948303222656e-05
Time taken to load kedro project data:: 20.628222227096558
Time taken to populate pipelines:: 9.5367431640625e-07
[04/24/24 20:02:16] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
[04/24/24 20:02:17] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list
in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.3532860279083252
Time taken to start uvicorn server:: 21.99152898788452
Kedro Viz started successfully.
- Populating the pipelines dictionary, dict(pipelines), takes 50% of the time to start the server
- Kedro catalog creation takes up considerable time as well
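To make the timing lines above easier to compare, here is a minimal sketch that pulls the "Time taken to ...:: seconds" entries out of a log and computes each step's share of startup. The sample text is copied from RUN 1; the helper names are hypothetical, and it assumes the final "start uvicorn server" figure is cumulative, which the numbers suggest but the thread does not state.

```python
import re

# Matches lines such as "Time taken to create a kedro catalog:: 15.31"
TIMING_LINE = re.compile(
    r"^Time taken to (?P<step>.+?)::\s*(?P<seconds>[0-9.eE+-]+)\s*$"
)

# A few timing lines copied from RUN 1 in this thread.
RUN_1 = """\
Time taken to create a kedro session store:: 9.5367431640625e-07
Time taken to create a kedro catalog:: 15.315791845321655
Time taken to create pipeline dictionary:: 23.553779125213623
Time taken to load kedro project data:: 42.1427047252655
Time taken to populate viz repositories:: 1.3385379314422607
Time taken to start uvicorn server:: 43.49144387245178
"""


def parse_timings(log_text):
    """Return {step name: seconds} for every timing line in the log."""
    timings = {}
    for line in log_text.splitlines():
        match = TIMING_LINE.match(line.strip())
        if match:
            timings[match.group("step")] = float(match.group("seconds"))
    return timings


def share_of_total(timings, step, total_step="start uvicorn server"):
    """Fraction of the (assumed cumulative) total spent in one step."""
    return timings[step] / timings[total_step]


if __name__ == "__main__":
    timings = parse_timings(RUN_1)
    for step in ("create a kedro catalog", "create pipeline dictionary"):
        print(f"{step}: {share_of_total(timings, step):.0%} of startup")
```

Warning lines and wrapped console output are ignored because they do not match the pattern, so the full log of a run can be pasted in unchanged.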
Good to know. What are the next steps?
The logs are a bit difficult to read. Maybe it would help to see a flamegraph, like this https://github.com/kedro-org/kedro/issues/3033#issue-1895014637
Also notice that, while testing with internal projects is useful, for us to confidently move forward with this we will probably have to generate some open source synthetic projects to test. See https://github.com/kedro-org/kedro/discussions/3790 for past discussion about this
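A flamegraph needs an external tool. As a lighter first pass, the standard library's cProfile can rank calls by cumulative time. The sketch below is only an illustration: build_pipelines_stand_in is a hypothetical placeholder for whatever call is being investigated (for example, dict(pipelines) inside a real project), not Kedro code.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the slowest call chains appear first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def build_pipelines_stand_in():
    """Hypothetical placeholder for the slow step being profiled."""
    return sum(i * i for i in range(50_000))


if __name__ == "__main__":
    _, report = profile_call(build_pipelines_stand_in)
    print(report)
```

The saved stats can also be fed to a separate flamegraph viewer if a visual view is preferred.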
Hi @astrojuanlu, thank you for the suggestions. I tested with the tools you mentioned and also prepared some rough notes on the next steps here.
To summarize, as a first step, loading Kedro data asynchronously (async loading test branch) would help improve the Kedro-Viz load time for larger pipelines. If there are any new findings on the internal implementation of Kedro, I would be happy to discuss them in the next Tech design.
Thank you
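The asynchronous loading idea can be sketched roughly as follows. This is not the code on the test branch: load_project_data and AppState are hypothetical stand-ins, and the sketch only shows the general pattern of moving a blocking load onto a worker thread so the event loop (and a server running on it) is free to respond in the meantime.

```python
import asyncio
import time


def load_project_data():
    """Hypothetical stand-in for the slow, blocking project load."""
    time.sleep(0.2)
    return {"pipelines": ["__default__"]}


class AppState:
    """Holds the loaded data plus a flag that signals readiness."""

    def __init__(self):
        self.data = None
        self.ready = asyncio.Event()


async def load_in_background(state):
    # asyncio.to_thread (Python 3.9+) runs the blocking call off the loop.
    state.data = await asyncio.to_thread(load_project_data)
    state.ready.set()


async def main():
    state = AppState()  # created inside the running loop
    loader = asyncio.create_task(load_in_background(state))
    # A server could start accepting requests here, before data is loaded.
    served_before_ready = not state.ready.is_set()
    await state.ready.wait()
    await loader
    return served_before_ready, state.data


if __name__ == "__main__":
    print(asyncio.run(main()))
```

This shortens the time until the server responds, not the total load time, so the catalog and pipeline dictionary costs measured above would still need separate attention.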
Thanks @ravi-kumar-pilla. To summarize from the internal document:
- kedro viz run command (already sort of known, https://github.com/kedro-org/kedro/issues/1476)
- pipelines_dict resolution, which worsens as the pipeline count increases
- _get_catalog() and pipelines to further optimize
And if I may add, I think
Adding a bit more context after a quick discussion:
Description
Create a massive kedro-viz pipeline to stress-test flowchart features.
Context
The fluidity of flowchart interactions depends on the size of the pipeline. Currently we don't have massive pipelines, so we cannot stress-test many Kedro-Viz features. We know a lot of data science projects have huge pipelines. This issue is to make sure we build Kedro-Viz to handle massive pipelines as well.
Possible Implementation
Maybe we can just create a big JSON file with multiple large pipelines.
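One way to produce such a file is to generate a synthetic graph programmatically. The sketch below is only an illustration: the "nodes"/"edges" schema, field names, and function name are assumptions made for this example and do not claim to match the JSON that Kedro-Viz actually consumes. Each task writes one dataset and reads only from earlier datasets, which keeps the graph acyclic.

```python
import json
import random


def build_synthetic_pipeline(n_tasks=1000, max_inputs=3, seed=0):
    """Build a random acyclic task/dataset graph (hypothetical schema)."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    nodes, edges = [], []
    for i in range(n_tasks):
        task_id, data_id = f"task_{i}", f"data_{i}"
        nodes.append({"id": task_id, "type": "task"})
        nodes.append({"id": data_id, "type": "data"})
        # Every task produces exactly one dataset.
        edges.append({"source": task_id, "target": data_id})
        if i > 0:
            # Inputs come only from earlier datasets, so no cycles arise.
            n_inputs = min(i, rng.randint(1, max_inputs))
            for j in rng.sample(range(i), n_inputs):
                edges.append({"source": f"data_{j}", "target": task_id})
    return {"nodes": nodes, "edges": edges}


if __name__ == "__main__":
    graph = build_synthetic_pipeline()
    with open("synthetic_pipeline.json", "w") as handle:
        json.dump(graph, handle)
    print(len(graph["nodes"]), "nodes,", len(graph["edges"]), "edges")
```

Raising n_tasks or max_inputs scales the graph size and density, which makes it possible to check how flowchart interactions degrade as the pipeline grows.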