kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

[Stress Testing] - Data Catalog and Config Loader #4125

Closed noklam closed 2 weeks ago

noklam commented 2 months ago

Description

Create stress tests for individual components:

For DataCatalog, the most important thing is to test it within the pipeline, through the CLI, and in isolation by simulating scenarios that call specific methods (such as add_feed_dict). The tests themselves should cover different sets and combinations of parameters, datasets and patterns. https://github.com/kedro-org/kedro/issues/3957#issuecomment-2299105368

This ticket combines the two components because I believe the setup will be similar, or at least there needs to be some mechanism to set up different datasets first, so it makes sense to bundle them together.
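One way to enumerate "different sets and combinations of parameters, datasets and patterns" is a simple cross product of sizes. The sketch below is only an illustration: the size lists, the `build_scenarios` helper and the dictionary keys are hypothetical and not part of Kedro.

```python
from itertools import product

# Hypothetical sizes for each dimension; tune these to the machine and time budget.
N_DATASETS = [10, 100, 1000]
N_PARAMETERS = [0, 100]
N_PATTERNS = [0, 1, 10]  # number of dataset factory patterns


def build_scenarios():
    """Return one scenario dict per combination of the three dimensions."""
    return [
        {"n_datasets": d, "n_parameters": p, "n_patterns": f}
        for d, p, f in product(N_DATASETS, N_PARAMETERS, N_PATTERNS)
    ]


if __name__ == "__main__":
    for scenario in build_scenarios():
        print(scenario)
```

Each scenario could then drive the same setup code whether the catalog is exercised inside a pipeline run, through the CLI, or by calling a method directly.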

Context

https://github.com/kedro-org/kedro/issues/3957#issuecomment-2298861806

Component stress test:

  • The main goal is to benchmark the performance of individual components; this will tell us whether refactoring work has a positive or negative impact. Currently we only check whether tests pass, so we have no idea if a change slows things down. We have done this in the past, but usually on an ad-hoc basis; we should run it regularly (or at least per release).

The direction is simple: we want to measure how run time changes against the number of entries. We would start with Datasets and Catalog, as this fits in with the DataCatalog2.0 work and will be immediately useful.

  • DataCatalog (test # of datasets with catalog.yml & dataset factory)
  • ConfigLoader(# of parameters)
  • Optional: pipelines generated in loops (Dynamic pipeline)
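A minimal sketch of measuring time against the number of entries, assuming a synthetic catalog-style dictionary is an acceptable input. The helpers `make_catalog_config`, `benchmark` and `stand_in_component` are hypothetical names; the stand-in would be swapped for the real component under test (for example a call that builds a DataCatalog from the config).

```python
import time
from statistics import median


def make_catalog_config(n_datasets):
    """Build a synthetic catalog-style dict with n_datasets entries.

    The entry layout (type + filepath) mirrors a typical catalog.yml entry,
    but the names and paths are made up for illustration.
    """
    return {
        f"dataset_{i}": {
            "type": "pandas.CSVDataset",
            "filepath": f"data/01_raw/dataset_{i}.csv",
        }
        for i in range(n_datasets)
    }


def benchmark(func, sizes, repeats=5):
    """Return {size: median seconds} for func called on configs of each size."""
    results = {}
    for n in sizes:
        config = make_catalog_config(n)  # built outside the timed region
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            func(config)
            timings.append(time.perf_counter() - start)
        results[n] = median(timings)
    return results


def stand_in_component(config):
    """Placeholder for the component under test; replace with the real call."""
    return {name: dict(entry) for name, entry in config.items()}


if __name__ == "__main__":
    for n, seconds in benchmark(stand_in_component, [10, 100, 1000]).items():
        print(f"{n:>5} entries: {seconds:.6f}s")
```

Taking the median over several repeats reduces noise from one-off slow runs, which matters when comparing results between releases.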

This can address:

astrojuanlu commented 1 month ago

xref https://github.com/kedro-org/kedro/pull/4154

ankatiyar commented 2 weeks ago

Keeping this open for performance tests for KedroDataCatalog