kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Spike: design example kedro projects that can be used to assess performance issues #3957

Open merelcht opened 2 weeks ago

merelcht commented 2 weeks ago

Description

Prework for #3866

Context

To create example Kedro projects that can be used to assess the performance of Kedro and Kedro-Viz, we need to gather requirements for what makes a pipeline complex. Some of the moving parts are the number of nodes, the number of pipelines, and the number of datasets, but these may not be all that is required to create a proper "family" of test projects.

Possible Implementation

Good starting point: https://github.com/noklam/kedro-example/blob/master/stress-test-pipeline/src/stress_test_pipeline/pipeline.py
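
To make the "moving parts" concrete, here is a rough sketch of how a synthetic pipeline of configurable size could be generated. It is not taken from the linked repository. `passthrough`, `NodeSpec`, `build_chain_specs`, `to_kedro_pipeline` and the `PROJECT_SIZES` values are hypothetical names and numbers chosen for illustration. Only `kedro.pipeline.Pipeline` and `kedro.pipeline.node` are real Kedro APIs, and they are imported lazily so the spec generation works without Kedro installed.

```python
from dataclasses import dataclass
from typing import Callable, List


def passthrough(data):
    """Trivial node function, so framework overhead dominates the runtime."""
    return data


@dataclass(frozen=True)
class NodeSpec:
    """Plain description of one node (hypothetical helper, not a Kedro class)."""
    func: Callable
    inputs: str
    outputs: str
    name: str


def build_chain_specs(n_nodes: int, prefix: str = "stress") -> List[NodeSpec]:
    """Describe a linear chain: <prefix>_data_0 -> node_0 -> <prefix>_data_1 -> ..."""
    return [
        NodeSpec(
            func=passthrough,
            inputs=f"{prefix}_data_{i}",
            outputs=f"{prefix}_data_{i + 1}",
            name=f"{prefix}_node_{i}",
        )
        for i in range(n_nodes)
    ]


def to_kedro_pipeline(specs: List[NodeSpec]):
    """Convert the specs into a Kedro Pipeline (requires Kedro to be installed)."""
    from kedro.pipeline import Pipeline, node  # lazy import on purpose

    return Pipeline(
        [node(s.func, inputs=s.inputs, outputs=s.outputs, name=s.name) for s in specs]
    )


# Assumed sizes for a "family" of test projects; the real values are still to be decided.
PROJECT_SIZES = {"small": 10, "medium": 100, "large": 1000}
```

A linear chain of `n` nodes touches `n + 1` datasets, so one parameter scales both counts. Fan-out shapes or multiple registered pipelines would need additional generators.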

datajoely commented 2 weeks ago

Heavy dependency imports would be great here too

merelcht commented 2 weeks ago

> Heavy dependency imports would be great here too

Core dependencies of Kedro or just any?

datajoely commented 2 weeks ago

Sorry, I meant things like PyTorch / TensorFlow / Spark / pandas
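
As a rough illustration of why heavy imports matter, the sketch below times a cold import of each library using only the standard library. `time_import` and `HEAVY_MODULES` are hypothetical names, and the module names (`torch`, `tensorflow`, `pyspark`, `pandas`) are assumed to be the import names of the libraries mentioned. Packages that are not installed are skipped rather than treated as errors.

```python
import importlib
import importlib.util
import time
from typing import Optional

# Assumed import names for the libraries mentioned above.
HEAVY_MODULES = ["torch", "tensorflow", "pyspark", "pandas"]


def time_import(module_name: str) -> Optional[float]:
    """Return the seconds taken to import a top-level module, or None if it is not installed.

    If the module is already in sys.modules, the result will be close to zero,
    so run this in a fresh interpreter for a meaningful number.
    """
    if importlib.util.find_spec(module_name) is None:
        return None
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


if __name__ == "__main__":
    for name in HEAVY_MODULES:
        elapsed = time_import(name)
        if elapsed is None:
            print(f"{name}: not installed, skipped")
        else:
            print(f"{name}: {elapsed:.3f}s")
```

A test project could place such imports at module level in its node files, so that the import cost is paid whenever the pipeline modules are loaded.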