kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Environment Forking for Flexible Data Source Configuration #4076

Open pascalwhoop opened 1 month ago

pascalwhoop commented 1 month ago

We're considering adding a feature to allow more flexible configuration of data sources across environments. The primary use case is testing part of a pipeline against production data without having to copy that data manually. Thought I'd share it here to see whether others would find this useful as well.

Proposed Features:

  1. Environment forking flag: `kedro run --from-nodes a,b,c --fork-from prod --env dev` would read the initial datasets from the `prod` environment and then execute the rest of the pipeline in the `dev` environment.

  2. Dataset copying command: `kedro copy --datasets a,b,c --from prod --to dev` would copy the specified datasets from the `prod` environment to `dev` before running the pipeline.

  3. Inverse tag filtering: `kedro run --without-tags tag1,tag2` would exclude nodes carrying the given tags, the inverse of the existing `--tags` option.
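To make the three proposals concrete, here is a minimal sketch of the intended semantics. It uses plain dicts in place of Kedro's `DataCatalog` and node objects, and the function names (`fork_catalog`, `copy_datasets`, `without_tags`) are illustrative only, not part of Kedro's API:

```python
def fork_catalog(base_env, fork_env, fork_datasets):
    """Proposal 1 (--fork-from): resolve the named initial datasets from the
    forked environment (e.g. prod), everything else from the base env (e.g. dev)."""
    merged = dict(base_env)
    for name in fork_datasets:
        if name in fork_env:
            merged[name] = fork_env[name]
    return merged


def copy_datasets(src_env, dst_env, names):
    """Proposal 2 (kedro copy): copy selected dataset entries from one
    environment's catalog to another before a run."""
    for name in names:
        dst_env[name] = src_env[name]


def without_tags(nodes, excluded_tags):
    """Proposal 3 (--without-tags): keep only nodes that carry none of the
    excluded tags -- the inverse of filtering by --tags."""
    excluded = set(excluded_tags)
    return [n for n in nodes if not (set(n["tags"]) & excluded)]


# Example: fork dataset "a" from prod while running everything else in dev.
prod = {"a": "s3://prod/a.parquet", "b": "s3://prod/b.parquet"}
dev = {"a": "data/dev/a.parquet", "b": "data/dev/b.parquet", "c": "data/dev/c.csv"}
catalog = fork_catalog(dev, prod, ["a"])
# catalog["a"] now points at the prod path; "b" and "c" stay on dev paths.

# Example: drop all nodes tagged "viz" from a run.
nodes = [{"name": "train", "tags": ["ml"]}, {"name": "report", "tags": ["viz"]}]
kept = without_tags(nodes, ["viz"])  # only the "train" node remains
```

In a real implementation the forking step would presumably happen at config-loader / catalog-construction time, so that only dataset *definitions* are redirected and no data is duplicated; the copy command, by contrast, materialises the data in the target environment.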

Use Case:

Current Limitations:

Potential Implementation:

Long-term Consideration:

SajidAlamQB commented 1 month ago

Hi @pascalwhoop, thanks for raising these suggestions, they seem worth exploring!