Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
We're considering adding a feature to allow more flexible configuration of data sources across environments. The primary use case is to enable testing part of the pipeline using production data without needing to copy data manually. Thought I'd share here to see if others find this useful as well.
Proposed Features:
Environment Forking Flag:
Example: kedro run --from-nodes a,b,c --fork-from prod --env dev
This would read initial datasets from the 'prod' environment and then execute the rest of the pipeline in the 'dev' environment.
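To make the intended semantics concrete, here is a minimal sketch of the forking logic as a plain catalog-merge. The function name, the dict-based catalog shape, and the file paths are all assumptions for illustration, not Kedro's actual internals:

```python
def fork_catalog(prod_catalog: dict, dev_catalog: dict, from_nodes: list) -> dict:
    """Build the catalog for a forked run: the initial inputs of the
    starting nodes resolve to their 'prod' definitions, while every other
    dataset keeps its 'dev' definition. Catalogs here are plain
    dataset-name -> config dicts standing in for Kedro's DataCatalog."""
    forked = dict(dev_catalog)   # default: everything from dev
    for name in from_nodes:      # initial reads come from prod instead
        if name in prod_catalog:
            forked[name] = prod_catalog[name]
    return forked

prod = {"a": {"filepath": "s3://prod/a.csv"}, "b": {"filepath": "s3://prod/b.csv"}}
dev = {"a": {"filepath": "data/a.csv"}, "b": {"filepath": "data/b.csv"},
       "c": {"filepath": "data/c.csv"}}

# 'a' and 'b' are read from prod; intermediate 'c' stays in dev.
catalog = fork_catalog(prod, dev, from_nodes=["a", "b"])
```

The key design point: only the entry datasets are redirected, so all intermediate and output writes land in the developer's own environment.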
Dataset Copying Command:
Example: kedro copy --datasets a,b,c --from prod --to dev
This would explicitly copy the specified datasets from the 'prod' environment to the 'dev' environment before running the pipeline.
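In miniature, the copy command is just "load from the source environment, save into the target". The `InMemoryCatalog` class below is a hypothetical stand-in for Kedro's `DataCatalog` (load/save only), not a real Kedro type:

```python
class InMemoryCatalog:
    """Toy stand-in for Kedro's DataCatalog, supporting load and save."""
    def __init__(self, data=None):
        self._data = dict(data or {})

    def load(self, name):
        return self._data[name]

    def save(self, name, value):
        self._data[name] = value

def copy_datasets(names, source, target):
    """`kedro copy --datasets a,b --from prod --to dev`, conceptually:
    load each named dataset from the source catalog, save to the target."""
    for name in names:
        target.save(name, source.load(name))

prod = InMemoryCatalog({"a": [1, 2], "b": [3], "c": [4]})
dev = InMemoryCatalog()
copy_datasets(["a", "b"], prod, dev)  # 'c' is deliberately left alone
```

A real implementation would go through each dataset's configured I/O so that, e.g., an S3 CSV in prod is materialised as a local CSV in dev.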
Inverse Tag Filtering:
Example: kedro run --without-tags tag1,tag2
This would exclude nodes carrying any of the given tags, the inverse of the existing --tags option.
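The filtering itself is straightforward set logic. A sketch, with nodes modelled as (name, tags) pairs rather than real Kedro `Node` objects:

```python
def filter_without_tags(nodes, excluded_tags):
    """Keep only nodes that carry NONE of the excluded tags -- the inverse
    of selecting nodes by tag. `nodes` is a list of (name, tag-set) pairs."""
    excluded = set(excluded_tags)
    return [name for name, tags in nodes if not excluded & set(tags)]

nodes = [
    ("clean", {"tag1"}),
    ("train", {"tag2", "ml"}),
    ("report", {"viz"}),
]
filter_without_tags(nodes, ["tag1", "tag2"])  # -> ["report"]
```

In Kedro itself this would presumably live next to the existing tag-selection logic on `Pipeline`, mirroring how `--tags` is resolved today.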
Use Case:
Allow developers to run nodes X>Y>Z with real production data on their own machines.
Initial reads come from the production environment, but intermediate data is stored in the developer's environment.
Helps maintain data consistency across team members without overwriting each other's data or requiring manual copying of intermediate results.
Current Limitations:
Data must be copied between environments manually to run "on prod data but in the dev env".
Difficulty in working with the latest production data without interfering with other developers' work.
Potential Implementation:
Extend the CLI or implement a hook to support these features.
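On the CLI side, the new flags would parse along these lines. Kedro's real CLI is built on click; argparse is used here only to keep the sketch stdlib-only, and the flag names simply follow the proposal above:

```python
import argparse

# Hypothetical flag set for `kedro run`, per the proposal.
parser = argparse.ArgumentParser(prog="kedro run")
parser.add_argument("--from-nodes", type=lambda s: s.split(","),
                    help="comma-separated starting nodes")
parser.add_argument("--fork-from",
                    help="environment to read initial datasets from")
parser.add_argument("--env",
                    help="environment to execute and write in")
parser.add_argument("--without-tags", type=lambda s: s.split(","),
                    default=[], help="exclude nodes carrying these tags")

args = parser.parse_args(
    ["--from-nodes", "a,b,c", "--fork-from", "prod", "--env", "dev"]
)
```

The parsed values would then feed the catalog-forking step before the run, so the feature could live entirely in the CLI/hook layer without touching the runner.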