Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
We're considering adding a feature to allow more flexible configuration of data sources across environments. The primary use case is to enable testing part of the pipeline using production data without needing to copy data manually. Thought I'd share here to see if others find this useful as well.
Proposed Features:
Environment Forking Flag:
Example: kedro run --from-nodes a,b,c --fork-from prod --env dev
This would read initial datasets from the 'prod' environment and then execute the rest of the pipeline in the 'dev' environment.
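To make the intended semantics concrete, here is a minimal sketch of the forking logic as a plain catalog-merge. The function name, the dict-based catalog shape, and the file paths are all assumptions for illustration, not Kedro's actual internals:

```python
def fork_catalog(prod_catalog: dict, dev_catalog: dict, from_nodes: list) -> dict:
    """Build the catalog for a forked run: the initial inputs of the
    starting nodes resolve to their 'prod' definitions, while every other
    dataset keeps its 'dev' definition. Catalogs here are plain
    dataset-name -> config dicts standing in for Kedro's DataCatalog."""
    forked = dict(dev_catalog)   # default: everything from dev
    for name in from_nodes:      # initial reads come from prod instead
        if name in prod_catalog:
            forked[name] = prod_catalog[name]
    return forked

prod = {"a": {"filepath": "s3://prod/a.csv"}, "b": {"filepath": "s3://prod/b.csv"}}
dev = {"a": {"filepath": "data/a.csv"}, "b": {"filepath": "data/b.csv"},
       "c": {"filepath": "data/c.csv"}}

# 'a' and 'b' are read from prod; intermediate 'c' stays in dev.
catalog = fork_catalog(prod, dev, from_nodes=["a", "b"])
```

The key design point: only the entry datasets are redirected, so all intermediate and output writes land in the developer's own environment.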
Dataset Copying Command:
Example: kedro copy --datasets a,b,c --from prod --to dev
This would explicitly copy the specified datasets from the 'prod' environment to the 'dev' environment before running the pipeline.
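In miniature, the copy command is just "load from the source environment, save into the target". The `InMemoryCatalog` class below is a hypothetical stand-in for Kedro's `DataCatalog` (load/save only), not a real Kedro type:

```python
class InMemoryCatalog:
    """Toy stand-in for Kedro's DataCatalog, supporting load and save."""
    def __init__(self, data=None):
        self._data = dict(data or {})

    def load(self, name):
        return self._data[name]

    def save(self, name, value):
        self._data[name] = value

def copy_datasets(names, source, target):
    """`kedro copy --datasets a,b --from prod --to dev`, conceptually:
    load each named dataset from the source catalog, save to the target."""
    for name in names:
        target.save(name, source.load(name))

prod = InMemoryCatalog({"a": [1, 2], "b": [3], "c": [4]})
dev = InMemoryCatalog()
copy_datasets(["a", "b"], prod, dev)  # 'c' is deliberately left alone
```

A real implementation would go through each dataset's configured I/O so that, e.g., an S3 CSV in prod is materialised as a local CSV in dev.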
Inverse Tag Filtering:
Example: kedro run --without-tags tag1,tag2
This would exclude nodes carrying any of the given tags, the inverse of the existing --tags option.
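The filtering itself is straightforward set logic. A sketch, with nodes modelled as (name, tags) pairs rather than real Kedro `Node` objects:

```python
def filter_without_tags(nodes, excluded_tags):
    """Keep only nodes that carry NONE of the excluded tags -- the inverse
    of selecting nodes by tag. `nodes` is a list of (name, tag-set) pairs."""
    excluded = set(excluded_tags)
    return [name for name, tags in nodes if not excluded & set(tags)]

nodes = [
    ("clean", {"tag1"}),
    ("train", {"tag2", "ml"}),
    ("report", {"viz"}),
]
filter_without_tags(nodes, ["tag1", "tag2"])  # -> ["report"]
```

In Kedro itself this would presumably live next to the existing tag-selection logic on `Pipeline`, mirroring how `--tags` is resolved today.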
Use Case:
Allow developers to run nodes X>Y>Z with real production data on their own machines.
Initial reads come from the production environment, but intermediate data is stored in the developer's environment.
Helps maintain data consistency across team members without overwriting each other's data or requiring manual copying of intermediate results.
Current Limitations:
Data must be copied between environments manually to run "on prod data but in the dev env".
Difficulty in working with the latest production data without interfering with other developers' work.
Potential Implementation:
Extend the CLI or implement a hook to support these features.
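On the CLI side, the new flags would parse along these lines. Kedro's real CLI is built on click; argparse is used here only to keep the sketch stdlib-only, and the flag names simply follow the proposal above:

```python
import argparse

# Hypothetical flag set for `kedro run`, per the proposal.
parser = argparse.ArgumentParser(prog="kedro run")
parser.add_argument("--from-nodes", type=lambda s: s.split(","),
                    help="comma-separated starting nodes")
parser.add_argument("--fork-from",
                    help="environment to read initial datasets from")
parser.add_argument("--env",
                    help="environment to execute and write in")
parser.add_argument("--without-tags", type=lambda s: s.split(","),
                    default=[], help="exclude nodes carrying these tags")

args = parser.parse_args(
    ["--from-nodes", "a,b,c", "--fork-from", "prod", "--env", "dev"]
)
```

The parsed values would then feed the catalog-forking step before the run, so the feature could live entirely in the CLI/hook layer without touching the runner.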