kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Inconsistency when setting version via `versioned` flag and dataset parameter #4326

Open ElenaKhaustova opened 1 week ago

ElenaKhaustova commented 1 week ago

Description

Currently, we have several options to mark a dataset as versioned.

Option 1 - set versioned: true via configuration

regressor:
  type: pickle.PickleDataset
  filepath: data/06_models/regressor.pickle
  versioned: true

Option 2 - pass version object to dataset constructor

from kedro.io import Version
from kedro_datasets.pandas import ExcelDataset

version = Version(
    load="load_version.csv",  # load exact version
    save="save_version.csv",  # save to exact version
)

test_dataset = ExcelDataset(
    filepath="data/01_raw/shuttles.xlsx",
    load_args={"engine": "openpyxl"},
    version=version,
)

Our KedroDataCatalog.from_config method allows passing load_versions and save_version: https://github.com/kedro-org/kedro/blob/075d59b1776c585698c677ec3619bc30b15ea8bc/kedro/io/kedro_data_catalog.py#L267

@classmethod
def from_config(
    cls,
    catalog: dict[str, dict[str, Any]] | None,
    credentials: dict[str, dict[str, Any]] | None = None,
    load_versions: dict[str, str] | None = None,
    save_version: str | None = None,
) -> KedroDataCatalog:

However, a version is only set if the versioned flag is True: https://github.com/kedro-org/kedro/blob/075d59b1776c585698c677ec3619bc30b15ea8bc/kedro/io/core.py#L542. Otherwise, the load and save versions passed in are ignored.

So Option 3 is to set the version via KedroDataCatalog.from_config, which requires both the versioned flag and load_versions/save_version to be set.
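To make the behaviour concrete, here is a minimal, self-contained sketch. It does not import Kedro: `resolve_version` is a hypothetical helper that only mimics the condition linked above, and the `Version` tuple is a simplified stand-in for Kedro's own class.

```python
from __future__ import annotations

from typing import Any, NamedTuple


class Version(NamedTuple):
    """Simplified stand-in for Kedro's Version (assumption for this sketch)."""

    load: str | None
    save: str | None


def resolve_version(
    config: dict[str, Any],
    load_version: str | None = None,
    save_version: str | None = None,
) -> Version | None:
    """Hypothetical helper: build a version only if `versioned` is true."""
    if config.get("versioned", False):
        return Version(load_version, save_version)
    # Without the flag, the requested versions are silently dropped.
    return None


unflagged = {
    "type": "pickle.PickleDataset",
    "filepath": "data/06_models/regressor.pickle",
}
flagged = {**unflagged, "versioned": True}

# Same load version requested in both cases; only the flagged config keeps it.
ignored = resolve_version(unflagged, load_version="2024-01-01T00.00.00.000Z")
applied = resolve_version(flagged, load_version="2024-01-01T00.00.00.000Z")
```

In this model, `ignored` ends up as `None` even though a load version was requested, which is the surprise the issue describes.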

Context

  1. First of all, it's confusing to have three different ways of setting the version.
  2. The load_versions/save_version parameters are ignored when creating a catalog via KedroDataCatalog.from_config if the versioned flag is not set.
  3. It's impossible to set the versioned flag on a dataset object and set load_versions/save_version via config.
  4. Some datasets set the versioned flag themselves when a version is provided, but most don't: https://github.com/kedro-org/kedro/blob/075d59b1776c585698c677ec3619bc30b15ea8bc/kedro/io/cached_dataset.py#L89
  5. The above problem introduces corner cases that make it harder to implement the serialization/deserialization feature https://github.com/kedro-org/kedro/issues/3932 for the catalog.

Possible Implementation

  1. Consider removing the versioned flag.
  2. Allow setting a version based on the load_versions and/or save_version provided.

Possible Alternatives

Implement only step two as a temporary solution, which avoids a breaking change.
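As a rough illustration of this alternative, the sketch below keeps the versioned flag working but also enables versioning whenever a load or save version is passed. It is a model of the proposal under stated assumptions, not Kedro code: `resolve_version_step_two` and the `Version` tuple are hypothetical names introduced here.

```python
from __future__ import annotations

from typing import Any, NamedTuple


class Version(NamedTuple):
    """Simplified stand-in for Kedro's Version (assumption for this sketch)."""

    load: str | None
    save: str | None


def resolve_version_step_two(
    config: dict[str, Any],
    load_version: str | None = None,
    save_version: str | None = None,
) -> Version | None:
    """Hypothetical helper: the flag OR an explicit version enables versioning."""
    flag_set = bool(config.get("versioned", False))
    version_given = load_version is not None or save_version is not None
    if flag_set or version_given:
        return Version(load_version, save_version)
    return None


plain = {
    "type": "pickle.PickleDataset",
    "filepath": "data/06_models/regressor.pickle",
}

# A save version alone is now enough to get a versioned dataset.
from_versions = resolve_version_step_two(plain, save_version="2024-01-01T00.00.00.000Z")
# Existing configs that rely on the flag keep working unchanged.
from_flag = resolve_version_step_two({**plain, "versioned": True})
# No flag and no versions: the dataset stays unversioned.
unversioned = resolve_version_step_two(plain)
```

Because configs that only set the flag still resolve to a version, this variant would not break existing catalogs in this model.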

ElenaKhaustova commented 1 week ago

For now, we solved the problem for https://github.com/kedro-org/kedro/issues/4329 by adding logic that updates VERSIONED_FLAG_KEY if a version is provided.

ElenaKhaustova commented 1 week ago

We'll keep this issue open until we decide whether to fix it within the Dataset Versioning workstream.