kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Configurable versioning #2355

Open merelcht opened 1 year ago

merelcht commented 1 year ago

Converted PR https://github.com/kedro-org/kedro/pull/1871 into this issue, to continue the discussion after the PR is closed.

Description

This PR aims to add more customization for VersionedDataSets. There are three main additions: custom format versioning, a customizable version class, and partial timestamp parsing.

Motivation

Because Kedro can only version datasets using a predefined path, a data history structure generated by earlier code that did not use Kedro would have to be refactored unnecessarily. I first tried another approach using PartitionedDataSets, but its logic is hard to maintain and is syntactically at odds with Kedro's declarative YAML approach. For this reason, I wrote this PR to help turn this need into a feature.

Custom format

The first addition enables the use of format codes in the filepath in order to change the default target path of the versioned file.

cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/car_data/%Y/%m/%d/car_data.csv
  versioned: true

The dataset in the example above would be saved to data/01_raw/company/car_data/2022/09/25/car_data.csv if today's date were 2022-09-25.
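As a rough illustration of this expansion (not the PR's actual code), standard `strftime` copies non-format characters through unchanged, so the whole filepath can serve as the format string. The helper name `expand_filepath` is hypothetical.

```python
from datetime import datetime


def expand_filepath(template: str, when: datetime) -> str:
    """Expand strftime format codes embedded in a filepath template."""
    # strftime only substitutes the %-codes; ordinary characters such as
    # slashes and file names pass through untouched.
    return when.strftime(template)


template = "data/01_raw/company/car_data/%Y/%m/%d/car_data.csv"
path = expand_filepath(template, datetime(2022, 9, 25))
print(path)  # data/01_raw/company/car_data/2022/09/25/car_data.csv
```

One consequence of this approach is that a literal percent sign in a path would have to be escaped as `%%`.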

Partial timestamp

To simplify loading custom versioned datasets, a partially specified timestamp is also accepted.

kedro run --load-version "cars:2022-09-25"

or

catalog.load("cars", "2022-09-25")

This is now a possible way of selecting the load version.
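One plausible reading of partial timestamps is a prefix match against the versions that exist on disk. The sketch below assumes that interpretation; the PR may resolve partial timestamps differently, and `resolve_partial_version` is a hypothetical helper.

```python
from typing import Iterable, Optional


def resolve_partial_version(partial: str, available: Iterable[str]) -> Optional[str]:
    """Return the most recent version string that starts with `partial`."""
    matches = [v for v in available if v.startswith(partial)]
    # Timestamps in this fixed-width format sort chronologically as plain
    # strings, so max() picks the latest match.
    return max(matches) if matches else None


versions = [
    "2022-09-24T08.00.00.000Z",
    "2022-09-25T09.30.00.000Z",
    "2022-09-25T17.45.12.345Z",
]
latest = resolve_partial_version("2022-09-25", versions)
print(latest)  # 2022-09-25T17.45.12.345Z
```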

Custom version class

If the custom date format is not enough to implement the versioning logic, then the user can subclass the Version class in order to override the default parse and unparse behaviour of the timestamps. For example, let's say you want to represent the day as the Sunday of the week every time you run the code. For that, you could do something like this:

# settings.py
# sunday_version.py
from kedro.io import Version, ProxyDateTime
from datetime import timedelta

class SundayVersion(Version):
    def tosunday(self, version: ProxyDateTime) -> ProxyDateTime:
        dt = version.datetime
        dt = dt - timedelta((dt.weekday() + 1) % 7)
        return ProxyDateTime.from_datetime(dt)

    def parse(self, version_str: str) -> ProxyDateTime:
        date_time = super().parse(version_str)
        return self.tosunday(date_time)

VERSION_CLASS = SundayVersion
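The date arithmetic in `tosunday` can be checked in isolation with the standard library. This sketch uses plain `date` objects instead of the PR's `ProxyDateTime`, and the function name `to_sunday` is only illustrative.

```python
from datetime import date, timedelta


def to_sunday(day: date) -> date:
    """Snap a date back to the most recent Sunday (Sundays map to themselves)."""
    # date.weekday() is 0 for Monday through 6 for Sunday, so
    # (weekday + 1) % 7 is the number of days elapsed since the last Sunday.
    return day - timedelta(days=(day.weekday() + 1) % 7)


print(to_sunday(date(2022, 9, 28)))  # 2022-09-25 (Wednesday -> previous Sunday)
print(to_sunday(date(2022, 9, 25)))  # 2022-09-25 (already a Sunday)
```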

Development notes

Version class

Instead of being a namedtuple, Version is now a full class that parses and unparses filepaths, taking over the part of AbstractDataSet that formerly turned timestamps into paths. This was done to enable custom version manipulation logic.

Kept the original behaviour

The default versioning behaviour was kept by using the new auxiliary methods is_custom_format and is_unique_date_format of AbstractDataSet.

UnknownDateTime

This class was implemented because of the mocks ['first', 'second'] in unit tests. I'm not sure if these non-timestamp formats were only designed for testing or if they are actual features. If it is only used for testing, this class and its handling logic in Version._safe_parse method can be removed, but the unit tests may need to be changed.

Custom Version class demands paying attention

Even though a custom Version class can be specified, its parsing, unparsing, and glob methods must be implemented carefully so as not to break the internal versioning logic. For instance, the example above would be considered unique by is_unique_date_format if it implements all ISO format codes. However, because it changes the %d behaviour, it shouldn't be considered unique. There is a workaround for this problem in the docs, but it is something the user has to pay attention to. Also, because unparsing is called multiple times inside the code, the pattern can't be easily manipulated. For example, if the user wants unparse to always add the date at the end of the filepath, they have to be careful not to add it multiple times (because of the internal logic). These are some examples of the limitations of this setting.
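The "don't add the date multiple times" concern amounts to making the transformation idempotent. A minimal sketch of that idea follows, using a hypothetical helper `append_date_once` that is unrelated to the PR's real unparse method.

```python
from pathlib import PurePosixPath


def append_date_once(filepath: str, stamp: str) -> str:
    """Append `_<stamp>` before the file extension unless it is already there."""
    path = PurePosixPath(filepath)
    suffix = f"_{stamp}"
    if path.stem.endswith(suffix):
        # Already stamped: a repeated call leaves the path unchanged.
        return filepath
    return str(path.with_name(f"{path.stem}{suffix}{path.suffix}"))


once = append_date_once("data/01_raw/car_data.csv", "2022-09-25")
twice = append_date_once(once, "2022-09-25")
print(once)           # data/01_raw/car_data_2022-09-25.csv
print(once == twice)  # True
```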

Note: This customization of the datetime logic is very important for the use case I intend to use. I need the exact behaviour of the example, haha.

Unit tests

Wrote unit tests for all kedro.io.date_time classes and their methods, aiming to reproduce the expectations of their callers in other parts of the code.

Wrote unit tests in test_data_catalog for the new warnings and to check that files created by datasets using custom versioning load and save correctly.

None of the already present tests were changed in order to make sure the default behaviour was preserved.

fazilhero commented 1 year ago

I would be in favor of a custom version class, configured in settings like many other options are at the moment. That is the only area where I saw Kedro wasn't compatible with our own internal tooling. Would be great to see this in action!

astrojuanlu commented 1 year ago

A user asks whether there's a way to timestamp datasets according to when the kedro run is launched and not when the dataset is written, or in other words, hardcode the timestamp https://linen-slack.kedro.org/t/16016262/dear-kedro-team-is-there-a-canonical-kedro-way-to-timestamp-#24683f38-88ce-45c3-93b7-ca56ea4e0508

xref #2694 and potentially #1731

jasonmhite commented 9 months ago

A user asks whether there's a way to timestamp datasets according to when the kedro run is launched and not when the dataset is written, or in other words, hardcode the timestamp https://linen-slack.kedro.org/t/16016262/dear-kedro-team-is-there-a-canonical-kedro-way-to-timestamp-#24683f38-88ce-45c3-93b7-ca56ea4e0508

I'd agree that this would be a very nice improvement; we would generally prefer that the timestamps for all outputs be the timestamp of the initial run command.

Perhaps this isn't the right thread but I'd like to inject that it'd be nice to include the possibility to not strictly track versions based on the date. Right now my team has been discussing wanting to be able to organize versions by some sort of unique short identifier e.g. akin to git short hashes or unique word phrases instead of the dates in the filename. I've been reading through the Kedro source pondering how to go about this but not having much luck so far.
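For illustration only, a non-date identifier could be generated with the standard library and dropped into the `<filepath>/<version>/<filename>` directory layout that Kedro's versioned datasets use. Both helper names below are hypothetical, and this does not show how to plug such identifiers into Kedro itself.

```python
import uuid
from pathlib import PurePosixPath


def short_version_id(length: int = 7) -> str:
    """Return a random hex identifier, similar in shape to a git short hash."""
    return uuid.uuid4().hex[:length]


def versioned_path(filepath: str, version: str) -> str:
    """Build <filepath>/<version>/<filename> for a given version label."""
    path = PurePosixPath(filepath)
    return str(path / version / path.name)


version = short_version_id()
print(versioned_path("data/01_raw/cars.csv", version))
# e.g. data/01_raw/cars.csv/3f9a1c2/cars.csv (the identifier is random)
```

Random identifiers carry no ordering, so any "load the latest version" logic that relies on sorting version names would need a separate index or timestamp.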

astrojuanlu commented 9 months ago

I've been reading through the Kedro source pondering how to go about this but not having much luck so far.

It's mostly here:

https://github.com/kedro-org/kedro/blob/7384abd4074caeb3f5bb14a409f31c3f951ab9f1/kedro/io/core.py#L580-L586

and you'll see it's hardcoded to generate a timestamp:

https://github.com/kedro-org/kedro/blob/7384abd4074caeb3f5bb14a409f31c3f951ab9f1/kedro/io/core.py#L557-L562
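A stdlib-only sketch of what a hardcoded timestamp generator of this kind looks like, assuming the YYYY-MM-DDThh.mm.ss.sssZ format quoted elsewhere in this thread. The linked source is authoritative and may differ in detail.

```python
from datetime import datetime, timezone

# Assumed format string; %f produces six microsecond digits.
VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"


def generate_timestamp() -> str:
    """Format the current UTC time as YYYY-MM-DDThh.mm.ss.sssZ."""
    stamp = datetime.now(tz=timezone.utc).strftime(VERSION_FORMAT)
    # Drop the last three microsecond digits to keep milliseconds,
    # then re-attach the trailing "Z".
    return stamp[:-4] + stamp[-1:]


timestamp = generate_timestamp()
print(timestamp)  # e.g. 2024-07-31T14.56.33.314Z
```

Because the function takes no arguments and reads the clock directly, there is no hook for substituting a short hash or any other label, which is the limitation discussed above.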

astrojuanlu commented 9 months ago

Allowing for custom versions would open the gates for a Kedro + DVC integration (https://github.com/kedro-org/kedro/issues/2691); lots of people have asked about this.

astrojuanlu commented 3 months ago

Had an idea today and got quite close to being able to configure the versioning using custom resolvers.

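# settings.py
import datetime as dt

from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "base_env": "base",
    "default_run_env": "local",
    "custom_resolvers": {
        # Zero-argument resolver: "${now:}" expands to the current timestamp
        "now": lambda: dt.datetime.now().strftime("%Y-%m-%dT%H.%M.%S.%fZ")
    }
}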

# datasets.py
from typing import Any

import polars as pl
from kedro.io import AbstractDataset

class SimpleCSVPolarsDataset(AbstractDataset):
    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> pl.DataFrame:
        return pl.read_csv(self._filepath)

    def _save(self, data: pl.DataFrame) -> None:
        data.write_csv(self._filepath)

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath}

# catalog.yml
test_csv_dataset:
  type: kedro_pypi_monitor.datasets.SimpleCSVPolarsDataset
  filepath: data/02_intermediate/pypi_kedro_demo_${now:}.csv
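Relating this to the earlier request for a single run-level timestamp: if the resolver is memoized, every interpolation in the same process gets the same value regardless of how often it is called. This is a sketch under that assumption, not something proposed in the thread. It uses UTC so that the trailing "Z" is accurate, whereas the lambda above uses local time. `run_timestamp` and `CUSTOM_RESOLVERS` are hypothetical names.

```python
import datetime as dt
from functools import lru_cache


@lru_cache(maxsize=None)
def run_timestamp() -> str:
    """Return one timestamp per process, computed on the first call."""
    return dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")


# Hypothetical settings.py fragment: pass the function instead of the lambda.
CUSTOM_RESOLVERS = {"now": run_timestamp}

first = run_timestamp()
second = run_timestamp()
print(first == second)  # True: later calls reuse the cached value
```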

Usage:

In [1]: from kedro.io import DataCatalog
   ...: from kedro.config import OmegaConfigLoader
   ...: 
   ...: import polars as pl
   ...: 
   ...: config_loader = OmegaConfigLoader(
   ...:     conf_source="conf",
   ...:     base_env="base",
   ...:     default_run_env="local",
   ...: )
   ...: catalog = DataCatalog.from_config(config_loader.get("catalog"))

In [2]: df = pl.DataFrame(...)

In [3]: catalog.save("test_csv_dataset", df)
[07/31/24 14:56:38] INFO     Saving data to test_csv_dataset (SimpleCSVPolarsDataset)...

In [4]: !tree data/
data/
└── 02_intermediate
    └── pypi_kedro_demo_2024-07-31T14.56.33.314040Z.csv

Caveats:

test_csv_dataset:
  type: kedro_pypi_monitor.datasets.SimpleCSVPolarsDataset
  filepath: data/02_intermediate/pypi_kedro_demo_${now:${runtime_params:test_csv_dataset,''}}.csv
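With the nested interpolation above, the `now` resolver presumably receives one argument (the runtime parameter, or the empty-string default), so a zero-argument lambda would no longer fit. A minimal sketch of a resolver that accepts such an override, written as an assumption about how this could be handled:

```python
import datetime as dt


def now(override: str = "") -> str:
    """Return `override` if it is non-empty, otherwise a fresh timestamp."""
    if override:
        # A specific version was requested, e.g. via a runtime parameter.
        return override
    return dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")


pinned = now("2024-07-31T14.56.33.314040Z")  # as if a runtime parameter was supplied
fresh = now("")  # as if the '' default was used
print(pinned)  # 2024-07-31T14.56.33.314040Z
```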

datajoely commented 3 months ago

At this point - shouldn't we just push people towards formats like Iceberg?

datajoely commented 3 months ago

I should say your solution is neat and elegant... but do we need to expose this to the user?

astrojuanlu commented 3 months ago

At this point - shouldn't we just push people towards formats like Iceberg?

I've found that Data Scientists tend to prefer this no-frills versioning; many folks don't even set up something like MLflow for their local experiment tracking.

OTOH, Delta and Iceberg are perfectly supported through Polars, probably Pandas too. So the option already exists, I've been documenting it in my talks & workshops, and it's a matter of adding that to the docs.

The point is (and this is something @iamelijahko is researching at the moment): given that Delta & Iceberg exist (with versioning, time travel, automatic garbage collection) and also that low-complexity, filename-based versioning is possible with OmegaConf resolvers, do we want to additionally keep maintaining our AbstractVersionedDatasets?

datajoely commented 3 months ago

I'm for anything that removes the AbstractVersionedDataset. I guess the flip side is: if we delegate the versioning to some other technology, how do we standardise the kedro run --load-versions=<dataset_name>:YYYY-MM-DDThh.mm.ss.sssZ functionality?
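To make the standardisation question concrete, here is a sketch of how `<dataset_name>:<version>` pairs could be parsed without assuming the version is a timestamp. `parse_load_versions` is a hypothetical helper, not Kedro's CLI code.

```python
from typing import Dict, Iterable


def parse_load_versions(specs: Iterable[str]) -> Dict[str, str]:
    """Turn ["name:version", ...] into {"name": "version", ...}."""
    parsed: Dict[str, str] = {}
    for spec in specs:
        # Split on the first colon only, so the version part may contain colons.
        name, sep, version = spec.partition(":")
        if not sep or not name or not version:
            raise ValueError(f"Expected '<dataset_name>:<version>', got {spec!r}")
        parsed[name] = version
    return parsed


load_versions = parse_load_versions(
    ["cars:2022-09-25T10.11.12.123Z", "boats:3f9a1c2"]
)
print(load_versions)
# {'cars': '2022-09-25T10.11.12.123Z', 'boats': '3f9a1c2'}
```

The parsing step itself is agnostic to the version format; what would need standardising is how each backend interprets the version string.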

astrojuanlu commented 3 months ago

I actually learned that kedro run --load-versions was a thing just yesterday while I was writing this comment. Wondering how many people use it. Will have a look at our telemetry.

datajoely commented 3 months ago

Yeah I suspect if you use versioning in the first place you either need this or DataCatalog.load({name}, version=...) to actually interrogate your work. A very quick scan shows it's baked into some of the deepest bits of Kedro:

astrojuanlu commented 1 month ago

A user asked about this exact approach https://kedro.hall.community/support-lY6wDVhxGXNY/pushing-files-to-s3-with-dynamic-names-FfCYxXyxTZF4