merelcht opened this issue 1 year ago (status: Open)
I would be in favor of a custom version class, configured in settings like many other options at the moment. That is the only area where I saw Kedro wasn't compatible with our own internal tooling. Would be great to see this in action!
A user asks whether there's a way to timestamp datasets according to when the kedro run is launched and not when the dataset is written, or in other words, hardcode the timestamp https://linen-slack.kedro.org/t/16016262/dear-kedro-team-is-there-a-canonical-kedro-way-to-timestamp-#24683f38-88ce-45c3-93b7-ca56ea4e0508
xref #2694 and potentially #1731
I'd agree that this would be a very nice improvement; we would generally prefer that all the timestamps for any outputs be the timestamp of the initial run command.
Perhaps this isn't the right thread, but I'd like to add that it would be nice to include the possibility of not strictly tracking versions by date. Right now my team has been discussing wanting to be able to organize versions by some sort of unique short identifier, e.g. akin to git short hashes or unique word phrases, instead of dates in the filename. I've been reading through the Kedro source pondering how to go about this but not having much luck so far.
It's mostly here, and you'll see it's hardcoded to generate a timestamp:
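For reference, the timestamp generation in `kedro.io.core` looks roughly like this (paraphrased from memory, so treat the details as approximate):

```python
from datetime import datetime, timezone

VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"


def generate_timestamp() -> str:
    # every save gets a fresh UTC timestamp; there is no hook to override this
    current_ts = datetime.now(tz=timezone.utc).strftime(VERSION_FORMAT)
    return current_ts[:-4] + current_ts[-1:]  # trim microseconds to milliseconds
```

Since the function takes no arguments and is called deep inside the versioning machinery, there's no obvious place for a user to inject their own version string.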
Allowing for custom versions would open the gates for a Kedro + DVC integration (https://github.com/kedro-org/kedro/issues/2691); lots of people have asked about this.
Had an idea today and got quite close to being able to configure the versioning using custom resolvers.
# settings.py
import datetime as dt

from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "base_env": "base",
    "default_run_env": "local",
    "custom_resolvers": {
        "now": lambda: dt.datetime.now().strftime("%Y-%m-%dT%H.%M.%S.%fZ")
    },
}
# datasets.py
from typing import Any

import polars as pl
from kedro.io import AbstractDataset

class SimpleCSVPolarsDataset(AbstractDataset):
    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> pl.DataFrame:
        return pl.read_csv(self._filepath)

    def _save(self, data: pl.DataFrame) -> None:
        data.write_csv(self._filepath)

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath}
# catalog.yml
test_csv_dataset:
  type: kedro_pypi_monitor.datasets.SimpleCSVPolarsDataset
  filepath: data/02_intermediate/pypi_kedro_demo_${now:}.csv
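For intuition, the `${now:}` interpolation behaves like a template substitution; here is a stdlib-only mimic (not the real OmegaConf machinery, and the helper names are mine):

```python
import datetime as dt
import re


def now_resolver() -> str:
    # same format string as the custom resolver registered in settings.py
    return dt.datetime.now().strftime("%Y-%m-%dT%H.%M.%S.%fZ")


def resolve(template: str, resolvers: dict) -> str:
    # crude stand-in for OmegaConf's ${name:} interpolation
    # of zero-argument resolvers
    return re.sub(r"\$\{(\w+):\}", lambda m: resolvers[m.group(1)](), template)


filepath = resolve(
    "data/02_intermediate/pypi_kedro_demo_${now:}.csv", {"now": now_resolver}
)
```

Each time the template is resolved a fresh timestamp is produced, which is also the source of the load-time caveat discussed below.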
Usage:
In [1]: from kedro.io import DataCatalog
   ...: from kedro.config import OmegaConfigLoader
   ...:
   ...: import polars as pl
   ...:
   ...: config_loader = OmegaConfigLoader(
   ...:     conf_source="conf",
   ...:     base_env="base",
   ...:     default_run_env="local",
   ...: )
   ...: catalog = DataCatalog.from_config(config_loader.get("catalog"))

In [2]: df = pl.DataFrame(...)

In [3]: catalog.save("test_csv_dataset", df)
[07/31/24 14:56:38] INFO Saving data to test_csv_dataset (SimpleCSVPolarsDataset)...

In [4]: !tree data/
data/
└── 02_intermediate
    └── pypi_kedro_demo_2024-07-31T14.56.33.314040Z.csv
Caveats: the version is generated at save time, so _load has no way of knowing it (MetricsDataset famously doesn't even have _load). One workaround is to pass the version back in through runtime_params:

test_csv_dataset:
  type: kedro_pypi_monitor.datasets.SimpleCSVPolarsDataset
  filepath: data/02_intermediate/pypi_kedro_demo_${now:${runtime_params:test_csv_dataset,''}}.csv

or to handle it in the _save method itself.
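The nested `${now:${runtime_params:...}}` trick effectively turns the timestamp into an optional runtime parameter; a stdlib-only sketch of that logic (the function name and exact behaviour are my paraphrase, not Kedro code):

```python
import datetime as dt


def now(version: str = "") -> str:
    # reuse a version supplied at runtime (e.g. for loading);
    # otherwise mint a fresh timestamp (e.g. for saving)
    return version or dt.datetime.now().strftime("%Y-%m-%dT%H.%M.%S.%fZ")


now("2024-07-31T14.56.33.314040Z")  # returns the supplied version unchanged
```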
I should say your solution is neat and elegant... but do we need to expose this to the user?
At this point - shouldn't we just push people towards formats like Iceberg?
I've found that Data Scientists tend to prefer this no-frills versioning, many folks don't even set up something like MLflow for their local experiment tracking.
OTOH, Delta and Iceberg are perfectly supported through Polars, probably Pandas too. So the option already exists, I've been documenting it in my talks & workshops, and it's a matter of adding that to the docs.
The point is (and this is something @iamelijahko is researching at the moment): given that Delta & Iceberg exist (with versioning, time travel, automatic garbage collection) and also that low-complexity, filename-based versioning is possible with OmegaConf resolvers, do we want to additionally keep maintaining our AbstractVersionedDatasets?
I'm for anything that removes the AbstractVersionedDataset. I guess the flip side is: if we delegate the versioning to some other technology, how do we standardise the kedro run --load-versions=<dataset_name>:YYYY-MM-DDThh.mm.ss.sssZ functionality?
I actually learned that kedro run --load-versions was a thing just yesterday while I was writing this comment. Wondering how many people use it. Will have a look at our telemetry.
Yeah, I suspect if you use versioning in the first place you need either this or DataCatalog.load(name, version=...) to actually interrogate your work. A very quick scan shows it's baked into some of the deepest bits of Kedro:
A user asked about this exact approach https://kedro.hall.community/support-lY6wDVhxGXNY/pushing-files-to-s3-with-dynamic-names-FfCYxXyxTZF4
Converted PR https://github.com/kedro-org/kedro/pull/1871 into this issue, to continue the discussion after the PR is closed.
Description
This PR aims to add more customization for VersionedDataSets. There are three main additions made in this PR: custom format versioning, a customizable version class, and partial timestamp parsing.
Motivation
Because Kedro can only version datasets using a predefined path, a data history structure generated by previous code that wasn't using Kedro would have to be unnecessarily refactored. Because of that, I tried another approach using PartitionedDataSets, but its logic is not only hard to maintain but also syntactically at odds with Kedro's declarative YAML idea. For this reason, I wrote this PR to help turn this need into a feature.
Custom format
The first addition enables the use of format codes in the filepath in order to change the default target path of the versioned file.
The example dataset above would have been translated to data/01_raw/company/car_data/2022/09/25/car_data.csv if today's date was 2022-09-25.
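The PR's original catalog example isn't preserved in this thread, but the idea can be sketched with plain strftime format codes in a filepath template (a hypothetical illustration, not the PR's code):

```python
import datetime as dt

# hypothetical filepath template using strftime format codes,
# as the PR describes for custom format versioning
template = "data/01_raw/company/car_data/%Y/%m/%d/car_data.csv"

# expanding the codes for a given run date yields the versioned target path
path = dt.datetime(2022, 9, 25).strftime(template)
```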
Partial timestamp
In order to simplify loading custom versioned datasets, inputting a partially filled timestamp has also been implemented.
or
This is now a possible way of selecting the load version.
Custom version class
If the custom date format is not enough to implement the versioning logic, then the user can subclass the Version class in order to override the default parse and unparse behaviour of the timestamps. For example, let's say you want to represent the day as the Sunday of the week every time you run the code. For that, you could do something like this:
Development notes
Version class
Instead of using Version as a namedtuple, Version is now a complete class that helps to parse and unparse filepaths, taking over the part of AbstractDataSet that used to process timestamps into paths. This was developed to enable custom version manipulation logic.
Kept the original behaviour
The default versioning behaviour was kept, using the new auxiliary methods is_custom_format and is_unique_date_format of the AbstractDataSet.
UnknownDateTime
This class was implemented because of the mocks ['first', 'second'] in the unit tests. I'm not sure whether these non-timestamp formats were only designed for testing or whether they are actual features. If they are only used for testing, this class and its handling logic in the Version._safe_parse method can be removed, but the unit tests may need to be changed.
Custom Version class demands paying attention
Even though a custom Version class can be specified, its parsing, unparsing, and glob methods must be implemented safely in order not to break the internal versioning logic. For instance, the example described before would be considered unique by is_unique_date_format if it implements all ISO format codes. However, because it changes the %d behaviour, it shouldn't be considered unique. There is a workaround for this problem in the docs, but it is something the user has to pay attention to. Also, because unparsing is called multiple times inside the code, the pattern can't be easily manipulated. For example, if the user wants unparsing to always append the date to the filepath, they have to be careful not to add it multiple times (because of the internal logic). These are some examples of this setting's limitations.
Unit tests
Wrote unit tests for all kedro.io.date_time classes and their methods, aiming to reproduce their callers' expectations in other parts of the code.
Wrote unit tests in test_data_catalog to test the new warnings and to check that files created by datasets using custom versioning load and save correctly.
None of the already present tests were changed, in order to make sure the default behaviour was preserved.