kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.53k stars 877 forks source link

Exposing load/save version in hooks or included these information in logs? #1580

Open noklam opened 2 years ago

noklam commented 2 years ago

(Created by Nok, converted from Discord Discussion)

Desciption

A user want to have dataset load/save version logged, potentially like this

2022-05-30 10:01:09,039 - kedro.io.data_catalog - INFO - Loading data from `batch` (PartitionedDataSet) (version: 2022-05-15T05.24.31.017Z)
2022-05-30 10:01:18,467 - kedro.io.data_catalog - INFO - Saving data to `processed_batch` (ParquetDataSet)... (version: 2022-05-30T08.23.54.197Z)

This is not possible currently as kedro does not track this information, it should belongs to either

  1. Extra hook spec (after_dataset_loaded hook)
  2. Session store (experiment tracking)

How current load version is determined when version=None?

Currently, this load_version information is buried deep down in the framework, and it is determined only when a dataset is loaded at runtime. The details of how "latest" version is in a method called resolve_load_version, which further calls _fetch_latest_load_version

Further Studies

For some reason, resolve_load_version is being called twice, L594 seems to be a leftover from historical refactoring, need further confirmation. https://github.com/kedro-org/kedro/blob/b2e59facaa5f97f693287be590b4b4d297db344e/kedro/io/core.py#L558-L564

https://github.com/kedro-org/kedro/blob/b2e59facaa5f97f693287be590b4b4d297db344e/kedro/io/core.py#L594

The refactor PR is here: https://github.com/kedro-org/kedro/commit/f03226e29b8a018a0f6edab6d3f1a0d37c1b1812

antonymilne commented 2 years ago

FYI I think it's already possible to achieve this since save_version and load_versions are available in after_catalog_created. So you can shift that information into the right after_dataset_loaded etc. hooks and recreate the load_version logic by doing something like load_version = load_versions.get("dataset_name", save_version).

That's obviously not a good solution though, and it would be nice to have a better way of doing it. Currently we expose dataset_name and data in the dataset hooks. If we want to add load/save_version then I wonder whether we should just expose the whole catalog.datasets[dataset_name]. I've seen people asking for filepath in these hooks before, so it might be best just to make the whole thing available rather than adding more and more arguments piecemeal. (Like the discussion about whether we should expose context or config_loader in a hook)

noklam commented 2 years ago

Is the load version information when load_version = None where it would just look up the latest version?

antonymilne commented 2 years ago

Yeah, exactly. From memory that's what load_version does? I could be wrong though...

Edit: I'm wrong. Looks like load_version actually does a glob to look through files and pick out the latest version. That makes more sense than what I said before. Forget what I said 😀

noklam commented 2 years ago

That's what I found out, that information wasn't expose to the hook at all, what we got is None only. So I think it is almost impossible to implement a solution currently.

noklam commented 2 years ago

Sorry I have written a more detailed issue originally but Github project actually convert my issue to a blank page so I missed this when I recreated the issue😅

I think this information would be quite valuable to experiment tracking, data versioning is one of the key for reproducible experiment. If this info is stored in session store we can eventually reproduce an experiment with the session_id and extract all the dataset that it used exactly.

merelcht commented 2 years ago

Notes from Technical Design session:

It was agreed that the code for loading the latest data/fetching the load version needs further refactoring. Inside kedro-viz we have replicated some of the logic to load the latest data for experiment tracking, which should be improved as well. To be done in:

After the refactoring is done, we should looking into whether we should expose the load/save version information and how to best do that.

noklam commented 1 year ago

This wasn't completed in #1911, closed by mistake