Open noklam opened 2 years ago
FYI I think it's already possible to achieve this since save_version and load_versions are available in after_catalog_created. So you can shift that information into the right after_dataset_loaded etc. hooks and recreate the load_version logic by doing something like load_version = load_versions.get("dataset_name", save_version).
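A minimal sketch of the workaround described above, assuming hook implementations may declare only the arguments they need. The class name VersionLoggingHooks and the requested attribute are hypothetical, and the fallback to save_version simply mirrors the suggestion above; it may not match the version actually read from disk when no load version is pinned.

```python
import logging

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so the sketch can run without Kedro installed.
    def hook_impl(func):
        return func

logger = logging.getLogger(__name__)


class VersionLoggingHooks:
    """Hypothetical hook class that carries version info between hooks."""

    def __init__(self):
        self._save_version = None
        self._load_versions = {}
        # dataset name -> version this sketch believes was requested
        self.requested = {}

    @hook_impl
    def after_catalog_created(self, save_version, load_versions):
        # Stash the run-level version info for use in later hooks.
        self._save_version = save_version
        self._load_versions = dict(load_versions or {})

    @hook_impl
    def after_dataset_loaded(self, dataset_name, data):
        # Recreate the lookup suggested above (assumption: fall back
        # to save_version when no load version is pinned).
        load_version = self._load_versions.get(dataset_name, self._save_version)
        self.requested[dataset_name] = load_version
        logger.info("Loaded %s (requested version: %s)", dataset_name, load_version)
```

In a Kedro project, an instance of such a class would typically be registered through the HOOKS setting in settings.py.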
That's obviously not a good solution though, and it would be nice to have a better way of doing it. Currently we expose dataset_name and data in the dataset hooks. If we want to add load/save_version then I wonder whether we should just expose the whole catalog.datasets[dataset_name]. I've seen people asking for filepath in these hooks before, so it might be best just to make the whole thing available rather than adding more and more arguments piecemeal. (Like the discussion about whether we should expose context or config_loader in a hook.)
Is the load version information you're after for the case when load_version = None, where it would just look up the latest version?
Yeah, exactly. From memory that's what load_version does? I could be wrong though...
Edit: I'm wrong. Looks like load_version actually does a glob to look through files and pick out the latest version. That makes more sense than what I said before. Forget what I said 😀
That's what I found out: that information isn't exposed to the hook at all, and all we get is None. So I think it is almost impossible to implement a solution currently.
Sorry, I had originally written a more detailed issue, but the GitHub project converted my issue to a blank page, so I missed this when I recreated the issue 😅
I think this information would be quite valuable for experiment tracking; data versioning is one of the keys to reproducible experiments. If this info is stored in the session store, we can eventually reproduce an experiment with the session_id and extract exactly the datasets that it used.
Notes from Technical Design session:
It was agreed that the code for loading the latest data/fetching the load version needs further refactoring. Inside kedro-viz we have replicated some of the logic to load the latest data for experiment tracking, which should be improved as well. To be done in:
After the refactoring is done, we should look into whether we should expose the load/save version information and how best to do that.
This wasn't completed in #1911, closed by mistake
(Created by Nok, converted from Discord Discussion)
Description
A user wants to have the dataset load/save version logged, potentially like this
This is not possible currently as Kedro does not track this information; it should belong to either
How is the current load version determined when version=None?
Currently, this load_version information is buried deep down in the framework, and it is determined only when a dataset is loaded at runtime. The details of how the "latest" version is determined are in a method called resolve_load_version, which further calls _fetch_latest_load_version.
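To illustrate the kind of lookup described above, here is a rough standalone sketch of a glob-based "latest version" search. It is not Kedro's actual implementation: the function name find_latest_version is hypothetical, and it assumes versioned files are laid out as <filepath>/<version>/<filename> with timestamp-like version strings that sort lexicographically.

```python
from pathlib import Path
from typing import Optional


def find_latest_version(filepath: str) -> Optional[str]:
    """Return the newest version directory name under filepath, if any.

    Assumed layout (an assumption of this sketch):
        <filepath>/<version>/<filename>
    """
    base = Path(filepath)
    # Match every "<version>/<filename>" entry below the dataset path.
    candidates = sorted(
        (p for p in base.glob(f"*/{base.name}") if p.is_file()),
        reverse=True,  # timestamp-like versions: newest sorts first
    )
    if not candidates:
        return None
    # The version is the name of the directory containing the file.
    return candidates[0].parent.name
```

Because the answer depends on what is on disk at load time, a lookup like this can only be resolved at runtime, which is why the value is not known when the catalog is created.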
Further Studies
For some reason, resolve_load_version is being called twice. L594 seems to be a leftover from historical refactoring; this needs further confirmation.
https://github.com/kedro-org/kedro/blob/b2e59facaa5f97f693287be590b4b4d297db344e/kedro/io/core.py#L558-L564
https://github.com/kedro-org/kedro/blob/b2e59facaa5f97f693287be590b4b4d297db344e/kedro/io/core.py#L594
The refactor PR is here: https://github.com/kedro-org/kedro/commit/f03226e29b8a018a0f6edab6d3f1a0d37c1b1812