Closed antonymilne closed 2 years ago
This comment in Kedro-Viz makes me think that @limdauto was anticipating that run_id should not necessarily be the same as the timestamp?
Also from this Discord conversation with @shaunc on possible integration with DVC, my gut feeling is that saying run_id = session_id = save_version is maybe too restrictive and we should allow for controlling some (all?) of these independently.
I'm integrating kedro with DVC experiment tracking kedro-dvc (integration plans) -- I'm hoping that the kedro interface will support different experiment tracking and session management plugins.
It's great that you support SESSION_STORE_CLASS -- though I'm hoping for more detail on its interface! :)
VS ids -- I'd propose that the session store "rows" (metadata) include various fields, whose names and meanings can also be configurable. For instance:
- SESSION_ID_FIELD - field guaranteed to be unique over the session store
- SESSION_TIMESTAMP_FIELD and/or SESSION_ORDER_BY_FIELD -- the first for a timestamp, the last for display order in kedro-viz, for finding the most recent session, and maybe for deleting old sessions during garbage collection.
DVC uses git commit hashes for names. A timestamp isn't necessarily unique in distributed runs. However, you could have default config for these things all pointing to the timestamp field you are already using for convenience.
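To make the proposal concrete, here is a minimal sketch of what configurable field roles could look like. It is not an existing Kedro API: `SessionFieldConfig`, `latest_session`, `has_unique_ids`, and the `git_sha` column are all hypothetical names chosen for illustration, and the rows are made-up sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionFieldConfig:
    """Hypothetical mapping from roles to session-store field names.

    By default every role points at the existing timestamp field.
    """

    id_field: str = "timestamp"
    timestamp_field: str = "timestamp"
    order_by_field: str = "timestamp"


def latest_session(rows, config):
    """Return the most recent row according to the configured ordering field."""
    return max(rows, key=lambda row: row[config.order_by_field])


def has_unique_ids(rows, config):
    """Check that the configured ID field is unique over all rows."""
    ids = [row[config.id_field] for row in rows]
    return len(ids) == len(set(ids))


# Made-up rows: two distributed runs share a timestamp but not a commit hash.
rows = [
    {"timestamp": "2022-01-10T09.00.00.000Z", "git_sha": "a1b2c3d"},
    {"timestamp": "2022-01-10T09.00.00.000Z", "git_sha": "e4f5a6b"},
    {"timestamp": "2022-01-10T09.05.00.000Z", "git_sha": "0c9d8e7"},
]

default_config = SessionFieldConfig()
dvc_like_config = SessionFieldConfig(id_field="git_sha")
```

With the default config the timestamp doubles as the ID and collides for the two simultaneous runs; pointing only `id_field` at the commit hash restores uniqueness while ordering still uses the timestamp.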
The other thing I'd like to see made abstract is RunsRepository. Is this also going to migrate to kedro core? May I suggest that both this and the default session metadata store be moved to -- say -- a kedro-session plugin, which is included by default by the core, but, being a plugin, could be superseded by someone who, perhaps, wanted to use kedro-dvc instead? :)
Based on the discussion the Kedro team had on this topic on Monday, it was decided that a session can only ever have 1 run, and so the run_id is no longer needed. For the time being, the session_id and save_version will remain the same, but there is a possibility to allow users to add a custom save_version that is different from the session_id. We'll require user research to determine the best solution for allowing this customisation.
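As an illustration of the decision above, the following toy model shows one way "one run per session" could behave. It is a sketch only, not Kedro's implementation: `ToySession`, `SessionAlreadyRunError`, the optional `save_version` argument, and the timestamp format are assumptions made for this example.

```python
from datetime import datetime, timezone
from typing import Optional


class SessionAlreadyRunError(RuntimeError):
    """Hypothetical error raised when a session is run a second time."""


class ToySession:
    """Toy stand-in for a session that allows exactly one run."""

    def __init__(self, save_version: Optional[str] = None):
        # Assumed timestamp format; the session ID is generated once, at creation.
        self.session_id = datetime.now(tz=timezone.utc).strftime(
            "%Y-%m-%dT%H.%M.%S.%fZ"
        )
        # By default the save version equals the session ID; a custom value
        # models the possible future customisation mentioned above.
        self.save_version = save_version or self.session_id
        self._run_called = False

    def run(self) -> str:
        """Perform the single allowed run and return the save version used."""
        if self._run_called:
            raise SessionAlreadyRunError(
                "This session has already completed a run; create a new session."
            )
        self._run_called = True
        return self.save_version
```

Under this model there is no separate run_id to track: a second `run()` on the same object fails, so anyone who wants another run creates another session and gets a fresh session_id.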
See more details in https://github.com/kedro-org/kedro/issues/1335
All tasks have been completed.
(transfer from Jira, created by @lorenabalan)
This may be acceptable behaviour, but we should document it better in that case.
Working on this PR (see discussion in the thread too), I discovered that in the new session model, users can't set their own custom run_id, or use a different way to generate a save version (for example use a different format). They can modify KedroContext._get_save_version() or KedroContext._get_run_id() but that's not what will be stored in the session store - instead it will contain only the values equivalent to session_id.
When a session is created, a session_id is generated (timestamp) and written to the store. During a session run, that session_id is loaded from the store and is used for both save_version and run_id, and it'll be the same timestamp every time, i.e.
1 session ↔️ 1 run ↔️ 1 run_id=session_id=save_version
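The behaviour described above can be mimicked with a small self-contained model. This is not Kedro code: `ToyContext`, `CustomContext`, `ToySession`, the dict-based store, and the method signatures are simplified assumptions that only borrow the method names from the description.

```python
from datetime import datetime, timezone


def generate_session_id() -> str:
    """Return a timestamp string (the format is an assumption for this sketch)."""
    return datetime.now(tz=timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")


class ToyContext:
    """Toy context: by default both values are derived from the session ID."""

    def _get_save_version(self, session_id: str) -> str:
        return session_id

    def _get_run_id(self, session_id: str) -> str:
        return session_id


class CustomContext(ToyContext):
    """A user override that produces a custom run ID."""

    def _get_run_id(self, session_id: str) -> str:
        return "my-custom-run"


class ToySession:
    def __init__(self, context: ToyContext):
        self.context = context
        # The session ID is generated once and written to the store.
        self.store = {"session_id": generate_session_id()}

    def run(self) -> dict:
        # Every run loads the same session ID back from the store.
        session_id = self.store["session_id"]
        return {
            "save_version": self.context._get_save_version(session_id),
            "run_id": self.context._get_run_id(session_id),
        }


session = ToySession(CustomContext())
first, second = session.run(), session.run()
```

In this model the override changes the value used during the run, but the store still holds only session_id, and repeated runs reuse the same timestamp as save_version.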
If that's the case, as Antony pointed out, what's the optimal behaviour in this case:
Note this isn't as strange a thing to do as it might initially seem. In Jupyter, a user could well call session.run multiple times.
This ticket includes both designing and implementing the solution. Set up a team discussion to go through the design suggestion.
Following the discussion the team had about this issue these tasks need to be done:
After 0.18.0 has been released: