Metadata for data tracing

initze / thaw-slump-segmentation

MIT License


Metadata for data tracing #1

Open khdlr opened 3 years ago

khdlr commented 3 years ago

For better traceability of the output data, I propose introducing some kind of metadata for the inference datasets. Any machine-readable format would be fine. I can look into existing standards for that.

This would include:

model + checkpoint
timestamp
processing version (info from git? last commit, tag?)

Feel free to propose some more.

khdlr commented 3 years ago

Agreed, this is a very good idea. Regarding the "processing version", we will usually have two (or even three) steps where code versions can in theory differ:

Preprocessing
Training
Inference

So maybe we could have "pipeline metadata" not only for the final inference results, but also for the steps in between. Just as we already log the config file used for training, we could also log the current git HEAD when preprocessing the data and when training, so that for the final inference we can retrace e.g.:

preprocessing:
  date: 2020-09-17
  git: 3e01b17d90e357a42f47e9cbc3495086fefc9a50
training:
  date: 2020-09-30
  git: fe1451743b0a36e8d585d8851631c80ed8eb39df
inference:
  date: 2020-10-19
  git: cf7c607470f0ff482e605ded59a6172c41660230
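
A minimal sketch of how such a per-stage record could be collected, assuming the code runs inside a git checkout. The function names (`current_git_commit`, `stage_record`, `add_stage`) are hypothetical and not part of the repository, and JSON is used only because it ships with the standard library; YAML would work just as well.

```python
import json
import subprocess
from datetime import datetime, timezone


def current_git_commit(repo_dir="."):
    """Return the full hash of HEAD, or None if git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def stage_record(commit, when=None):
    """Build the metadata entry for one pipeline stage (hypothetical layout)."""
    when = when or datetime.now(timezone.utc)
    return {"timestamp": when.isoformat(timespec="seconds"), "git": commit}


def add_stage(metadata, stage, commit, when=None):
    """Return a copy of `metadata` with an entry for `stage` added."""
    updated = dict(metadata)
    updated[stage] = stage_record(commit, when)
    return updated


if __name__ == "__main__":
    # Each step appends its own entry to the metadata inherited from the previous step.
    meta = add_stage({}, "preprocessing", current_git_commit())
    print(json.dumps(meta, indent=2))
```

Training and inference would each load the metadata written by the previous step and call `add_stage` again, so the final results carry the whole chain.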

Or is that overkill? It would also enable us to e.g. issue warnings when training with incompatible datasets and fun stuff like that :)
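
The warning idea could look roughly like this. What counts as "incompatible" is not defined in this thread, so the rule below (the preprocessing commit recorded with the dataset differs from the commit of the code now running) is purely an assumption, as is the function name `check_dataset_compatibility`.

```python
import warnings


def check_dataset_compatibility(dataset_meta, current_commit):
    """Return True if the dataset was preprocessed with `current_commit`.

    Emits a warning and returns False otherwise. The comparison rule is an
    assumption made for illustration, not an agreed definition.
    """
    recorded = dataset_meta.get("preprocessing", {}).get("git")
    if recorded is None:
        warnings.warn("Dataset has no preprocessing metadata; compatibility cannot be verified.")
        return False
    if recorded != current_commit:
        warnings.warn(
            f"Dataset was preprocessed at commit {recorded}, "
            f"but the current code is at {current_commit}."
        )
        return False
    return True
```

A stricter or looser rule (e.g. comparing tags instead of exact commits) would only change the comparison inside the function.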

khdlr commented 3 years ago

[Nitze, Ingmar] That's pretty detailed and a good idea :). We have thousands of files anyway, so one more metadata file per dataset doesn't make a big difference.

So we may have tracking files for

(preprocessed) Datasets
- version: setup_raw_data.py
- version: prepare_data.py
- timestamp? (better than date alone?)
- other?
Inference datasets
- model + checkpoint
- timestamp
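
As a rough sketch, the tracking file for an inference dataset could be a small JSON file written next to the outputs. The file name `metadata.json`, the field names, and both function names are assumptions made for illustration; nothing here reflects an existing convention in the repository.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def inference_record(model, checkpoint, when=None):
    """Collect the fields listed above for one inference dataset."""
    when = when or datetime.now(timezone.utc)
    return {
        "model": model,
        "checkpoint": str(checkpoint),
        # A full timestamp rather than only the date
        "timestamp": when.isoformat(timespec="seconds"),
    }


def write_tracking_file(dataset_dir, record, name="metadata.json"):
    """Write `record` as JSON into `dataset_dir` and return the file path."""
    path = Path(dataset_dir) / name
    path.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return path
```

The same writer could serve the preprocessed datasets, with the script versions stored in place of the model and checkpoint.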

We discussed data traceability in one of my other projects. I am in contact with the PI of the Arctic Data Center (a data repository), and this is exactly one of their main topics.

Let's keep this topic open and we can have an implementation ready for an upcoming version, which also relates to #38 :)

khdlr commented 3 years ago

Exactly, we can always throw away data we don't need later :smile:

Timestamps are better than just the date, agreed.

One more thing that comes to mind: even when we log the current git commit, the code that actually runs can of course differ from the code in that commit. The GitPython package would allow us to check this (e.g. warn the user if the git diff is not empty).
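
A possible shape for that check, assuming GitPython is installed (`pip install GitPython`). `Repo.is_dirty()` and `repo.head.commit.hexsha` are GitPython calls; the two function names and the `dirty` field are hypothetical.

```python
import warnings


def describe_code_state(commit, dirty):
    """Combine a commit hash and a dirty flag into one metadata entry."""
    return {"git": commit, "dirty": bool(dirty)}


def code_state_from_repo(repo_dir="."):
    """Read HEAD and the dirty flag via GitPython, warning on uncommitted changes."""
    import git  # GitPython, imported lazily so the helper above works without it

    repo = git.Repo(repo_dir, search_parent_directories=True)
    # untracked_files=True also flags new files that were never added to git
    dirty = repo.is_dirty(untracked_files=True)
    if dirty:
        warnings.warn(
            "Working tree has uncommitted changes; "
            "the logged commit does not fully describe the code that ran."
        )
    return describe_code_state(repo.head.commit.hexsha, dirty)
```

Storing the flag alongside the hash means a result produced from modified code stays identifiable later, even if the warning itself was overlooked.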