iterative / example-repos-dev

Source code and generator scripts for example DVC projects
https://dvc.org/doc
21 stars 13 forks source link

fix dvc exp/git date format bug #141

Closed pmrowla closed 1 year ago

pmrowla commented 1 year ago

The generate scripts set GIT_COMMITTER_DATE and GIT_AUTHOR_DATE using unix timestamps but do not include a timezone offset. DVC uses git libraries (dulwich, libgit2/pygit) that conform to the git spec:

It is <unix-timestamp> <time-zone-offset>, where <unix-timestamp> is the number of seconds since the UNIX epoch. <time-zone-offset> is a positive or negative offset from UTC. For example CET (which is 1 hour ahead of UTC) is +0100.

https://git-scm.com/docs/git-commit#_date_formats

This causes dvc exp run to fail in the generate scripts from being unable to generate git commits (due to the invalid date format). This issue did not come up before because CLI git makes the assumption that the user forgot to include a timezone offset (even though the spec requires it) and falls back to some default instead.

Also, the scripts do not unset the date variables before using exp run, which ends up causing every exp commit to have the same timestamp (since exp run cannot be forced to use the tick() functionality from the generation scripts for generating fake timestamps).

This PR:

Technically only the unset is actually required here, but we should really be setting the env vars properly even if CLI git handles the invalid timestamps

pmrowla commented 1 year ago

Also, if we really care about preserving the fake timestamps, exps should always be generated individually using the tick() script method to control commit timestamps like

tick
dvc exp run --set-param ...

tick
dvc exp run --set-param ...

Currently the generate scripts always queue up several exps and then run them all at once via the queue, so individual commit timestamps cannot be controlled unless we set them all to the same timestamp.

Should also note that in the event we ever generate checkpoint example repos, we could not control individual checkpoint timestamps this way (at best we could have identical timestamps within a single checkpoint run, but with different controlled timestamps for separate checkpoint runs).

shcheklein commented 1 year ago

thanks @pmrowla , good catch! I've seen the followup to the SCM repo - is it about improving the error message for such cases?

pmrowla commented 1 year ago

thanks @pmrowla , good catch! I've seen the followup to the SCM repo - is it about improving the error message for such cases?

Yes.