iterative / dvc-checkpoints-mnist

Example of checkpoints in a DVC project training a simple convolutional neural net to classify MNIST data

handle KeyboardInterrupt cleanly #2

Closed pmrowla closed 3 years ago

pmrowla commented 3 years ago

The example should wrap the actual loop in a try/except block to handle ctrl-c properly. Without the exception handling, resuming via dvc exp run will not behave as expected (it will create "new" runs rather than properly extending the run that was killed via ctrl-c).

dvc exp show --no-pager
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━┳━━━━━━━━┓
┃ Experiment            ┃ Created  ┃ step ┃    acc ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━╇━━━━━━━━┩
│ workspace             │ -        │   20 │ 0.9836 │
│ live                  │ 03:04 AM │    - │      - │
│ │ ╓ exp-60572         │ 05:10 PM │   20 │ 0.9836 │
│ │ ╟ 1234c39           │ 05:10 PM │   19 │ 0.9838 │
│ │ ╟ ebe6885           │ 05:10 PM │   18 │ 0.9831 │
│ │ ╟ 28cb5d0           │ 04:56 PM │   17 │ 0.9827 │
│ │ ╟ a171e56           │ 04:56 PM │   16 │ 0.9843 │
│ │ ╟ 6d5ea4d           │ 04:56 PM │   15 │ 0.9833 │
│ │ ╟ b5d2afc           │ 04:56 PM │   14 │ 0.9829 │
│ │ ╟ 3565bef           │ 04:55 PM │   13 │ 0.9828 │
│ │ ╟ 8f7d86a           │ 04:55 PM │   12 │ 0.9817 │
│ │ ╟ ff738b3           │ 04:55 PM │   11 │ 0.9776 │
│ │ ╟ d796a99           │ 04:55 PM │   10 │ 0.9765 │
│ │ ╟ 16928a6           │ 04:55 PM │    9 │ 0.9794 │
│ │ ╟ 780ff04 (47bddc3) │ 04:55 PM │    8 │ 0.9729 │
│ │ ╓ exp-b1d87         │ 04:54 PM │    7 │  0.976 │
│ │ ╟ fe7f0f2 (d6e50f1) │ 04:53 PM │    6 │ 0.9651 │
│ │ ╓ exp-bea83         │ 04:53 PM │    5 │ 0.9568 │
│ │ ╟ 2b530c6           │ 04:53 PM │    4 │ 0.9592 │
│ │ ╟ 713a9d5 (a87bc18) │ 04:53 PM │    3 │ 0.9436 │
│ │ ╓ exp-2e840         │ 04:52 PM │    2 │ 0.9198 │
│ │ ╟ 51f5d90           │ 04:52 PM │    1 │ 0.9093 │
│ ├─╨ c6bb3db           │ 04:52 PM │    0 │ 0.8577 │
└───────────────────────┴──────────┴──────┴────────┘

In this example, the separate experiments at the bottom appear because ctrl-c was not handled properly. The contiguous run at the top of the table consists of several ctrl-c interruptions and resumed runs made after this change.

dberenbaum commented 3 years ago

Weird, I can't seem to reproduce on my end. Does the interruption have to be timed to a specific point in the code to trigger the issue? I'm assuming it's not specific to the live branch?

Is it expected behavior? In https://dvc.org/doc/command-reference/exp/run#checkpoints, this workflow is documented and doesn't mention error handling. The code example in https://dvc.org/doc/api-reference/make_checkpoint also doesn't have this.

pmrowla commented 3 years ago

It is documented already:

"If the process gets interrupted (e.g. with [Ctrl] C or by an error), all the checkpoints so far will be preserved. When a run finishes normally, a final checkpoint will be added (if needed) to wrap up the experiment."

The difference is that without the proper exception handling, your process will exit with an error code (rather than finishing normally). It also depends on when you press ctrl-c: specifically, on whether or not your process wrote any new changes to the workspace before you hit ctrl-c.
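The exit-status difference described here can be seen in a small sketch. It is only an illustration and not part of the example repo: the two child scripts are hypothetical stand-ins for a training script, and `raise KeyboardInterrupt` simulates pressing ctrl-c.

```python
import subprocess
import sys

# Hypothetical stand-ins for a training script; `raise KeyboardInterrupt`
# simulates the user pressing ctrl-c partway through a run.
UNHANDLED = "raise KeyboardInterrupt"
HANDLED = """
try:
    raise KeyboardInterrupt
except KeyboardInterrupt:
    pass  # stop gracefully and fall through to a normal exit
"""


def exit_status(script: str) -> int:
    """Run `script` in a child Python interpreter and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", script],
        stderr=subprocess.DEVNULL,  # hide the traceback from the unhandled case
    )
    return result.returncode


unhandled_status = exit_status(UNHANDLED)  # non-zero: the run counts as failed
handled_status = exit_status(HANDLED)      # 0: the run counts as finished normally
```

The exact non-zero value in the unhandled case varies by platform, so the sketch only distinguishes zero from non-zero.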

Any changes in your workspace from between the last generated checkpoint and the point at which you forcefully killed your process are left in your workspace (as untracked/uncommitted changes). The next time you run dvc exp run, DVC sees those changes as modifications to the last generated checkpoint and will start a "new" experiment branch.

If you handle the ctrl-c properly, your process will always exit gracefully without an error code, so DVC will create the final commit to save the last workspace state.
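A minimal sketch of this kind of handling, assuming a training loop shaped roughly like the one in this repo. The names `train`, `run_epoch`, and `save_checkpoint` are hypothetical and are not taken from the actual example code; in a real script, `save_checkpoint` would write the model and metrics and then signal DVC (for instance via the `make_checkpoint` API referenced in this thread).

```python
from typing import Callable


def train(
    epochs: int,
    run_epoch: Callable[[int], float],
    save_checkpoint: Callable[[int, float], None],
) -> int:
    """Run up to `epochs` epochs and return how many completed.

    KeyboardInterrupt (ctrl-c) is caught so the function returns normally,
    which lets the process exit without an error code.
    """
    completed = 0
    try:
        for epoch in range(epochs):
            acc = run_epoch(epoch)       # ctrl-c may arrive during this call
            save_checkpoint(epoch, acc)  # hypothetical: persist state, notify DVC
            completed = epoch + 1
    except KeyboardInterrupt:
        # Swallow the interrupt instead of letting it propagate as an error.
        print(f"Interrupted; stopping after {completed} completed epoch(s).")
    return completed
```

This sketch does not guard against an interrupt arriving while `save_checkpoint` is partway through writing files; a real script may want to handle that case as well.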

"The code example in https://dvc.org/doc/api-reference/make_checkpoint also doesn't have this."

This example is intentionally simplistic.

If the user is writing their process in a way that they expect to be using ctrl-c as a way to gracefully stop their process, they should handle it properly. If they want ctrl-c to make their program exit and return an error code, then they don't need any extra exception handling.

DVC just handles whatever the user's process does appropriately (i.e. if it exits with an error code, DVC treats it as an error).

dberenbaum commented 3 years ago

Right, I didn't think about changes made after the last checkpoint I guess. I see a few follow-ups then:

  1. Apply the same changes to other branches of this repo.
  2. Discuss whether this needs to be documented more clearly.

shcheklein commented 3 years ago

"Discuss whether this needs to be documented more clearly."

💯 I would expect this to be put as a recommendation somewhere