I think we need to move it to DVC tbh; it's a bigger discussion that we need to have and agree on. We've discussed this briefly with @daavoo today and agreed that we at least want to discuss one thing that checkpoints give us: the ability to recover. For this we need either a replacement plan (outside), or to simplify the existing logic somehow, or something else. It's worth summarizing all the potential benefits of the existing solution (if it were implemented correctly), again to make sure that we are not missing anything.
I agree that `checkpoint: true`/`dvc.api.make_checkpoint()` is not particularly useful for users, and that it's better to just provide the mechanism for a user's stage to generate an intermediate commit through the ML frameworks (which we already do with dvclive).
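For illustration, a minimal sketch of that kind of per-epoch tracking through DVCLive (assuming a recent DVCLive 2.x; `train_one_epoch`, `evaluate`, and `save_model` are hypothetical user code, not part of DVC or DVCLive):

```python
from dvclive import Live

# Hypothetical training helpers provided by the user's project.
from my_project.train import train_one_epoch, evaluate, save_model

with Live() as live:
    for epoch in range(10):
        train_one_epoch()
        # Log per-epoch metrics; DVCLive records them per step.
        live.log_metric("accuracy", evaluate())
        # Keep the latest model on disk so it can be resumed from.
        save_model("model.pt")
        # Advance to the next step (epoch) in the DVCLive log.
        live.next_step()
```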
It seems like what we actually need in place of `checkpoint: true` is proper support for circular dependencies (the existing `checkpoint: true`/`persist: true` flags are really just hacks to work around the fact that DVC doesn't support circular dependencies).
> Resumed experiments are not distinguishable after persisting.
IMO this is a problem with the fact that `exp apply; git add .; git commit` is the default suggested workflow for persisting an experiment. With `exp branch` (or an alternative `git checkout`-based workflow) the git commits prior to resuming are available after persisting the experiment.
> With `exp branch` (or an alternative `git checkout`-based workflow) the git commits prior to resuming are available after persisting the experiment
My point was they would be equivalent at the level our UIs currently compare (except for the plots): Studio compare, exp/params diff, etc.
> It seems like what we actually need in place of `checkpoint: true` is proper support for circular dependencies (the existing `checkpoint: true`/`persist: true` flags are really just hacks to work around the fact that DVC doesn't support circular dependencies)
To give you more context: from the product perspective we have been discussing that the cherry-picking use case of checkpoints is not valuable enough, and we would like to think about the shortest path to enabling resuming from an interrupted training run.
So, putting the current implementation aside (at least initially, to not bias the discussion) and focusing on that use case:
In a workspace, single-experiment scenario, this already works by using `persist`, a `resume` param, and letting DVCLive and the framework handle the rest in the Python code (a rough sketch is below). The idea is that we don't care about all iterations and we can assume that the current model in the workspace is always the one we want to resume from.
Maybe we can start from that and think about the minimum changes needed to support recovering in other scenarios, like a remote instance that might get shut down.
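As a rough sketch of that workspace scenario, assuming a `params.yaml` with `resume` and `epochs` keys and a `model.pt` output marked `persist: true` in `dvc.yaml` (the `my_project` helpers are placeholders, not part of DVC):

```python
import os

import torch
import yaml

# Hypothetical model/training code from the user's project.
from my_project.model import build_model, train

params = yaml.safe_load(open("params.yaml"))

model = build_model()
start_epoch = 0

# If `resume` is set and a persisted model exists in the workspace,
# pick up training from that state instead of starting from scratch.
if params["resume"] and os.path.exists("model.pt"):
    checkpoint = torch.load("model.pt")
    model.load_state_dict(checkpoint["model"])
    start_epoch = checkpoint["epoch"] + 1

train(model, start_epoch=start_epoch, epochs=params["epochs"])

# Save the latest state so the next (resumed) run can pick it up.
torch.save({"model": model.state_dict(), "epoch": params["epochs"] - 1}, "model.pt")
```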
> Is there agreement on both of these points?
For 2, I have been meaning to open https://github.com/iterative/dvclive/issues/505 for a while now to discuss a possible solution.
> Is there agreement on both of these points?
I can only speak for myself, but I agree, and https://github.com/iterative/dvclive/issues/505 looks like the direction I would prefer to go.
Got a timely new request for auto-recovery with non-Python tools. As discussed there, one way to solve that use case is to save the final outputs of a "failed" experiment so you could resume from that state.
> Got a timely new request for auto-recovery with non-Python tools. As discussed there, one way to solve that use case is to save the final outputs of a "failed" experiment so you could resume from that state.
Anything that DVCLive does/will do could be replicated by non-Python tools (background `dvc push`, Studio REST API calls).
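For example, something along these lines could push the latest checkpoint in the background during training. It is shown in Python only for brevity; since it just shells out to the real `dvc push` command, a shell script or any other language could do the same (the file name and interval are illustrative assumptions):

```python
import subprocess
import threading
import time


def push_periodically(targets=("model.pt",), interval=600):
    """Run `dvc push` on the given targets every `interval` seconds."""
    while True:
        # Push the persisted outputs so a fresh machine can resume from them.
        subprocess.run(["dvc", "push", *targets], check=False)
        time.sleep(interval)


# Run as a daemon thread next to a Python training loop; for non-Python
# tools the same loop could run as a separate watcher process, with a
# final `dvc push` once training finishes.
threading.Thread(target=push_periodically, daemon=True).start()
```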
Spoke with @shcheklein and @mattseddon and agreed that we can start by deprecating it from the UI (don't show it in `dvc exp show` or in the VS Code views).
> Spoke with @shcheklein and @mattseddon and agreed that we can start by deprecating it from the UI (don't show it in `dvc exp show` or in the VS Code views).
We also agreed that when I take on the work of integrating the new `--json` format in #9170 I am going to gamble and not integrate checkpoints (unless necessary for some remaining feature). I'll scope and create a linked issue in vscode-dvc for that work.
@pmrowla See above where @mattseddon is planning to drop checkpoints from the experiments table. We can do the same in the CLI table if it will save us maintenance time, and it might even be useful so that we can see if anyone complains.
Don't know what the use case is yet, but the yolo repo uploads every checkpoint to its hub:
```python
from time import time

# LOGGER and PREFIX are defined elsewhere in the ultralytics codebase.

def on_model_save(trainer):
    session = getattr(trainer, 'hub_session', None)
    if session:
        # Upload checkpoints with rate limiting
        is_best = trainer.best_fitness == trainer.fitness
        if time() - session.timers['ckpt'] > session.rate_limits['ckpt']:
            LOGGER.info(f'{PREFIX}Uploading checkpoint {session.model_id}')
            session.upload_model(trainer.epoch, trainer.last, is_best)
            session.timers['ckpt'] = time()  # reset timer
```
Originally posted by @shcheklein in https://github.com/iterative/dvc.org/issues/4415#issuecomment-1476721592
I don't really think we need a built-in replacement/solution in DVC to handle the checkpoints use case (and I am still unsure what that use case is).
People should handle interruption and resuming through the ML framework, and DVC already provides convenient tools to wrap it (params, `persist`, run-cache). My main points about dropping checkpoints are:
1. The current solution doesn't provide value while coming at a significant cost in code/docs maintenance.
2. It induces users into incorrect workflows, and the required changes in the code are not properly explained anywhere.
3. It introduces ambiguous/unexpected behavior when executing more complex/realistic pipelines (i.e. how are downstream stages after `checkpoint: true` supposed to be executed? what about `foreach` or a `dvc.yaml` with more than one model?).

As an example of the second point, here are the things that are "incorrect" in this repo (the same applies to the example in https://dvc.org/doc/user-guide/experiment-management/checkpoints):
The `state_dict` of the optimizer should also be considered when loading/saving. I would dare to say that using a fixed learning rate would never result in a better model than using any kind of lr scheduler, and the scheduler would also need to be handled when loading/saving (which connects with the issues in the next point).
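A minimal sketch of what handling the optimizer (and scheduler) state means in PyTorch terms (the file layout and epoch bookkeeping here are just illustrative, not what this repo does):

```python
import torch


def save_checkpoint(path, epoch, model, optimizer, scheduler):
    # Persist everything needed to resume training exactly where it stopped,
    # not just the model weights.
    torch.save(
        {
            "epoch": epoch,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model, optimizer, scheduler):
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    scheduler.load_state_dict(checkpoint["scheduler"])
    # Resume from the epoch after the last completed one.
    return checkpoint["epoch"] + 1
```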
When picking a checkpoint and resuming from it, the `epochs` param is now treated as `epochs + epochs_completed_at_checkpoint`, which differs from its meaning when training without resuming, where the `epochs` param reflects the total number.

Let's say we have a completed experiment that was using checkpoints:
If I run:
It is not possible to reproduce the experiment with a single command. We would have to run the exact combination of `exp apply` and `exp run`. It is not possible to reproduce the experiment at all if the checkpoints are deleted.

Let's say I have another experiment completed using checkpoints:
And I run:
Persisting this experiment or the one from the previous point will result in an equivalent state in the repo regarding `params` and the `step` metric, even though the training that led to the resulting model is completely different.