I think we need to move it to DVC tbh; it's a bigger discussion that we need to have and agree on. We've discussed this briefly with @daavoo today and agreed that we at least want to discuss one thing that checkpoints give us: the ability to recover. For this we need either a replacement plan (outside), or to simplify the existing logic somehow, or something else. It's worth summarizing all the potential benefits of the existing solution (if it were implemented correctly), again to make sure that we are not missing anything.
I agree that `checkpoint: true`/`dvc.api.make_checkpoint()` is not particularly useful for users, and that it's better to just provide the mechanism for a user's stage to generate an intermediate commit through the ML frameworks (which we already do with dvclive).
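For illustration, a minimal sketch of that kind of per-epoch tracking through DVCLive (assuming a recent DVCLive 2.x; `train_one_epoch`, `evaluate`, and `save_model` are hypothetical user code, not part of DVC or DVCLive):

```python
from dvclive import Live

# Hypothetical training helpers provided by the user's project.
from my_project.train import train_one_epoch, evaluate, save_model

with Live() as live:
    for epoch in range(10):
        train_one_epoch()
        # Log per-epoch metrics; DVCLive records them per step.
        live.log_metric("accuracy", evaluate())
        # Keep the latest model on disk so it can be resumed from.
        save_model("model.pt")
        # Advance to the next step (epoch) in the DVCLive log.
        live.next_step()
```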
It seems like what we actually need in place of `checkpoint: true` is proper support for circular dependencies (the existing `checkpoint: true`/`persist: true` flags are really just hacks to work around the fact that DVC doesn't support circular dependencies).
> Resumed experiments are not distinguishable after persisting.
IMO this is a problem with the fact that `exp apply; git add .; git commit` is the default suggested workflow for persisting an experiment. With `exp branch` (or an alternative `git checkout`-based workflow) the git commits prior to resuming are available after persisting the experiment.
> With `exp branch` (or an alternative `git checkout`-based workflow) the git commits prior to resuming are available after persisting the experiment
My point was they would be equivalent at the level our UIs currently compare (except for the plots): Studio compare, exp/params diff, etc.
> It seems like what we actually need in place of `checkpoint: true` is proper support for circular dependencies (the existing `checkpoint: true`/`persist: true` flags are really just hacks to work around the fact that DVC doesn't support circular dependencies)
To give you more context: from the product perspective we have been discussing that the cherry-picking use case of checkpoints is not valuable enough, and we would like to think about the shortest path to enabling resuming from an interrupted training run.
So, putting the current implementation aside (at least initially, to not bias the discussion) and focusing on that use case:
In a workspace, single-experiment scenario, this already works by using `persist`, a `resume` param, and letting DVCLive and the framework handle the rest in the Python code (a rough sketch is below). The idea is that we don't care about all iterations and we can assume that the current model in the workspace is always the one we want to resume from.
Maybe we can start from that and think about the minimum changes needed to support recovering in other scenarios, like a remote instance that might get shut down.
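As a rough sketch of that workspace scenario, assuming a `params.yaml` with `resume` and `epochs` keys and a `model.pt` output marked `persist: true` in `dvc.yaml` (the `my_project` helpers are placeholders, not part of DVC):

```python
import os

import torch
import yaml

# Hypothetical model/training code from the user's project.
from my_project.model import build_model, train

params = yaml.safe_load(open("params.yaml"))

model = build_model()
start_epoch = 0

# If `resume` is set and a persisted model exists in the workspace,
# pick up training from that state instead of starting from scratch.
if params["resume"] and os.path.exists("model.pt"):
    checkpoint = torch.load("model.pt")
    model.load_state_dict(checkpoint["model"])
    start_epoch = checkpoint["epoch"] + 1

train(model, start_epoch=start_epoch, epochs=params["epochs"])

# Save the latest state so the next (resumed) run can pick it up.
torch.save({"model": model.state_dict(), "epoch": params["epochs"] - 1}, "model.pt")
```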
> Is there agreement on both of these points?
For 2, I have been meaning to open https://github.com/iterative/dvclive/issues/505 for a while now to discuss a possible solution.
> Is there agreement on both of these points?
I can only speak for myself, but I agree, and https://github.com/iterative/dvclive/issues/505 looks like the direction I would prefer to go.
Got a timely new request for auto-recovery with non-Python tools. As discussed there, one way to solve that use case is to save the final outputs of a "failed" experiment so you could resume from that state.
> Got a timely new request for auto-recovery with non-Python tools. As discussed there, one way to solve that use case is to save the final outputs of a "failed" experiment so you could resume from that state.
Anything that DVCLive does/will do could be replicated by non-Python tools (background `dvc push`, Studio REST API calls).
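For example, something along these lines could push the latest checkpoint in the background during training. It is shown in Python only for brevity; since it just shells out to the real `dvc push` command, a shell script or any other language could do the same (the file name and interval are illustrative assumptions):

```python
import subprocess
import threading
import time


def push_periodically(targets=("model.pt",), interval=600):
    """Run `dvc push` on the given targets every `interval` seconds."""
    while True:
        # Push the persisted outputs so a fresh machine can resume from them.
        subprocess.run(["dvc", "push", *targets], check=False)
        time.sleep(interval)


# Run as a daemon thread next to a Python training loop; for non-Python
# tools the same loop could run as a separate watcher process, with a
# final `dvc push` once training finishes.
threading.Thread(target=push_periodically, daemon=True).start()
```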
Spoke with @shcheklein and @mattseddon and agreed that we can start by deprecating it from the UI (don't show it in `dvc exp show` or in the VS Code views).
> Spoke with @shcheklein and @mattseddon and agreed that we can start by deprecating it from the UI (don't show it in `dvc exp show` or in the VS Code views).
We also agreed that when I take on the work of integrating the new `--json` format in #9170 I am going to gamble and not integrate checkpoints (unless necessary for some remaining feature). I'll scope and create a linked issue in vscode-dvc for that work.
@pmrowla See above where @mattseddon is planning to drop checkpoints from the experiments table. We can do the same in the CLI table if it will save us maintenance time, and it might even be useful so that we can see if anyone complains.
Don't know what the use case is yet, but the yolo repo uploads every checkpoint to its hub:
```python
from time import time

# LOGGER and PREFIX are defined elsewhere in the ultralytics codebase.

def on_model_save(trainer):
    session = getattr(trainer, 'hub_session', None)
    if session:
        # Upload checkpoints with rate limiting
        is_best = trainer.best_fitness == trainer.fitness
        if time() - session.timers['ckpt'] > session.rate_limits['ckpt']:
            LOGGER.info(f'{PREFIX}Uploading checkpoint {session.model_id}')
            session.upload_model(trainer.epoch, trainer.last, is_best)
            session.timers['ckpt'] = time()  # reset timer
```
Originally posted by @shcheklein in https://github.com/iterative/dvc.org/issues/4415#issuecomment-1476721592
I don't really think we need a built-in replacement/solution in DVC to handle the checkpoints use case (and I am still unsure what that use case is).
People should handle interruption and resuming through the ML framework, and DVC already provides convenient tools to wrap it (params, `persist`, run-cache). My main points about dropping checkpoints are:
1. The current solution doesn't provide value while coming at a significant cost in code/docs maintenance.
2. It induces users into incorrect workflows, and the required changes in the code are not properly explained anywhere.
3. It introduces ambiguous/unexpected behavior when executing more complex/realistic pipelines (i.e. how are downstream stages after `checkpoint: true` supposed to be executed? what about `foreach` or a `dvc.yaml` with more than one model?).

As an example of the second point, here are the things that are "incorrect" in this repo (the same applies to the example in https://dvc.org/doc/user-guide/experiment-management/checkpoints):
The `state_dict` of the optimizer should also be considered when loading/saving. I would dare to say that using a fixed learning rate would never result in a better model than using any kind of lr scheduler, and the scheduler would also need to be handled when loading/saving (which connects with the issues in the next point).
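A minimal sketch of what handling the optimizer (and scheduler) state means in PyTorch terms (the file layout and epoch bookkeeping here are just illustrative, not what this repo does):

```python
import torch


def save_checkpoint(path, epoch, model, optimizer, scheduler):
    # Persist everything needed to resume training exactly where it stopped,
    # not just the model weights.
    torch.save(
        {
            "epoch": epoch,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model, optimizer, scheduler):
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    scheduler.load_state_dict(checkpoint["scheduler"])
    # Resume from the epoch after the last completed one.
    return checkpoint["epoch"] + 1
```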
When picking a checkpoint and resuming from it, the `epochs` param is now treated as `epochs + epochs_completed_at_checkpoint`, which differs from its meaning when training without resuming, where the `epochs` param reflects the total number.

Let's say we have a completed experiment that was using checkpoints:
If I run:
It is not possible to reproduce the experiment with a single command. We would have to run the exact combination of `exp apply` and `exp run`. It is not possible to reproduce the experiment at all if the checkpoints are deleted.

Let's say I have another experiment completed using checkpoints:
And I run:
Persisting this experiment or the one from the previous point will result in an equivalent state in the repo regarding `params` and the `step` metric, even though the training that led to the resulting model is completely different.