[AIR] Add more details about checkpoint loading and saving.

woshiyyya commented 1 year ago

Description

1. Checkpoint Loading

Many examples in our docs load checkpoints from in-memory Python objects.
- Examples like this are not very helpful.
A more realistic scenario for users would be:
- They have two separate scripts/jobs, one for training and one for evaluation and inference.
- They load checkpoints from disk or cloud storage.

I found a good doc, A Guide To Using Checkpoints, but it mainly discusses how to save a checkpoint through Ray.Tuner. We need more examples of how to load checkpoints, and how to save and load checkpoints with Ray.Trainer.
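To make the two-script scenario concrete, here is a minimal, framework-agnostic sketch using only the Python standard library. The function names, the `state.json` file name, and the directory name are hypothetical and are not Ray APIs. The point is only that the evaluation side receives a path, not a live Python object:

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(checkpoint_dir: Path, state: dict) -> None:
    """Training side: persist the state to a directory on disk."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    (checkpoint_dir / "state.json").write_text(json.dumps(state))


def load_checkpoint(checkpoint_dir: Path) -> dict:
    """Evaluation side: rebuild the state from the directory path alone."""
    return json.loads((checkpoint_dir / "state.json").read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # In practice the two calls below would live in separate scripts/jobs.
        ckpt_dir = Path(tmp) / "my_checkpoint"  # hypothetical directory name
        save_checkpoint(ckpt_dir, {"epoch": 3, "weights": [0.1, 0.2]})
        restored = load_checkpoint(ckpt_dir)
        print(restored["epoch"])
```

A doc example in this shape would presumably swap the two helpers for the Trainer's checkpoint saving and `Checkpoint.from_directory()`, but the control flow (write in one job, read by path in another) stays the same.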

2. Checkpoint Directory Structure

Currently there's no doc explaining our log directory structure. I think it'd be good to cover the following:

What files will be generated, and what do they contain?
How to use the checkpoint?
- Which directory should be passed to Checkpoint.from_directory()?
- Teach users to use Checkpoint.from_directory() instead of model.load_state_dict(torch.load(PATH)). Most users reach for the latter at first.
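As an illustration of the "which directory?" question, the sketch below picks the most recent checkpoint directory inside a trial directory. It assumes a layout with zero-padded `checkpoint_NNNNNN` subdirectories; that layout and the helper name `latest_checkpoint_dir` are assumptions for illustration, not something this issue confirms:

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_checkpoint_dir(trial_dir: Path) -> Optional[Path]:
    """Return the highest-numbered `checkpoint_*` subdirectory, or None.

    Assumes zero-padded names, so lexicographic order matches numeric order.
    """
    candidates = sorted(p for p in trial_dir.glob("checkpoint_*") if p.is_dir())
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        trial_dir = Path(tmp)
        for name in ("checkpoint_000000", "checkpoint_000001"):
            (trial_dir / name).mkdir()
        print(latest_checkpoint_dir(trial_dir).name)
```

The returned path is the kind of argument one would expect to hand to `Checkpoint.from_directory()`, which is exactly the detail the docs should spell out.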

3. MISC

There is a feature that automatically strips the "module." prefix from parameter names. We should document this feature explicitly somewhere.
We make overly simplistic assumptions about what will be saved in a checkpoint: only one model is saved, with "model" as its key. What if we need to save multiple models, or additional info? How can we restore them?
- e.g. for a GAN:
```
checkpoint = {
    "netG": netG.state_dict(),
    "netD": netD.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}
```
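A rough sketch of what saving and restoring such a multi-component checkpoint could look like follows. It uses `pickle` and plain dicts as stand-ins for the real `state_dict()` values so that it runs without PyTorch; the helper names are hypothetical and this is not how Ray currently stores checkpoints:

```python
import pickle
import tempfile
from pathlib import Path


def save_components(path: Path, **components: dict) -> None:
    """Write every named component into a single checkpoint file."""
    with path.open("wb") as f:
        pickle.dump(dict(components), f)


def load_components(path: Path, required: tuple) -> dict:
    """Read the checkpoint back and fail early if a component is missing."""
    with path.open("rb") as f:
        checkpoint = pickle.load(f)
    missing = [key for key in required if key not in checkpoint]
    if missing:
        raise KeyError(f"checkpoint is missing components: {missing}")
    return checkpoint


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "gan_checkpoint.pkl"
        # Plain dicts stand in for netG.state_dict(), netD.state_dict(), etc.
        save_components(path, netG={"w": [1.0]}, netD={"w": [2.0]},
                        optimizer_state={"lr": 0.001})
        restored = load_components(path, required=("netG", "netD"))
        print(sorted(restored))
```

Checking for the required keys on load gives a clear error when a checkpoint written with a single "model" key is passed to code that expects several components.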

Link

No response

xwjiang2010 commented 1 year ago

Thanks for putting this together. These are great points. Let me add some notes to it.

> Many examples in our docs load checkpoints from in-memory Python objects. Examples like this are not very helpful.

This is also reflected in bugs like this one: https://github.com/ray-project/ray/issues/32284

> What files will be generated and what's the content inside?

This is also reflected in this thread: https://discuss.ray.io/t/do-trial-checkpoints-need-unique-names-pytorch-tutorial/9316

> There is a feature that automatically strips the "module." prefix from parameter names. We should document this feature explicitly somewhere.

This behavior was "sort of" documented by https://github.com/ray-project/ray/pull/31791. Maybe we should find a better home for it. Where do you have in mind?
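For readers unfamiliar with the behavior under discussion: models wrapped in PyTorch's `DataParallel` or `DistributedDataParallel` produce state-dict keys that start with "module.". The sketch below shows what stripping that prefix amounts to, using a plain dict as a stand-in for a state dict. The function name is hypothetical and this is not Ray's actual implementation:

```python
def strip_module_prefix(state_dict: dict, prefix: str = "module.") -> dict:
    """Return a copy of `state_dict` with a leading `prefix` removed from keys.

    Keys that do not start with the prefix are kept unchanged.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


if __name__ == "__main__":
    wrapped = {"module.fc.weight": [0.5], "module.fc.bias": [0.1]}
    print(strip_module_prefix(wrapped))
```

Whatever home the documentation ends up in, an example like this makes it clear why a checkpoint saved from a wrapped model can be loaded into an unwrapped one.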

justinvyu commented 1 year ago

> I found a good doc, A Guide To Using Checkpoints, but it mainly discusses how to save a checkpoint through Ray.Tuner. We need more examples of how to load checkpoints, and how to save and load checkpoints with Ray.Trainer.

@woshiyyya On point 1, this guide was actually changed for 2.3.0, and we definitely need another one that's actually a "guide to using checkpoints," as you mention. This is the new version of that guide (with less emphasis on checkpointing): https://docs.ray.io/en/master/tune/tutorials/tune-storage.html

anyscalesam commented 1 year ago

@justinvyu Have we addressed this as part of the checkpoint documentation revamp shipped with Ray 2.7?
