Open woshiyyya opened 1 year ago
Thanks for putting this together. These are great points. Let me add some notes to it.
Many examples in our docs load checkpoints from in-memory Python objects. Examples like this are not very helpful.
This is also reflected in bugs like this one: https://github.com/ray-project/ray/issues/32284
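To make this concrete, the docs could show the full round trip through an on-disk checkpoint directory rather than an in-memory object. A stdlib-only sketch of that pattern (the `save_checkpoint`/`load_checkpoint` helpers here are hypothetical illustrations, not Ray's actual API):

```python
import json
import os
import tempfile

def save_checkpoint(directory, state):
    """Write checkpoint state as files into a directory (hypothetical helper)."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "state.json"), "w") as f:
        json.dump(state, f)

def load_checkpoint(directory):
    """Restore checkpoint state from the files inside a directory."""
    with open(os.path.join(directory, "state.json")) as f:
        return json.load(f)

# Round-trip through a directory on disk, not a live Python object.
with tempfile.TemporaryDirectory() as ckpt_dir:
    save_checkpoint(ckpt_dir, {"epoch": 3, "loss": 0.25})
    state = load_checkpoint(ckpt_dir)

print(state)  # {'epoch': 3, 'loss': 0.25}
```

An example in this shape also survives process restarts, which is exactly the situation the in-memory examples cannot demonstrate.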
What files will be generated and what's the content inside?
This is also reflected in this thread: https://discuss.ray.io/t/do-trial-checkpoints-need-unique-names-pytorch-tutorial/9316
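The naming question from that thread could be answered with a short sketch. Assuming a per-trial scheme of zero-padded, iteration-numbered directory names (an assumption for illustration, not a statement of Ray's exact layout):

```python
import os
import tempfile

def checkpoint_dir_name(iteration):
    # Zero-padding makes names unique per iteration and keeps
    # lexicographic order equal to iteration order.
    return f"checkpoint_{iteration:06d}"

with tempfile.TemporaryDirectory() as trial_dir:
    for it in (1, 5, 10):
        os.makedirs(os.path.join(trial_dir, checkpoint_dir_name(it)))
    names = sorted(os.listdir(trial_dir))

print(names)  # ['checkpoint_000001', 'checkpoint_000005', 'checkpoint_000010']
```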
There is a feature that automatically strips the "module." prefix from parameter names. We should document this feature explicitly somewhere.
This behavior was "sort of" documented by https://github.com/ray-project/ray/pull/31791. Maybe we should find it a better home. Where do you have in mind?
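For reference, the behavior being discussed can be sketched in plain Python. Assuming a PyTorch `DataParallel`-style state dict whose keys carry a `module.` prefix, the stripping amounts to:

```python
def strip_module_prefix(state_dict):
    """Remove the leading "module." that torch.nn.DataParallel adds to keys."""
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Keys from a DataParallel-wrapped model (plain lists stand in for tensors).
wrapped = {"module.layer1.weight": [1.0], "module.layer1.bias": [0.0]}
print(strip_module_prefix(wrapped))
# {'layer1.weight': [1.0], 'layer1.bias': [0.0]}
```

Documenting this explicitly matters because a user who saves from a wrapped model and loads into an unwrapped one would otherwise hit key-mismatch errors.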
I found a good doc, A Guide To Using Checkpoints, but it mainly discusses how to save a checkpoint through `Ray.Tuner`. We need more examples of how to load checkpoints, and how to save & load checkpoints with `Ray.Trainer`.
@woshiyyya On point 1, this guide is actually changed for 2.3.0, and we definitely need another one that's actually a "guide to using checkpoints" as you mention. This is the new version of that guide (less emphasis on checkpointing): https://docs.ray.io/en/master/tune/tutorials/tune-storage.html
@justinvyu Have we addressed this as part of the checkpoint documentation revamp shipped with Ray 2.7?
Description
1. Checkpoint Loading
I found a good doc A Guide To Using Checkpoints, but it mainly discusses how to save a checkpoint through `Ray.Tuner`. We need more examples of how to load checkpoints, and how to save & load checkpoints with `Ray.Trainer`.

2. Checkpoint Directory Structure
Currently there's no doc explaining our log directory structure. I think it'd be good to have the following contents:
- What files will be generated and what's the content inside?
- How to restore a checkpoint from the directory with `Checkpoint.from_directory()`?
- We should use `Checkpoint.from_directory()` instead of `model.load_state_dict(torch.load(PATH))`. Most users intend to use the latter one at the beginning.

3. MISC