ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[AIR] Add more details about checkpoint loading and saving. #32249

Open woshiyyya opened 1 year ago

woshiyyya commented 1 year ago

Description

1. Checkpoint Loading

I found a good doc, A Guide To Using Checkpoints, but it mainly discusses how to save a checkpoint through Ray.Tuner. We need more examples of how to load checkpoints, and how to save and load checkpoints with Ray.Trainer.

2. Checkpoint Directory Structure

Currently there's no doc explaining our log directory structure. I think it'd be good to have the following contents:

3. MISC

Link

No response

xwjiang2010 commented 1 year ago

Thanks for putting this together. These are great points. Let me add some notes to it.

Many examples in our doc load checkpoints from in-memory Python objects. Examples like this are not very helpful.

This is also reflected in bugs like this one: https://github.com/ray-project/ray/issues/32284
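To make the contrast concrete, here is a minimal, library-free sketch of the kind of example being asked for: the checkpoint is written to a directory and then loaded back from that path, rather than being passed around as an in-memory object. It deliberately does not use the Ray API; the helper names (`save_checkpoint`, `load_checkpoint`), the `state.json` file name, and the `checkpoint_000003` directory name are all hypothetical and chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(state: dict, directory: Path) -> Path:
    """Persist `state` under `directory` (hypothetical layout: state.json)."""
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "state.json").write_text(json.dumps(state))
    return directory


def load_checkpoint(directory: Path) -> dict:
    """Load the state back from disk, using only the directory path."""
    return json.loads((directory / "state.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    # Save to a directory, then restore knowing nothing but its path.
    ckpt_dir = save_checkpoint(
        {"epoch": 3, "lr": 0.01}, Path(tmp) / "checkpoint_000003"
    )
    restored = load_checkpoint(ckpt_dir)
```

A doc example in this shape shows users what they have to do after a run has finished and only the files on disk remain.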

What files will be generated and what's the content inside?

This is also reflected in this thread: https://discuss.ray.io/t/do-trial-checkpoints-need-unique-names-pytorch-tutorial/9316

There is a feature that automatically strips the "module." prefix from parameter names. We should document this feature explicitly somewhere.

This behavior was "sort of" documented by https://github.com/ray-project/ray/pull/31791. Maybe we should find it a better home. What location do you have in mind?
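For readers unfamiliar with the behavior, the sketch below shows what stripping a "module." prefix from parameter names looks like. The prefix typically appears when a PyTorch model is wrapped in DistributedDataParallel or DataParallel. This is an illustration under assumptions, not Ray's actual implementation: `strip_prefix` is a hypothetical helper, and plain lists stand in for tensors so the snippet runs without PyTorch.

```python
def strip_prefix(state_dict: dict, prefix: str = "module.") -> dict:
    """Return a copy of `state_dict` with `prefix` removed from matching keys.

    Hypothetical helper for illustration; keys without the prefix are kept as-is.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Keys as they would look when saved from a wrapped model.
wrapped = {"module.fc.weight": [0.1, 0.2], "module.fc.bias": [0.0]}

# Keys as an unwrapped model expects them when loading.
plain = strip_prefix(wrapped)
```

Because the renaming happens silently, a user comparing a saved checkpoint's keys against their wrapped model's keys may be surprised by the mismatch, which is why the behavior is worth documenting explicitly.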

justinvyu commented 1 year ago

I found a good doc, A Guide To Using Checkpoints, but it mainly discusses how to save a checkpoint through Ray.Tuner. We need more examples of how to load checkpoints, and how to save and load checkpoints with Ray.Trainer.

@woshiyyya On point 1, this guide has actually changed for 2.3.0, and we definitely need another one that's actually a "guide to using checkpoints," as you mention. This is the new version of that guide (with less emphasis on checkpointing): https://docs.ray.io/en/master/tune/tutorials/tune-storage.html

anyscalesam commented 1 year ago

@justinvyu have we addressed this as part of the checkpoint documentation revamp shipped with Ray 2.7?