ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.23k stars 5.62k forks

[tune] Make sure we have checkpoint/restore examples in the docs (e.g. for xgboost) #15244

Open krfricke opened 3 years ago

krfricke commented 3 years ago

Some examples/tutorials in the docs cover tuning but not checkpointing, like this one: https://docs.ray.io/en/master/tune/tutorials/tune-xgboost.html

We should make sure we include examples for saving/restoring checkpoints, especially when using things like callbacks, as in the case with xgboost.

nikhil-sthalekar commented 3 years ago

Any chance someone can point me to an example that uses the Trainable Class API and checkpointing? I want to use the Class API so I can reuse actors when training, and then once I have the best model, immediately use that for some predictions. With the current documentation it's unclear how this should work.

krfricke commented 3 years ago

Hi @nikhil-sthalekar, does this section of the docs help? https://docs.ray.io/en/master/tune/api_docs/trainable.html#class-api-checkpointing
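For readers landing on this thread, here is a minimal sketch of the Class API checkpointing pattern that the linked section describes. It is not taken from the Ray docs: `CounterTrainable`, its `lr` config key, and the `state.json` file name are all hypothetical, and the "model" is just a running number. The method names `setup`, `step`, `save_checkpoint`, and `load_checkpoint` are the Trainable hooks, but what `save_checkpoint` should return and what `load_checkpoint` receives has varied between Ray versions, so check the docs for the version you run. The `ImportError` fallback is only there so the sketch can be exercised without Ray installed.

```python
import json
import os

try:
    from ray.tune import Trainable  # real base class when Ray is installed
except ImportError:
    Trainable = object  # fallback so the sketch can be exercised without Ray


class CounterTrainable(Trainable):
    """Hypothetical trainable whose 'model' is just a running score."""

    def setup(self, config):
        # Called once per trial with that trial's hyperparameters.
        self.lr = config.get("lr", 0.1)  # "lr" is an assumed config key
        self.score = 0.0
        self.steps_done = 0

    def step(self):
        # One training iteration; the returned dict is the reported result.
        self.steps_done += 1
        self.score += self.lr
        return {"score": self.score}

    def save_checkpoint(self, checkpoint_dir):
        # Write everything needed to resume into the directory we are given.
        path = os.path.join(checkpoint_dir, "state.json")
        with open(path, "w") as f:
            json.dump({"score": self.score, "steps_done": self.steps_done}, f)
        # Assumption: returning the directory is accepted by the Ray version in use.
        return checkpoint_dir

    def load_checkpoint(self, checkpoint):
        # Assumption: `checkpoint` is the directory written by save_checkpoint.
        with open(os.path.join(checkpoint, "state.json")) as f:
            state = json.load(f)
        self.score = state["score"]
        self.steps_done = state["steps_done"]
```

With Ray installed, a class like this would typically be passed to `tune.run` together with checkpointing options such as `checkpoint_freq` and `checkpoint_at_end`; for xgboost the same two hooks would save and reload the booster file instead of a JSON dict.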

nikhil-sthalekar commented 3 years ago

Hi @krfricke, using those docs I was able to get checkpointing to work some of the time, but otherwise the script using the class API was failing. I am getting similar errors using the function API and the TuneReportCheckpointCallback, but my script is able to load the "best checkpoint" from the training run. Right now I'm looking into using the durable wrapper for checkpointing.

krfricke commented 3 years ago

You're welcome to share your code on https://discuss.ray.io/ for feedback!

sjhermanek commented 11 months ago

Bumping this up -- I'd still be quite interested in a worked example here (particularly in the distributed setting using ray_xgboost)