petuum / autodist

Simple Distributed Deep Learning on TensorFlow
https://petuum.github.io/autodist
Apache License 2.0

Autodist on Ray using RaySGD API #61

Open odp opened 3 years ago

odp commented 3 years ago

This PR adds a RaySGD-style API to Autodist, enabling it to train models on a Ray cluster. The API defines a TFTrainer class that takes a strategy builder, a model creator, a data creator, and a train step, and runs the training job on a distributed Ray cluster. It follows the RaySGD API and is compatible with Ray Tune.

trainer = TFTrainer(strategy_builder, model_creator, data_creator, train_step)
trainer.step()
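For context, here is a minimal sketch of what the three creator callables might look like for a linear-regression job. The function bodies and signatures below are illustrative assumptions modeled on the RaySGD pattern, not the exact autodist API; TensorFlow is imported lazily inside each function so that each Ray worker process initializes its own TF runtime.

```python
def model_creator():
    # Assumed signature: build and return a fresh Keras model.
    # Called once per replica.
    import tensorflow as tf
    return tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])

def data_creator():
    # Assumed signature: return a tf.data.Dataset of (x, y) pairs,
    # here synthetic data for y = 2x + 1.
    import tensorflow as tf
    xs = tf.random.uniform((256, 1))
    ys = 2.0 * xs + 1.0
    return tf.data.Dataset.from_tensor_slices((xs, ys)).batch(32)

def train_step(model, inputs, optimizer):
    # Assumed signature: one gradient update. How gradients are
    # synchronized across replicas is decided by the autodist strategy.
    import tensorflow as tf
    x, y = inputs
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

These callables would then be passed to `TFTrainer(strategy_builder, model_creator, data_creator, train_step)` as in the snippet above.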

Internally it implements a TFRunner class, which represents a replica. All communication between the master and worker replicas happens through Ray's in-memory object store, so there is no dependence on remote file system locations or access rights, and SSH is not needed.
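As a rough analogy for this exchange, the sketch below uses a toy in-memory store in which values are shared by ID rather than written to a filesystem. This is purely illustrative of the pattern, not Ray's actual object store implementation.

```python
import uuid

class InMemoryObjectStore:
    """Toy stand-in (hypothetical) for an in-memory object store:
    values are published under an ID and fetched by ID, so replicas
    never read from or write to a remote filesystem."""

    def __init__(self):
        self._objects = {}

    def put(self, value):
        # Store a value and return an opaque object ID.
        oid = uuid.uuid4().hex
        self._objects[oid] = value
        return oid

    def get(self, oid):
        # Fetch a previously stored value by its ID.
        return self._objects[oid]

store = InMemoryObjectStore()

# The master publishes initial weights; a worker replica fetches
# them by ID instead of reading a checkpoint from a shared disk.
weights_id = store.put([0.0, 0.0])
replica_weights = store.get(weights_id)
```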

Moreover, the client code executed by each worker is also replicated using Ray, eliminating the need to copy the model code to a remote filesystem on each node. Users can run the example by installing Ray and running $ python linear_regression_ray.py.

Reference: https://docs.ray.io/en/master/raysgd/raysgd_tensorflow.html

Fixes #57