tensorflow / agents

TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.

Setup for Actor-Learner API for Distributed Collection and Training in a Cluster #676

Open JCMiles opened 3 years ago

JCMiles commented 3 years ago

Hi Team, I'm trying to run the Actor-Learner API for Distributed Collection and Training as explained here: https://github.com/tensorflow/agents/tree/master/tf_agents/experimental/distributed/examples/sac, but on multiple machines.

Based on the Reverb docs, let's say I have 3 machines:

    A > IP: 227.57.48.210
    B > IP: 227.57.48.211
    C > IP: 227.57.48.212

    1. on machine A -> Run sac_reverb_server.py on port 8008

    2. on machine B -> Run sac_collect.py with:
          --replay_buffer_server_address='227.57.48.210:8008'
          --variable_container_server_address='227.57.48.210:8008'

    3. on machine C -> Run sac_train.py with:
          --replay_buffer_server_address='227.57.48.210:8008'
          --variable_container_server_address='227.57.48.210:8008'

    (see the connection sketch after this list for how steps 2 and 3 reach machine A)

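For reference, this is roughly how I understand the collect job on machine B reaches machine A through those two flags (a minimal sketch; the class names are the ones used in the TF-Agents distributed SAC example, but the exact constructor arguments may differ between versions):

    # Sketch only: the collect job (machine B) pointing at the Reverb server
    # on machine A. Exact arguments may vary between TF-Agents versions.
    import reverb
    from tf_agents.experimental.distributed import reverb_variable_container

    REVERB_ADDRESS = '227.57.48.210:8008'  # machine A

    # Replay client: trajectories collected on machine B are written over
    # the network to the replay table hosted on machine A.
    reverb_client = reverb.Client(REVERB_ADDRESS)

    # Variable container: pulls the latest policy variables that the learner
    # (machine C) pushes through the same Reverb server.
    variable_container = reverb_variable_container.ReverbVariableContainer(
        REVERB_ADDRESS)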
But in the example from the link above, both sac_reverb_server.py and sac_collect.py wait for sac_train.py to write the policies to a given folder on the same machine before running their respective operations. In a multi-machine setup, how can sac_reverb_server.py and sac_collect.py be told where to load the policy from? Is there a tf_agents built-in function or a defined procedure to manage that, or does this need to be implemented from scratch with a custom script?
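The coupling I mean is this part of sac_collect.py (quoted roughly from memory, so the constant and function names from tf_agents.train may differ in other versions): the collect job blocks until the learner has exported a saved policy under root_dir, which in the example is a local path on the same machine.

    # Rough sketch of the policy hand-off: sac_collect.py waits until
    # sac_train.py has written a collect policy under root_dir before acting.
    import os
    from tf_agents.train import learner
    from tf_agents.train.utils import train_utils

    root_dir = '/tmp/sac_experiment'  # must be visible to every job

    # Blocks until root_dir/policies/collect_policy exists, then loads it.
    collect_policy = train_utils.wait_for_policy(
        os.path.join(root_dir, learner.POLICY_SAVED_MODEL_DIR,
                     learner.COLLECT_POLICY_SAVED_MODEL_DIR),
        load_specs_from_pbtxt=True)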

tfboyd commented 3 years ago

@JCMiles I have failed before to set expectations. I am looking at distributed training (using multiple machines) on Google Cloud. To share the model, I think you can put it in a GCS bucket and use that as the location. I am 99% sure we use the TensorFlow checkpoint reader, which supports a number of network storage options that are not native file systems, e.g. GCS and maybe even S3 (although when I was on AWS I just mounted a shared disk...it has been 3+ years since I last used AWS). I suspect you want to scale larger, but you can get a lot of scale out of a single machine by spinning up a bunch of agents on it.
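On the GCS idea: something along these lines should just work, because tf.io.gfile (which the checkpoint and SavedModel readers go through) resolves gs:// paths directly; the bucket name here is made up:

    # Minimal sketch, assuming a GCS bucket that all three machines can read
    # and write. The bucket name is hypothetical.
    import tensorflow as tf

    ROOT_DIR = 'gs://my-tfagents-bucket/sac_experiment'

    # The learner (machine C) writes policies under ROOT_DIR; the collect job
    # (machine B) polls the same path. No shared local filesystem is needed.
    policy_dir = ROOT_DIR + '/policies/collect_policy'
    print(tf.io.gfile.exists(policy_dir))

In other words, pass the same bucket path as root_dir to every job instead of a local directory.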

I hope that, thanks to this other project, I can update those documents. But if you want, I am happy to chat back and forth to try to help here. I assigned myself so I should see comments; you can also @ me.

JCMiles commented 2 years ago

@tfboyd sorry for the delay, I just saw this. Thanks for your effort. It would be amazing to have clear documentation on the correct setup for this type of training pipeline. My timezone is UTC+2, so let me know what time works best for you. I'm available to chat tomorrow and on Friday.