google-deepmind / xmanager

A platform for managing machine learning experiments
Apache License 2.0

Multi-Worker Distributed Training #16

Open sahilpatelsp opened 2 years ago

sahilpatelsp commented 2 years ago

I was wondering what the appropriate way to launch multi-worker distributed training jobs with xmanager is. Based on my current understanding, a Job must be created for each worker pool and all these Jobs must be combined into a single JobGroup, which is then added to the experiment (a sketch of this pattern follows below). There also seems to be an option to add Constraints to the JobGroup, but I cannot find what specific forms these constraints may take besides the provided example of xm_impl.SameMachine(). Furthermore, my current attempt at launching a multi-worker distributed training job raises the following error when creating the distribution strategy with strategy = tf.distribute.MultiWorkerMirroredStrategy(): RuntimeError: Collective ops must be configured at program startup. Both the CLUSTER_SPEC and TF_CONFIG environment variables seem to be set correctly, and the distribution strategy is created at the very beginning of the main function, so I am wondering whether this error might be caused by not setting appropriate Constraints on the JobGroup.
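For reference, here is a minimal sketch of the Job-per-worker-pool / JobGroup pattern I described above. The base image, module name, worker count, and resource requirements are placeholders for illustration, not a verified working configuration:

```python
from xmanager import xm
from xmanager import xm_local

NUM_WORKERS = 2  # placeholder worker-pool size

with xm_local.create_experiment(experiment_title='multiworker-tf') as experiment:
  spec = xm.PythonContainer(
      path='.',
      base_image='gcr.io/deeplearning-platform-release/tf2-gpu.2-6',  # placeholder
      entrypoint=xm.ModuleName('train'),  # hypothetical training module
  )

  [executable] = experiment.package([
      xm.Packageable(
          executable_spec=spec,
          executor_spec=xm_local.Vertex.Spec(),
      ),
  ])

  # One Job per worker pool, combined into a single JobGroup that is added
  # to the experiment as one unit.
  jobs = {
      f'worker_{i}': xm.Job(
          executable=executable,
          executor=xm_local.Vertex(requirements=xm.JobRequirements(cpu=4)),
          args={'worker_index': i},
      )
      for i in range(NUM_WORKERS)
  }
  experiment.add(xm.JobGroup(**jobs))
```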

andrewluchen commented 2 years ago

We should include a TF distributed training experiment in the examples.

There is a PyTorch one that you can look at: https://github.com/deepmind/xmanager/blob/main/examples/cifar10_torch/launcher.py

sahilpatelsp commented 2 years ago

Thanks for pointing that out! I modified my code to align with the provided example, the main change being the use of the async/await syntax of Python's asyncio library. However, I am still getting RuntimeError: Collective ops must be configured at program startup when calling strategy = tf.distribute.MultiWorkerMirroredStrategy() at the very beginning of the main function of the training file. I am unsure what is responsible for this error, given that I am creating the strategy before calling any other TensorFlow API, as per https://github.com/tensorflow/tensorflow/blob/3f878cff5b698b82eea85db2b60d65a2e320850e/tensorflow/python/distribute/collective_all_reduce_strategy.py#L155.
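For context, the ordering that MultiWorkerMirroredStrategy expects in each worker process looks roughly like the sketch below. The cluster addresses are placeholders, and in practice TF_CONFIG would be injected per worker by the launcher rather than hard-coded; the point is only that TF_CONFIG must exist and the strategy must be built before any other TensorFlow op runs:

```python
import json
import os

# TF_CONFIG must be present before TensorFlow initializes collective ops.
# Hard-coded here only to illustrate the expected structure; a launcher
# would normally set this per worker via env_vars.
os.environ.setdefault('TF_CONFIG', json.dumps({
    'cluster': {'worker': ['worker0:2222', 'worker1:2222']},  # placeholder hosts
    'task': {'type': 'worker', 'index': 0},
}))

import tensorflow as tf  # imported after TF_CONFIG is set


def main():
  # The strategy must be constructed before any other TensorFlow op runs;
  # otherwise collective ops are already configured and the RuntimeError
  # discussed above is raised.
  strategy = tf.distribute.MultiWorkerMirroredStrategy()

  with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer='sgd', loss='mse')
  # ... build the per-worker dataset and call model.fit() here.


if __name__ == '__main__':
  main()
```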