albertz opened this issue 4 years ago
(The assignment is just so that you keep track of this, as it might be relevant for you. Of course, feel free to participate more actively as well, e.g. by stating certain constraints or wishes. Maybe you can comment on how the current draft is not compatible with what you need, what should be added, or what the priority should be.)
Some update: An initial implementation exists now. However, it mostly just covers the cluster setup and starting the TF servers, and it is mainly intended for between-graph replication. Beyond that, not much is implemented yet. So this is a good base for further work, but as-is it is not a complete solution for distributed training yet.
The main goal in any case: efficient/fast training on various hardware (single-node multi-GPU (consumer GPUs...), multi-node multi-GPU, TPU) and environments (SGE cluster with common hardware, AWS, or GCE). It does not matter too much in the end whether we use the Horovod ops or the distributed TF functions; we should just see what works best. The dataset pipeline might be related to this.
See the overview of distributed TensorFlow in general (independent of RETURNN) for some background.
This issue here is about the specific implementation in RETURNN.
This is somewhat orthogonal to the current Horovod implementation (documentation). We should allow for both distributed TF and Horovod, and maybe even mix them.
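As an illustration only (not RETURNN code; all ports and helper names below are assumptions): Horovod and distributed TF can in principle coexist in one process, e.g. Horovod for averaging gradients or parameters across replicas, and a distributed TF server for anything that needs remote access to a worker's resources (such as a shared dataset pipeline).

```python
# Hedged sketch only: Horovod and a distributed TF server set up side by side
# in the same process. Single-host toy example; in a real multi-node setup the
# worker addresses would come from the cluster setup (see the bootstrap sketch
# further below).
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()  # MPI-based; gives us rank/size and allreduce ops


def average_across_replicas(tensor):
  """Average a tensor (gradient or parameter) over all Horovod replicas."""
  return hvd.allreduce(tensor)  # Horovod averages by default


# Additionally run a distributed TF server in each process, so other replicas
# could access this worker's resources (e.g. dataset queues).
cluster = tf.train.ClusterSpec(
    {"worker": ["localhost:%i" % (12000 + i) for i in range(hvd.size())]})
server = tf.distribute.Server(cluster, job_name="worker", task_index=hvd.rank())
```

Which of the two mechanisms handles which part (gradients, parameters, dataset) is exactly the design question of this issue.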
This is also related to the new dataset pipeline (#292).
We collect experience for all possible variations of hardware, cluster environment, software (Horovod, distributed TF, new/old dataset pipeline), and strategy here in the wiki.
Rough draft:
- `TFDistributed.py` with all related functionality.
- We do not use `tf.distribute.Strategy` (by default) (although we might want to allow for that as well).
- Similar to `horovod_reduce_type = "param"` (or our Theano implementation): every replica has its own copy of the variables (no parameter server), they would do independent (async) training (and between-graph replication), and at certain points (e.g. every N training steps), they would synchronize by averaging the parameters.
- On the SGE cluster, we would use a parallel environment (`qsub -pe`), which has basic MPI communication across hosts (although we might not need the MPI communication at all, as distributed TF has its own communication).
- The TF server port could be `1000 + int(os.environ["CUDA_VISIBLE_DEVICES"]) * 100` or sth like that. We should make sure this would not collide with other services.
- If we have MPI, we might use mpi4py and can communicate that way, e.g. to get the rank (`MPI.COMM_WORLD.rank`) / size (`MPI.COMM_WORLD.size`) or communicate other information (see the sketch after this list). Otherwise, the SGE PE will probably provide some other, more direct way to get this information as well. Edit: Done via MPI now. See doc in code.
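To make the last points of the draft more concrete, here is a minimal, hedged sketch (assumptions only, not the actual RETURNN implementation; see the doc in the code for that): mpi4py provides rank/size, the port is derived from `CUDA_VISIBLE_DEVICES` as in the draft, the addresses are exchanged via MPI, and then each process starts its own TF server.

```python
# Hedged sketch of the bootstrap described above (run under mpirun / SGE PE).
# Assumes one process per GPU, with CUDA_VISIBLE_DEVICES set to a single index.
import os
import socket

import tensorflow as tf
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size

# Port scheme from the draft ("or sth like that"); must not collide with
# other services on the host.
port = 1000 + int(os.environ["CUDA_VISIBLE_DEVICES"]) * 100

# Exchange "host:port" of every process via MPI to build the cluster spec.
addresses = comm.allgather("%s:%i" % (socket.gethostname(), port))
cluster = tf.train.ClusterSpec({"worker": addresses})

# Between-graph replication: every process builds its own graph and runs its
# own TF server as one "worker" task.
server = tf.distribute.Server(cluster, job_name="worker", task_index=rank)
print("Worker %i/%i: TF server at %s" % (rank, size, addresses[rank]))
```

The actual code in `TFDistributed.py` also derives this information via MPI (as noted in the draft), but details such as job names, port selection, and how the server is then used may differ.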