albertz opened this issue 4 years ago
(The assignment is just so that you keep track of this, as it might be relevant for you. Of course, feel free to participate more actively as well, e.g. by stating certain constraints or wishes. Maybe you can comment on how the current draft is not compatible with what you need, what should be added, or what the priority should be.)
Some update: An initial implementation exists now. However, it mostly just covers the cluster setup and starting the TF servers, and it is mainly intended for between-graph replication. Beyond that, not much is implemented yet. So this is a good base for further work, but as-is it is not a complete solution for distributed training yet.
The main goal in any case: efficient/fast training on various hardware (single-node multi-GPU (consumer GPUs...), multi-node multi-GPU, TPU) and environments (SGE cluster with common hardware, AWS, or GCE). It does not matter too much in the end whether we use the Horovod ops or the distributed TF functions; we should just see what works best. The dataset pipeline might be related to this.
See the overview of distributed TensorFlow in general (independent of RETURNN) for some background.
This issue here is about the specific implementation in RETURNN.
This is somewhat orthogonal to the current Horovod implementation (documentation). We should allow for both distributed TF and Horovod, and maybe even mix them.
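As an illustration only (not RETURNN code; all ports and helper names below are assumptions): Horovod and distributed TF can in principle coexist in one process, e.g. Horovod for averaging gradients or parameters across replicas, and a distributed TF server for anything that needs remote access to a worker's resources (such as a shared dataset pipeline).

```python
# Hedged sketch only: Horovod and a distributed TF server set up side by side
# in the same process. Single-host toy example; in a real multi-node setup the
# worker addresses would come from the cluster setup (see the bootstrap sketch
# further below).
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()  # MPI-based; gives us rank/size and allreduce ops


def average_across_replicas(tensor):
  """Average a tensor (gradient or parameter) over all Horovod replicas."""
  return hvd.allreduce(tensor)  # Horovod averages by default


# Additionally run a distributed TF server in each process, so other replicas
# could access this worker's resources (e.g. dataset queues).
cluster = tf.train.ClusterSpec(
    {"worker": ["localhost:%i" % (12000 + i) for i in range(hvd.size())]})
server = tf.distribute.Server(cluster, job_name="worker", task_index=hvd.rank())
```

Which of the two mechanisms handles which part (gradients, parameters, dataset) is exactly the design question of this issue.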
This is also related to the new dataset pipeline (#292).
We collect experience for all possible variations of hardware, cluster environment, software (Horovod, distributed TF, new/old dataset pipeline), and strategy here in the wiki.
Rough draft:
- `TFDistributed.py` with all related functionality.
- We do not use `tf.distribute.Strategy` (by default) (although we might want to allow for that as well).
- Similar to `horovod_reduce_type = "param"` (or our Theano implementation): every replica has its own copy of the variables (no parameter server), they would do independent (async) training (and between-graph replication), and at certain points (e.g. every N training steps), they would synchronize by averaging the parameters.
- On the SGE cluster, we would use a parallel environment (`qsub -pe`), which has basic MPI communication across hosts (although we might not need the MPI communication at all, as distributed TF has its own communication).
- The TF server port could be `1000 + int(os.environ["CUDA_VISIBLE_DEVICES"]) * 100` or sth like that. We should make sure this would not collide with other services.
- If we have MPI, we might use mpi4py and can communicate that way, e.g. to get the rank (`MPI.COMM_WORLD.rank`) / size (`MPI.COMM_WORLD.size`) or communicate other information (see the sketch after this list). Otherwise, the SGE PE will probably provide some other, more direct way to get this information as well. Edit: Done via MPI now. See doc in code.
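To make the last points of the draft more concrete, here is a minimal, hedged sketch (assumptions only, not the actual RETURNN implementation; see the doc in the code for that): mpi4py provides rank/size, the port is derived from `CUDA_VISIBLE_DEVICES` as in the draft, the addresses are exchanged via MPI, and then each process starts its own TF server.

```python
# Hedged sketch of the bootstrap described above (run under mpirun / SGE PE).
# Assumes one process per GPU, with CUDA_VISIBLE_DEVICES set to a single index.
import os
import socket

import tensorflow as tf
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size

# Port scheme from the draft ("or sth like that"); must not collide with
# other services on the host.
port = 1000 + int(os.environ["CUDA_VISIBLE_DEVICES"]) * 100

# Exchange "host:port" of every process via MPI to build the cluster spec.
addresses = comm.allgather("%s:%i" % (socket.gethostname(), port))
cluster = tf.train.ClusterSpec({"worker": addresses})

# Between-graph replication: every process builds its own graph and runs its
# own TF server as one "worker" task.
server = tf.distribute.Server(cluster, job_name="worker", task_index=rank)
print("Worker %i/%i: TF server at %s" % (rank, size, addresses[rank]))
```

The actual code in `TFDistributed.py` also derives this information via MPI (as noted in the draft), but details such as job names, port selection, and how the server is then used may differ.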