rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

Support for distributed TensorFlow: draft #296

Open albertz opened 4 years ago

albertz commented 4 years ago

See the overview of distributed TensorFlow in general (independent of RETURNN) for some background.

This issue here is about the specific implementation in RETURNN.

This is somewhat orthogonal to the current Horovod implementation (documentation). We should support both distributed TF and Horovod, and maybe even a mix of the two.
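For context, the Horovod side is plain data parallelism: every worker runs the same graph, and gradients are averaged via allreduce. A minimal sketch of what that looks like with the Horovod TensorFlow API (not RETURNN's actual integration code; the learning-rate scaling helper and default values are illustrative):

```python
def scale_learning_rate(base_lr, num_workers):
    """Common Horovod practice: scale the learning rate linearly with the
    number of workers, since the effective global batch size grows with them."""
    return base_lr * num_workers


def build_distributed_optimizer(base_lr=0.01):
    """Hypothetical sketch, not RETURNN code: wrap a TF optimizer so that
    gradients are allreduce-averaged across all Horovod workers."""
    import tensorflow.compat.v1 as tf  # RETURNN uses TF1-style graph mode
    import horovod.tensorflow as hvd

    hvd.init()  # one process per GPU; rank/size come from the MPI/Gloo env
    opt = tf.train.GradientDescentOptimizer(
        scale_learning_rate(base_lr, hvd.size()))
    # DistributedOptimizer inserts an allreduce into compute_gradients().
    return hvd.DistributedOptimizer(opt)
```

Whether this allreduce approach or the distributed-TF primitives win out is exactly what this issue is meant to sort out empirically.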

This is also related to the new dataset pipeline (#292).

We collect experience for all possible variations of hardware, cluster environment, software (Horovod, distributed TF, new/old dataset pipeline), and strategy here in the wiki.


Rough draft:

albertz commented 4 years ago

(The assignment is just so that you keep track of this, as it might be relevant for you. Of course, feel free to participate more actively as well, e.g. by stating constraints or wishes. Maybe you can comment on how the current draft is not compatible with what you need, what should be added, or what the priorities should be.)

albertz commented 4 years ago

Some update: an initial implementation is there now. However, it mostly covers only the cluster setup and starting the TF servers, and it is mainly intended for between-graph replication. Beyond that, not much is implemented yet. So this is a good base for further work, but as-is it is not a complete solution for distributed training.
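To make the "cluster setup and starting the TF servers" part concrete: with between-graph replication, every process builds the same cluster spec, starts its own in-process gRPC server, and then constructs its own copy of the graph against that server. A hedged sketch under assumptions (hostnames and the helper functions are made up for illustration, not RETURNN's actual code):

```python
def make_cluster_def(worker_hosts, ps_hosts=()):
    """Build the job -> address-list mapping that tf.train.ClusterSpec expects.
    In between-graph replication, every process passes the same mapping."""
    cluster = {"worker": list(worker_hosts)}
    if ps_hosts:
        cluster["ps"] = list(ps_hosts)
    return cluster


def start_worker_server(cluster_def, task_index):
    """Hypothetical helper: start this process's TF server for its task.
    task_index would come from the environment (e.g. an SGE task id).
    Calling this binds the port, so it is only defined here, not called."""
    import tensorflow as tf

    spec = tf.train.ClusterSpec(cluster_def)
    # Starts an in-process gRPC server; returns immediately (non-blocking).
    # Each worker then builds its own graph and opens a session on
    # server.target.
    return tf.distribute.Server(spec, job_name="worker", task_index=task_index)


# Example cluster definition with made-up hostnames:
cluster_def = make_cluster_def(worker_hosts=["node1:2222", "node2:2222"])
```

The current implementation roughly corresponds to this layer (cluster spec + server startup); the replication strategy on top of it is the part that is still open.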

The main goal in any case: efficient/fast training on a variety of hardware (single-node multi-GPU (consumer GPUs...), multi-node multi-GPU, TPU) and environments (an SGE cluster with commodity hardware, AWS, or GCE). In the end it does not matter too much whether we use the Horovod ops or the distributed TF functions; we should just see what works best. The dataset pipeline might be related to this.