ksw0306 / FloWaveNet

A Pytorch implementation of "FloWaveNet: A Generative Flow for Raw Audio"

Distributed Training with Apex #22

Closed · 1ytic closed this 5 years ago

1ytic commented 5 years ago

The Apex utilities (https://github.com/NVIDIA/apex) handle some issues with specific nodes in the FloWaveNet architecture.

List of changes made in train.py (sketched in the code blocks below):

  1. Determine local_rank and world_size for torch.distributed.init_process_group
  2. Set a current device with torch.cuda.set_device
  3. Wrap dataset with torch.utils.data.distributed.DistributedSampler
  4. Apply amp.scale_loss at each backward pass
  5. Clip gradient with amp.master_params
  6. Divide step_size by world_size (not sure if this is necessary)
  7. Initialize model and optimizer with amp.initialize
  8. Wrap model with apex.parallel.DistributedDataParallel
  9. Handle evaluation and messages on the first node using args.local_rank
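A rough sketch of the setup side of these changes (steps 1–3 and 6–8), assuming `model`, `optimizer`, `train_dataset`, and `step_size` are already defined as in train.py; the argument names and the `opt_level` here are placeholders, not necessarily the exact identifiers used in train_apex.py:

```python
import argparse

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

from apex import amp
from apex.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # injected by torch.distributed.launch
parser.add_argument('--num_workers', type=int, default=2)
parser.add_argument('--batch_size', type=int, default=8)
args = parser.parse_args()

# 1-2. One process per GPU; bind this process to its device.
dist.init_process_group(backend='nccl', init_method='env://')
world_size = dist.get_world_size()
torch.cuda.set_device(args.local_rank)

# 3. Each process draws a disjoint shard of the training set.
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(train_dataset, batch_size=args.batch_size,
                          sampler=train_sampler, num_workers=args.num_workers,
                          pin_memory=True)

# 6. The effective global batch grows with world_size, so the LR scheduler's
#    step_size is divided accordingly (possibly unnecessary, as noted above).
step_size = step_size // world_size

# 7. Patch model and optimizer for mixed precision.
model, optimizer = amp.initialize(model.cuda(), optimizer, opt_level='O1')

# 8. Wrap the model so gradients are all-reduced across processes.
model = DistributedDataParallel(model)
```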

For example, to run on 4 GPUs: `python -m torch.distributed.launch --nproc_per_node=4 train_apex.py --num_workers 2 --epochs 1000`
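Continuing from the setup sketch above, the per-iteration changes (steps 4, 5, and 9) amount to something like the following; the loss expression and clipping threshold are illustrative only, not the exact code in train_apex.py:

```python
for x, c in train_loader:
    x, c = x.cuda(args.local_rank), c.cuda(args.local_rank)
    optimizer.zero_grad()

    # Flow training maximizes log p(z) + log|det(Jacobian)| (illustrative loss).
    log_p, logdet = model(x, c)
    loss = -(log_p + logdet)

    # 4. Scale the loss so FP16 gradients do not underflow.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    # 5. Clip the FP32 master gradients rather than the FP16 model gradients.
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), 1.0)
    optimizer.step()

    # 9. Evaluation and progress messages only on the first process.
    if args.local_rank == 0:
        print(f'loss: {loss.item():.4f}')
```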

Resolves: #13 See also: #16

L0SG commented 5 years ago

Huge thanks for the Apex & DistributedDataParallel integration (plus the nicer tqdm bar)! We also verified that the log-determinant changes across iterations when using this (rather than DataParallel), so the current incompatibility appears to be specific to a reference-counting issue in DataParallel.