WideResNets for CIFAR10/100 implemented in PyTorch. This implementation requires less GPU memory than the official Torch implementation: https://github.com/szagoruyko/wide-residual-networks.
Example:
python train.py --dataset cifar100 --layers 40 --widen-factor 4
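For reference, the --layers and --widen-factor flags follow the usual CIFAR WideResNet convention (depth = 6n + 4, with channel widths scaled by the widen factor). A minimal sketch of that mapping, assuming the standard WideResNet layout rather than this repo's exact code:

```python
def wrn_config(depth, widen_factor):
    # CIFAR WideResNets have depth = 6n + 4, with n blocks per group
    # (a sketch of the standard convention, not this repo's code).
    assert (depth - 4) % 6 == 0, "depth should be 6n + 4"
    n = (depth - 4) // 6
    # Channel widths of the stem and the three residual groups,
    # scaled by the widen factor.
    widths = [16, 16 * widen_factor, 32 * widen_factor, 64 * widen_factor]
    return n, widths

# --layers 40 --widen-factor 4 => 6 blocks per group, widths [16, 64, 128, 256]
print(wrn_config(40, 4))
```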
The script run.py will run train.py and write summary CSVs into output/{today}. It runs the commands
python train.py --qsgd=1 # use QSGD coding
python train.py --compress=1 --svd_rank=0 --svd_rescale=1 # use SVD coding
python train.py --compress=0 # use normal SGD with the param server
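For intuition, QSGD-style coding stochastically quantizes each gradient coordinate to one of s levels, so the result is unbiased and cheap to transmit. A rough numpy sketch of the idea (following the QSGD scheme of Alistarh et al., not this repo's codings implementation):

```python
import numpy as np

def qsgd_encode(v, s=256):
    """QSGD-style stochastic quantization of a gradient vector v into
    s levels (a sketch of the idea, not this repo's coding)."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return norm, np.sign(v), np.zeros_like(v, dtype=np.int64)
    level = np.abs(v) / norm * s          # in [0, s]
    lower = np.floor(level)
    # Round up with probability equal to the fractional part (unbiased).
    prob = level - lower
    q = lower + (np.random.rand(*v.shape) < prob)
    return norm, np.sign(v), q.astype(np.int64)

def qsgd_decode(norm, sign, q, s=256):
    # Reconstruct: each coordinate is a multiple of norm / s.
    return norm * sign * q / s

v = np.random.randn(1000)
norm, sign, q = qsgd_encode(v)
v_hat = qsgd_decode(norm, sign, q)
# Per-coordinate quantization error is at most ||v|| / s.
assert np.max(np.abs(v_hat - v)) <= norm / 256 + 1e-12
```

Only the norm, the signs, and the small integers q need to be sent, which is far fewer bits than the raw float32 gradient.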
Note that extra arguments are added to each of these commands.
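That composition might look like the following sketch; the extra arguments shown here are hypothetical, not necessarily what run.py appends:

```python
# Hedged sketch of how run.py might append shared arguments to each base
# command; the extra_args values are illustrative, not run.py's actual flags.
base_cmds = [
    "python train.py --qsgd=1",
    "python train.py --compress=1 --svd_rank=0 --svd_rescale=1",
    "python train.py --compress=0",
]
extra_args = {"dataset": "cifar10", "layers": 40}  # hypothetical extras

for cmd in base_cmds:
    full_cmd = cmd + "".join(f" --{k}={v}" for k, v in extra_args.items())
    print(full_cmd)
    # run.py would hand each string to os.system(full_cmd)
```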
pytorch_ps_mpi/
    ps.py
    mpi_comms.py
    codings/
        coding.py
        ...
train.py
run.py
pytorch_ps_mpi: The package that manages distributed training. It is a separate Git repository at https://github.com/stsievert/pytorch_ps_mpi (and is unfortunately named).

ps.py: The main file in this package, which holds the different optimizers. These are slightly modified from the torch.optim optimizers. Here, we code the gradients asynchronously with the gradient computation, then wait for all codings to finish before sending them.

mpi_comms.py: The script that serializes the gradients and sends them. This file contains a class Iallgather, which is what we use (it is named after the MPI primitive that does what we want).

codings
: The package with the different coding schemes. The base coding class is in coding.py.

train.py: The main training script. run.py calls this with an os.system call, e.g.,

mpirun -n 3 -hostfile hosts --map-by ppr:1:node python train.py
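For intuition about the SVD coding toggled by --compress=1, here is a rough numpy sketch of low-rank gradient compression (an illustration of the idea, not the code in codings/coding.py):

```python
import numpy as np

def svd_encode(grad, rank=3):
    """Compress a gradient matrix by keeping only its top singular
    directions (a sketch of SVD coding, not the repo's implementation)."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank]

def svd_decode(u, s, vt):
    return (u * s) @ vt

# A "gradient" that is exactly rank 2: the product of two thin matrices.
rng = np.random.default_rng(0)
g = rng.standard_normal((64, 2)) @ rng.standard_normal((2, 32))

u, s, vt = svd_encode(g, rank=2)
g_hat = svd_decode(u, s, vt)
# Keeping rank-2 factors reconstructs this matrix up to floating point.
assert np.allclose(g, g_hat)
# Sending the factors is much cheaper than sending g itself:
sent = u.size + s.size + vt.size   # 64*2 + 2 + 2*32 = 194 floats
assert sent < g.size               # vs 64*32 = 2048 floats
```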
A quick speed test with 2 p2.xlarges and 34 layers:
And with 100 layers:
How will this change as the number of workers increases?