This pull request extends ZeusDataLoader from single-GPU to single-node multi-GPU, to support distributed data-parallel training.
Implementation
ZeusDataLoader
Add a new parameter distributed = None | "dp" to decide which distributed strategy will be used. "dp" represents data parallel.
If data parallel mode is turned on, get the process's local rank. A process with local_rank == 0 is the master process.
Centralized management. The master process takes charge of everything, including:
Spawn/kill zeus_monitor
Set GPU power limits when profiling
Compute the cost function, and predict the cost after the next epoch to decide whether to stop early
Load/write to files
Logging (almost all of it; a few GPU-specific log messages are emitted by each process itself)
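The master-only management above can be sketched as follows. This is a pure-Python illustration, not the PR's actual code: only the "local_rank == 0 is the master" rule comes from this PR, and the function name is made up.

```python
# Hypothetical sketch of centralized management in data-parallel mode.
# Only the master process (local_rank == 0) makes the early-stop decision;
# worker processes defer to whatever the master decides.
def should_continue(local_rank, predicted_next_cost, cost_threshold):
    if local_rank != 0:
        # Workers never evaluate the cost threshold themselves.
        return True
    # Master: predict the cost after the next epoch and early-stop
    # if it would exceed the threshold.
    return predicted_next_cost <= cost_threshold
```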
example/imagenet
Enable "dp" mode
args.distributed == True maps to "dp". The value of args.distributed is derived from local_rank and args.multiprocessing_distributed.
If using dist.launch, local_rank is passed in by launch.py.
If using slurm/torchrun, set local_rank according to an environment variable.
If using torch.multiprocessing, the user has to manually set --multiprocessing-distributed. (This is the original launching method used by PyTorch in this script.)
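A sketch of how local_rank could be resolved across the three launching methods. The helper name is hypothetical; LOCAL_RANK and SLURM_LOCALID are the environment variables actually set by torchrun and SLURM, respectively.

```python
import os

def resolve_local_rank(cli_local_rank=None):
    # torch.distributed.launch passes --local_rank as a CLI argument.
    if cli_local_rank is not None:
        return cli_local_rank
    # torchrun exports LOCAL_RANK; SLURM exports SLURM_LOCALID.
    for var in ("LOCAL_RANK", "SLURM_LOCALID"):
        if var in os.environ:
            return int(os.environ[var])
    # Fall back to single-process (non-distributed) behavior.
    return 0
```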
Comments marked # ZEUS indicate the parts related to Zeus; # DATA PARALLEL marks the parts related to data parallelism.
Important changes to the original framework
ZeusDataLoader:
warmup_secs and profile_secs are replaced with warmup_iters and profile_iters, changing the profile window from time-based to iteration-based.
train_batch_size is scaled because in examples/imagenet/train.py the actual batch_size passed to each ZeusDataLoader is batch_size / num_gpus. ZeusDataLoader scales it back up for consistency with file names (e.g., the power JSON).
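The scaling can be illustrated with a hypothetical helper; only the batch_size / num_gpus relationship is from this PR.

```python
def effective_train_batch_size(per_gpu_batch_size, num_gpus):
    # train.py hands each ZeusDataLoader batch_size / num_gpus, so the
    # loader multiplies back up to recover the global batch size used
    # in file names such as the power JSON.
    return per_gpu_batch_size * num_gpus
```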
In epochs(), if the predicted cost after the next epoch exceeds cost_threshold, a custom exception ZeusCostThresholdExceededException is raised. In the original design, this case only logged a message and returned. We made this change because we want the master process to stop all the other processes.
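A minimal sketch of the new control flow. The exception name is from this PR; its message, the helper function, and the surrounding logic are illustrative assumptions.

```python
class ZeusCostThresholdExceededException(Exception):
    """Raised by the master process when the predicted cost after the
    next epoch exceeds cost_threshold, so all processes can be stopped."""

def check_cost_threshold(predicted_next_cost, cost_threshold):
    # Previously this case only logged and returned; raising instead lets
    # the caller unwind out of epochs() and terminate the other processes.
    if predicted_next_cost >= cost_threshold:
        raise ZeusCostThresholdExceededException(
            f"predicted cost {predicted_next_cost:.2f} >= "
            f"threshold {cost_threshold:.2f}"
        )
```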
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=1 # number of nodes (single node)
#SBATCH --ntasks-per-node=[NUM_OF_GPUS] # number of tasks per node (1 task per GPU)
#SBATCH --gres=gpu:[NUM_OF_GPUS] # number of GPUs reserved per node (here all the GPUs)
#SBATCH --mem=64GB
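The header above only reserves the resources; the job body would then launch one process per reserved GPU. A hypothetical launch line (the exact command and any train.py arguments are placeholders, not from this PR):

```bash
# Hypothetical: start one training process per reserved GPU with torchrun.
srun torchrun --standalone --nproc_per_node=[NUM_OF_GPUS] train.py
```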
Launching methods for train.py
torch.multiprocessing
torch.distributed.launch utility
torchrun

Closes #4