This pull request extends ZeusDataLoader from single-GPU to single-node multi-GPU, to support distributed data-parallel training.
Implementation
ZeusDataLoader
Add a new parameter distributed = None | "dp" to decide which distributed strategy will be used. "dp" represents data parallel.
If data parallel mode is turned on, get the process's local rank. A process with local_rank == 0 is the master process.
Centralized management. The master process takes charge of everything, including:
Spawn/kill zeus_monitor
Set GPU power limits when profiling
Compute the cost function, and predict the cost after the next epoch to decide whether to stop early
Load/write to files
Logging (almost all of it; a few GPU-specific log messages are emitted by each process itself)
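The master-only management above can be sketched as follows. This is a pure-Python illustration, not the PR's actual code: only the "local_rank == 0 is the master" rule comes from this PR, and the function name is made up.

```python
# Hypothetical sketch of centralized management in data-parallel mode.
# Only the master process (local_rank == 0) makes the early-stop decision;
# worker processes defer to whatever the master decides.
def should_continue(local_rank, predicted_next_cost, cost_threshold):
    if local_rank != 0:
        # Workers never evaluate the cost threshold themselves.
        return True
    # Master: predict the cost after the next epoch and early-stop
    # if it would exceed the threshold.
    return predicted_next_cost <= cost_threshold
```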
example/imagenet
Enable "dp" mode
args.distributed == True maps to "dp". The value of args.distributed is derived from local_rank and args.multiprocessing_distributed.
If using dist.launch, local_rank is passed in by launch.py.
If using slurm/torchrun, set local_rank according to an environment variable.
If using torch.multiprocessing, the user has to manually set --multiprocessing-distributed. (This is the original launching method used by PyTorch in this script.)
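A sketch of how local_rank could be resolved across the three launching methods. The helper name is hypothetical; LOCAL_RANK and SLURM_LOCALID are the environment variables actually set by torchrun and SLURM, respectively.

```python
import os

def resolve_local_rank(cli_local_rank=None):
    # torch.distributed.launch passes --local_rank as a CLI argument.
    if cli_local_rank is not None:
        return cli_local_rank
    # torchrun exports LOCAL_RANK; SLURM exports SLURM_LOCALID.
    for var in ("LOCAL_RANK", "SLURM_LOCALID"):
        if var in os.environ:
            return int(os.environ[var])
    # Fall back to single-process (non-distributed) behavior.
    return 0
```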
Comments marked # ZEUS indicate the parts related to Zeus; # DATA PARALLEL marks the parts related to data parallelism.
Important changes to the original framework
ZeusDataLoader:
warmup_secs and profile_secs are replaced with warmup_iters and profile_iters, changing the profile window from time-based to iteration-based.
train_batch_size is scaled because in examples/imagenet/train.py the actual batch_size passed to each ZeusDataLoader is batch_size / num_gpus. ZeusDataLoader scales it back up for consistency with file names (e.g., the power JSON).
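The scaling can be illustrated with a hypothetical helper; only the batch_size / num_gpus relationship is from this PR.

```python
def effective_train_batch_size(per_gpu_batch_size, num_gpus):
    # train.py hands each ZeusDataLoader batch_size / num_gpus, so the
    # loader multiplies back up to recover the global batch size used
    # in file names such as the power JSON.
    return per_gpu_batch_size * num_gpus
```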
In epochs(), if the predicted cost after the next epoch exceeds cost_threshold, a custom exception ZeusCostThresholdExceededException is raised. In the original design, this case only logged a message and returned. We made this change because we want the master process to stop all the other processes.
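A minimal sketch of the new control flow. The exception name is from this PR; its message, the helper function, and the surrounding logic are illustrative assumptions.

```python
class ZeusCostThresholdExceededException(Exception):
    """Raised by the master process when the predicted cost after the
    next epoch exceeds cost_threshold, so all processes can be stopped."""

def check_cost_threshold(predicted_next_cost, cost_threshold):
    # Previously this case only logged and returned; raising instead lets
    # the caller unwind out of epochs() and terminate the other processes.
    if predicted_next_cost >= cost_threshold:
        raise ZeusCostThresholdExceededException(
            f"predicted cost {predicted_next_cost:.2f} >= "
            f"threshold {cost_threshold:.2f}"
        )
```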
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=1 # number of nodes (single node)
#SBATCH --ntasks-per-node=[NUM_OF_GPUS] # number of tasks per node (1 task per GPU)
#SBATCH --gres=gpu:[NUM_OF_GPUS] # number of GPUs reserved per node (here all the GPUs)
#SBATCH --mem=64GB
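The header above only reserves the resources; the job body would then launch one process per reserved GPU. A hypothetical launch line (the exact command and any train.py arguments are placeholders, not from this PR):

```bash
# Hypothetical: start one training process per reserved GPU with torchrun.
srun torchrun --standalone --nproc_per_node=[NUM_OF_GPUS] train.py
```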
Launching methods for train.py
torch.multiprocessing
torch.distributed.launch utility
torchrun

Closes #4