
DDP/MP not yielding nontrivial speedup #31360

Open timtody opened 4 years ago

timtody commented 4 years ago

🐛 Bug

Following the tutorial at https://pytorch.org/tutorials/intermediate/ddp_tutorial.html I implemented a distributed policy gradient reinforcement learning algorithm. Using the script below I benchmarked 1000 steps on a simple gym environment and recorded the time per worker. Since I'm on a six-core machine, I was expecting a nontrivial speedup per global step, roughly on the order of num_processes, for 1 <= num_processes <= 6. This is not the case (see output below).

I'm not trying to do any learning here, just benchmarking. I observed the same behavior when using DDP, and even with the asynchronous (Hogwild-style) code from the example at https://pytorch.org/docs/stable/notes/multiprocessing.html .
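For reference, the DDP variant I tried followed the pattern from the linked tutorial. A minimal sketch of that pattern on CPU with the gloo backend (using a hypothetical nn.Linear as a stand-in for the actual policy network) looks roughly like this:

import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_worker(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = nn.Linear(16, 2)   # stand-in for the real policy network
    ddp_model = DDP(model)     # no device_ids when running on CPU
    for _ in range(1000):
        # backward() triggers DDP's gradient allreduce across workers
        ddp_model(torch.randn(8, 16)).sum().backward()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4
    mp.spawn(ddp_worker, args=(world_size,), nprocs=world_size, join=True)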

To Reproduce

Steps to reproduce the behavior:

  1. Run the script below with different values for num_processes:
import sys
import os
import tempfile
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from models import DNNPolicy, FCPolicy
import time
import gym

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    torch.manual_seed(42)

def cleanup():
    dist.destroy_process_group()

def main(rank, world_size):
    setup(rank, world_size)
    env = gym.make('CartPole-v0')
    policy = DNNPolicy(env)
    state = env.reset()
    loop_start = time.time()
    for i in range(1000):
        # Forward pass only; no loss or backward, purely for timing.
        policy(torch.randn(240, 256, 3))
        #time.sleep(0.005)
    loop_end = time.time()
    print("loop time worker:", rank, "nprocs:", world_size, loop_end - loop_start, "s")
    cleanup()

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)

if __name__ == "__main__":
    for n_procs in range(6):
        run_demo(main, n_procs+1)
  2. Results in:

loop time worker: 0 nprocs: 1 4.527710199356079 s
loop time worker: 0 nprocs: 2 5.222243547439575 s
loop time worker: 1 nprocs: 2 5.496668577194214 s
loop time worker: 2 nprocs: 3 7.741548776626587 s
loop time worker: 0 nprocs: 3 7.951193809509277 s
loop time worker: 1 nprocs: 3 8.345336198806763 s
loop time worker: 0 nprocs: 4 12.702503442764282 s
loop time worker: 1 nprocs: 4 13.029954195022583 s
loop time worker: 3 nprocs: 4 13.766679286956787 s
loop time worker: 2 nprocs: 4 13.906413793563843 s
loop time worker: 0 nprocs: 5 16.707841873168945 s
loop time worker: 3 nprocs: 5 17.413241863250732 s
loop time worker: 2 nprocs: 5 17.777271032333374 s
loop time worker: 4 nprocs: 5 17.782108306884766 s
loop time worker: 1 nprocs: 5 17.930848836898804 s
loop time worker: 0 nprocs: 6 17.15580153465271 s
loop time worker: 2 nprocs: 6 17.536866664886475 s
loop time worker: 3 nprocs: 6 18.087643146514893 s
loop time worker: 4 nprocs: 6 18.502392053604126 s
loop time worker: 5 nprocs: 6 18.63866925239563 s
loop time worker: 1 nprocs: 6 19.046801805496216 s

Expected behavior

Loop time staying roughly constant as the number of workers increases.

Environment

PyTorch version: 1.3.1
Is debug build: No
CUDA used to build PyTorch: 10.1.243

OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: Could not collect

Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip] numpy==1.17.4
[pip] torch==1.3.1
[conda] torch 1.3.1 pypi_0 pypi

Additional context

Replacing the call to policy with time.sleep(0.005) yields the timing behavior I was initially expecting. Why does the cost of calling policy go up as the number of workers increases?
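One way to check whether a single policy call is itself already multi-threaded (an assumption worth verifying, not something the output above confirms) is to compare per-process CPU time with wall-clock time around the loop in the script above, along these lines:

import time
import torch

loop_start_wall = time.time()
loop_start_cpu = time.process_time()   # CPU time summed over all threads of this process
for i in range(1000):
    policy(torch.randn(240, 256, 3))
loop_end_wall = time.time()
loop_end_cpu = time.process_time()
# If CPU time is well above wall time, the forward pass already spans
# several cores, so adding worker processes mainly oversubscribes the CPU.
print("wall:", loop_end_wall - loop_start_wall, "cpu:", loop_end_cpu - loop_start_cpu)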

cc @VitalyFedyunin @ngimel @mruberry

mruberry commented 4 years ago

@timtody It's hard to diagnose every model, but it's not clear you should expect a significant speedup in this case.

Can you look at CPU utilization when you're running with a single process vs. when you're running with multiple processes? Maybe the single process is using multiple cores.
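A quick way to test that hypothesis (a sketch, assuming the slowdown comes from intra-op thread oversubscription) is to pin each worker to a single intra-op thread at the top of main() and rerun the benchmark:

def main(rank, world_size):
    setup(rank, world_size)
    torch.set_num_threads(1)   # limit PyTorch's intra-op thread pool to one thread per worker
    print("worker", rank, "intra-op threads:", torch.get_num_threads())
    # ... rest of the benchmark loop unchanged ...

If loop times then stay roughly flat up to the core count, the single-process run was already saturating multiple cores.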