
DDP/MP not yielding nontrivial speedup #31360

Open timtody opened 4 years ago

timtody commented 4 years ago

🐛 Bug

Following the tutorial at https://pytorch.org/tutorials/intermediate/ddp_tutorial.html I implemented a distributed policy gradient reinforcement learning algorithm. Using the script below I benchmarked 1000 steps on a simple gym environment and recorded the time per worker. Since I'm on a six-core machine, I was expecting a nontrivial speedup per global step, roughly on the order of num_processes, for 1 <= num_processes <= 6. This is not the case (see output below).

I'm not trying to do any learning here, just benchmarking. I observed the same behavior when using DDP, and even with the asynchronous (Hogwild-style) code from the example at https://pytorch.org/docs/stable/notes/multiprocessing.html .
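For reference, the DDP variant I tried followed the pattern from the linked tutorial. A minimal sketch of that pattern on CPU with the gloo backend (using a hypothetical nn.Linear as a stand-in for the actual policy network) looks roughly like this:

import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_worker(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = nn.Linear(16, 2)   # stand-in for the real policy network
    ddp_model = DDP(model)     # no device_ids when running on CPU
    for _ in range(1000):
        # backward() triggers DDP's gradient allreduce across workers
        ddp_model(torch.randn(8, 16)).sum().backward()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4
    mp.spawn(ddp_worker, args=(world_size,), nprocs=world_size, join=True)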

To Reproduce

Steps to reproduce the behavior:

  1. Run the script below with different values for num_processes:
import sys
import os
import tempfile
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from models import DNNPolicy, FCPolicy
import time
import gym

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    torch.manual_seed(42)

def cleanup():
    dist.destroy_process_group()

def main(rank, world_size):
    setup(rank, world_size)
    env = gym.make('CartPole-v0')
    policy = DNNPolicy(env)
    state = env.reset()
    loop_start = time.time()
    for i in range(1000):
        # Forward pass only; no loss or backward, purely for timing.
        policy(torch.randn(240, 256, 3))
        #time.sleep(0.005)
    loop_end = time.time()
    print("loop time worker:", rank, "nprocs:", world_size, loop_end - loop_start, "s")
    cleanup()

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)

if __name__ == "__main__":
    for n_procs in range(6):
        run_demo(main, n_procs+1)
  2. Results in:

loop time worker: 0 nprocs: 1 4.527710199356079 s
loop time worker: 0 nprocs: 2 5.222243547439575 s
loop time worker: 1 nprocs: 2 5.496668577194214 s
loop time worker: 2 nprocs: 3 7.741548776626587 s
loop time worker: 0 nprocs: 3 7.951193809509277 s
loop time worker: 1 nprocs: 3 8.345336198806763 s
loop time worker: 0 nprocs: 4 12.702503442764282 s
loop time worker: 1 nprocs: 4 13.029954195022583 s
loop time worker: 3 nprocs: 4 13.766679286956787 s
loop time worker: 2 nprocs: 4 13.906413793563843 s
loop time worker: 0 nprocs: 5 16.707841873168945 s
loop time worker: 3 nprocs: 5 17.413241863250732 s
loop time worker: 2 nprocs: 5 17.777271032333374 s
loop time worker: 4 nprocs: 5 17.782108306884766 s
loop time worker: 1 nprocs: 5 17.930848836898804 s
loop time worker: 0 nprocs: 6 17.15580153465271 s
loop time worker: 2 nprocs: 6 17.536866664886475 s
loop time worker: 3 nprocs: 6 18.087643146514893 s
loop time worker: 4 nprocs: 6 18.502392053604126 s
loop time worker: 5 nprocs: 6 18.63866925239563 s
loop time worker: 1 nprocs: 6 19.046801805496216 s

Expected behavior

Loop time staying roughly constant as the number of workers increases.

Environment

PyTorch version: 1.3.1
Is debug build: No
CUDA used to build PyTorch: 10.1.243

OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: Could not collect

Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip] numpy==1.17.4
[pip] torch==1.3.1
[conda] torch 1.3.1 pypi_0 pypi

Additional context

Replacing the call to policy with time.sleep(0.005) yields the timing behavior I was initially expecting. Why does the cost of calling policy go up as the number of workers increases?
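One way to check whether a single policy call is itself already multi-threaded (an assumption worth verifying, not something the output above confirms) is to compare per-process CPU time with wall-clock time around the loop in the script above, along these lines:

import time
import torch

loop_start_wall = time.time()
loop_start_cpu = time.process_time()   # CPU time summed over all threads of this process
for i in range(1000):
    policy(torch.randn(240, 256, 3))
loop_end_wall = time.time()
loop_end_cpu = time.process_time()
# If CPU time is well above wall time, the forward pass already spans
# several cores, so adding worker processes mainly oversubscribes the CPU.
print("wall:", loop_end_wall - loop_start_wall, "cpu:", loop_end_cpu - loop_start_cpu)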

cc @VitalyFedyunin @ngimel @mruberry

mruberry commented 4 years ago

@timtody It's hard to diagnose every model, but it's not clear you should expect a significant speedup in this case.

Can you look at CPU utilization when you're running with a single process vs. when you're running with multiple processes? Maybe the single process is using multiple cores.
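A quick way to test that hypothesis (a sketch, assuming the slowdown comes from intra-op thread oversubscription) is to pin each worker to a single intra-op thread at the top of main() and rerun the benchmark:

def main(rank, world_size):
    setup(rank, world_size)
    torch.set_num_threads(1)   # limit PyTorch's intra-op thread pool to one thread per worker
    print("worker", rank, "intra-op threads:", torch.get_num_threads())
    # ... rest of the benchmark loop unchanged ...

If loop times then stay roughly flat up to the core count, the single-process run was already saturating multiple cores.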