timtody opened this issue 4 years ago
@timtody It's hard to diagnose every model, but it's not clear you should expect a significant speedup in this case.
Can you look at CPU utilization when you're running with a single process vs. when you're running with multiple processes? Maybe the single process is using multiple cores.
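One quick way to run that check is to sample per-core utilization while the benchmark is running. This is just a sketch and assumes the third-party psutil package is installed; any system monitor such as htop would show the same thing:

```python
import psutil

# Sample per-core utilization over one second while the benchmark runs
# in another terminal. If a single-worker run already shows load on all
# six cores, PyTorch's intra-op parallelism is using them.
print(psutil.cpu_percent(interval=1.0, percpu=True))
```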
🐛 Bug
Following the tutorial at https://pytorch.org/tutorials/intermediate/ddp_tutorial.html, I implemented a distributed policy gradient reinforcement learning algorithm. Using the script below, I benchmarked 1000 steps on a simple gym environment and recorded the time per worker. Since I'm on a six-core machine, I was expecting a nontrivial per-global-step speedup, scaling roughly with num_processes for 1 <= num_processes <= 6. This is not the case (see output below).
I'm not trying to do any learning here, just benchmarking. I observed the same behavior when using DDP, and even with asynchronous code based on the Hogwild example at https://pytorch.org/docs/stable/notes/multiprocessing.html.
To Reproduce
Steps to reproduce the behavior:
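The original script is referenced above but not preserved in this excerpt. The following is a minimal sketch of the kind of harness described: the Policy module, the dummy 4-dimensional observation, and the forward-only loop are illustrative assumptions, not the author's actual code, which used a gym environment and a policy gradient setup.

```python
import time

import torch
import torch.nn as nn
import torch.multiprocessing as mp


class Policy(nn.Module):
    # Hypothetical small MLP policy; the original architecture is not shown.
    def __init__(self, obs_dim=4, act_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.Tanh(),
            nn.Linear(64, act_dim),
        )

    def forward(self, x):
        return self.net(x)


def worker(rank, nprocs, nsteps=1000):
    policy = Policy()
    obs = torch.randn(1, 4)  # stand-in for a gym observation
    start = time.time()
    for _ in range(nsteps):
        _ = policy(obs)  # the call whose cost grows with nprocs
    print(f"loop time worker: {rank} nprocs: {nprocs} {time.time() - start} s")


if __name__ == "__main__":
    for nprocs in range(1, 7):
        mp.spawn(worker, args=(nprocs,), nprocs=nprocs, join=True)
```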
loop time worker: 0 nprocs: 1 4.527710199356079 s
loop time worker: 0 nprocs: 2 5.222243547439575 s
loop time worker: 1 nprocs: 2 5.496668577194214 s
loop time worker: 2 nprocs: 3 7.741548776626587 s
loop time worker: 0 nprocs: 3 7.951193809509277 s
loop time worker: 1 nprocs: 3 8.345336198806763 s
loop time worker: 0 nprocs: 4 12.702503442764282 s
loop time worker: 1 nprocs: 4 13.029954195022583 s
loop time worker: 3 nprocs: 4 13.766679286956787 s
loop time worker: 2 nprocs: 4 13.906413793563843 s
loop time worker: 0 nprocs: 5 16.707841873168945 s
loop time worker: 3 nprocs: 5 17.413241863250732 s
loop time worker: 2 nprocs: 5 17.777271032333374 s
loop time worker: 4 nprocs: 5 17.782108306884766 s
loop time worker: 1 nprocs: 5 17.930848836898804 s
loop time worker: 0 nprocs: 6 17.15580153465271 s
loop time worker: 2 nprocs: 6 17.536866664886475 s
loop time worker: 3 nprocs: 6 18.087643146514893 s
loop time worker: 4 nprocs: 6 18.502392053604126 s
loop time worker: 5 nprocs: 6 18.63866925239563 s
loop time worker: 1 nprocs: 6 19.046801805496216 s
Expected behavior
Loop time should stay roughly constant as the number of workers increases.
Environment
PyTorch version: 1.3.1
Is debug build: No
CUDA used to build PyTorch: 10.1.243

OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: Could not collect

Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip] numpy==1.17.4
[pip] torch==1.3.1
[conda] torch 1.3.1 pypi_0 pypi
Additional context
Replacing the call to policy with time.sleep(0.005) yields the timing behavior I was initially expecting. Why does the cost of calling policy go up as the number of workers increases?
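A likely factor, not confirmed in this thread but consistent with the comment above about CPU utilization: by default, each PyTorch process creates an intra-op thread pool sized to the machine's cores, so a single process can already keep all six cores busy during the forward pass, and additional processes then oversubscribe them. This also explains why time.sleep scales as expected: sleeping consumes no cores. A sketch of how to check and limit this:

```python
import torch

# Often equals the number of physical cores by default, meaning even
# one process can saturate the machine during a forward pass.
print(torch.get_num_threads())

# Called at the top of each spawned worker, this keeps nprocs
# single-threaded processes from oversubscribing the six cores.
torch.set_num_threads(1)
```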
cc @VitalyFedyunin @ngimel @mruberry