msr-fiddle / pipedream


Actual results did not match the optimizer expectation #50

Open nirandaperera opened 4 years ago

nirandaperera commented 4 years ago

Hi,

I have been running resnet101 with batch size 64 on a straight pipeline with 4 GPUs.

I ran the following commands for the profiler and the optimizer.

CUDA_VISIBLE_DEVICES=4 python main.py -a "resnet101"  -b 64 --data_dir "$HOME/data/imagenet-mini/" --profile_directory "profiles1/64"

python optimizer_graph_hierarchical.py -f "../profiler/image_classification/profiles1/64/resnet101/graph.txt" -n 4 -s 11000000000 --straight_pipeline -o "./optim/64/resnet101/gpus=4_straight" -b 2500000000 --use_memory_constraint
python convert_graph_to_model.py -f "./optim/64/resnet101/gpus=4_straight/gpus=4.txt" -n resnet101 -a resnet101 -o "./optim/64/resnet101/gpus=4_straight/"

On the optimizer step, I get the following output.

Time taken by single-stage pipeline: 2.0447990000000003
Time per stage in pipeline: 0.5839989999999989
Throughput increase (compared to single machine): 3.5013741461886134
[Note that single-machine and (4)-machine DP might not fit given memory constraints]
Throughput increase of (4)-machine DP compared to single machine: 3.62130052703585
Throughput increase (compared to (4)-machine DP): 0.966883063155932
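
For reference, the predicted speedup follows directly from the two timing lines above: it is the single-stage pipeline time divided by the time of the slowest stage. A minimal sketch using the numbers from the log:

```python
# Reproduce the optimizer's predicted speedups from the figures in its output above.
single_stage_time = 2.0447990000000003   # "Time taken by single-stage pipeline" (s)
time_per_stage    = 0.5839989999999989   # "Time per stage in pipeline" (s)
dp_speedup        = 3.62130052703585     # "Throughput increase of (4)-machine DP"

pipeline_speedup = single_stage_time / time_per_stage
print(pipeline_speedup)                  # ~3.5014, the reported throughput increase
print(pipeline_speedup / dp_speedup)     # ~0.9669, the reported increase vs. (4)-machine DP
```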

So my expectation was that the straight-pipeline timings would be roughly similar to the DP timings.

But the experimental results for the pipeline were drastically different, while the data-parallel results matched the prediction almost perfectly.

        model  batch     conf         mean  speed_up
21  resnet101     64   1_conf  1098.136000  1.000000
22  resnet101     64  mp_conf   770.499250  1.425227
23  resnet101     64  dp_conf   304.383375  3.607740
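
The speed_up column here is consistent with the ratio of the 1_conf mean time to each configuration's mean time; a minimal sketch of that calculation, using the values from the table above:

```python
# Sketch: the measured speed_up column is baseline_mean / conf_mean,
# using the mean times reported in the table above.
means = {"1_conf": 1098.136000, "mp_conf": 770.499250, "dp_conf": 304.383375}

baseline = means["1_conf"]
for conf, mean in means.items():
    print(conf, baseline / mean)
# mp_conf -> ~1.43 measured, far below the predicted ~3.50 pipeline speedup
# dp_conf -> ~3.61 measured, matching the predicted ~3.62 DP speedup
```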

I'd be very grateful if you could help me figure out this discrepancy.

I have drawn Gantt charts for the pipeline (https://drive.google.com/file/d/1l8UcafF1CIVmUgOcmdRLztpFZcB0G7dp/view?usp=sharing) and for data parallel (https://drive.google.com/file/d/1l8UcafF1CIVmUgOcmdRLztpFZcB0G7dp/view?usp=sharing). Each horizontal bar represents the time period from the start to the end of a forward (annotated with +) or backward (annotated with -) pass.
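
For anyone who wants to reproduce such a chart, a minimal matplotlib sketch is below; the stage_events intervals are hypothetical placeholders for the fwd/bwd timestamps parsed from the runtime logs.

```python
import matplotlib.pyplot as plt

# Hypothetical (start, duration) intervals in seconds for each stage; replace with
# the forward/backward timestamps parsed from the runtime logs.
stage_events = {
    0: {"fwd": [(0.00, 0.05), (0.20, 0.05)], "bwd": [(0.60, 0.10)]},
    1: {"fwd": [(0.10, 0.05)],               "bwd": [(0.45, 0.10)]},
}

fig, ax = plt.subplots()
for stage, events in stage_events.items():
    # One row per stage: green bars for forward (+), red bars for backward (-).
    ax.broken_barh(events["fwd"], (stage - 0.4, 0.35), facecolors="tab:green")
    ax.broken_barh(events["bwd"], (stage + 0.05, 0.35), facecolors="tab:red")
ax.set_xlabel("time (s)")
ax.set_ylabel("pipeline stage")
ax.set_yticks(list(stage_events.keys()))
fig.savefig("gantt.png")
```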

It looks to me like each stage is stalled on communication for a considerable period of time.
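
As a rough sanity check on the communication cost, here is a hedged back-of-the-envelope estimate; the activation shape is a hypothetical example of a mid-network ResNet split point, and the bandwidth is the value passed to the optimizer via -b:

```python
# Rough per-minibatch transfer time across one stage boundary (one direction).
# The activation shape is a hypothetical example; 2.5e9 B/s is the "-b 2500000000" value.
batch, channels, height, width = 64, 512, 28, 28
bytes_per_element = 4  # fp32
activation_bytes = batch * channels * height * width * bytes_per_element  # ~103 MB

bandwidth_bytes_per_s = 2.5e9
print(activation_bytes / bandwidth_bytes_per_s)  # ~0.041 s, before any Gloo overhead
```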

deepakn94 commented 4 years ago

The main reason is that the Gloo peer-to-peer communication primitives are not well optimized. I am hopeful that this problem will at least partially go away when the PyTorch folks upstream the NCCL send and recv primitives, and we can potentially switch to using NCCL throughout.
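
For reference, a minimal point-to-point sketch of what that switch would involve is below (this is not PipeDream's runtime code); with backend="gloo" the same calls already work on CPU tensors, and once NCCL send/recv is upstreamed, backend="nccl" would allow the transfers to stay on the GPU:

```python
import torch
import torch.distributed as dist

# Launch with two processes, e.g.:
#   python -m torch.distributed.launch --nproc_per_node=2 p2p_sketch.py
# backend="gloo" works on CPU tensors today; once NCCL send/recv is upstreamed,
# backend="nccl" would let the same calls operate directly on CUDA tensors.
dist.init_process_group(backend="gloo")
rank = dist.get_rank()

tensor = torch.zeros(64, 512, 28, 28)  # hypothetical activation-sized buffer
if rank == 0:
    tensor += 1.0
    dist.send(tensor, dst=1)   # blocking send to the next stage
else:
    dist.recv(tensor, src=0)   # blocking receive from the previous stage
```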