pytorch / PiPPy

Pipeline Parallelism for PyTorch
BSD 3-Clause "New" or "Revised" License

ResNet example always underfitting when pippy training #864

Open hakob-petro opened 1 year ago

hakob-petro commented 1 year ago

I'm experimenting with the pipelined ResNet training example (pippy_resnet.py) from https://github.com/pytorch/PiPPy/tree/main/examples/resnet. Specifically, I want to compare the loss when running locally on a single GPU versus when running with pippy. I added some basic WandB monitoring.

When running locally, everything is fine: the loss clearly drops and the accuracy increases with each epoch. But as soon as I switch to the pippy example, everything changes dramatically: the model simply does not train, the loss does not decrease, and the accuracy stays around 0.1-0.2.

I would be very grateful if someone could explain why this is happening or what I'm doing wrong.
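
For context, the WandB monitoring I added is nothing more than per-epoch metric logging, roughly like the sketch below (the project name, function name, and metric keys here are just illustrative, not the exact code from my script):

import wandb

# Rough sketch of the monitoring I added (illustrative only, the actual
# script differs slightly). One point per epoch and per split, so the
# local and pippy runs can be compared side by side in WandB.
wandb.init(project="pippy-resnet", config={"epochs": 3})

def log_epoch(epoch: int, split: str, loss: float, accuracy: float):
    wandb.log({
        "epoch": epoch,
        f"{split}/loss": loss,
        f"{split}/accuracy": accuracy,
    })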

A piece of logs from the terminal from [local training]

Using device: cuda
Files already downloaded and verified
Epoch: 1
100%|█████████████████████████████████████| 500/500 [02:36<00:00,  3.20it/s]
Loader: train. Accuracy: 0.44814
100%|█████████████████████████████████████| 100/100 [00:10<00:00,  9.73it/s]
Loader: valid. Accuracy: 0.5681
Epoch: 2
100%|█████████████████████████████████████| 500/500 [02:36<00:00,  3.19it/s]
Loader: train. Accuracy: 0.61832
100%|█████████████████████████████████████| 100/100 [00:10<00:00,  9.20it/s]
Loader: valid. Accuracy: 0.622
Epoch: 3
100%|█████████████████████████████████████| 500/500 [02:44<00:00,  3.04it/s]
Loader: train. Accuracy: 0.69844
100%|█████████████████████████████████████| 100/100 [00:10<00:00,  9.20it/s]
Loader: valid. Accuracy: 0.6618

A piece of logs from the terminal from [pippy training]

Epoch: 1
100%|███████████████████████████████████| 1250/1250 [02:34<00:00,  8.07it/s]
Loader: train. Accuracy: 0.1428
100%|█████████████████████████████████████| 250/250 [00:14<00:00, 17.49it/s]
Loader: valid. Accuracy: 0.106
Epoch: 2
100%|███████████████████████████████████| 1250/1250 [02:50<00:00,  7.32it/s]
Loader: train. Accuracy: 0.14936
100%|█████████████████████████████████████| 250/250 [00:19<00:00, 13.12it/s]
Loader: valid. Accuracy: 0.1333
Epoch: 3
100%|███████████████████████████████████| 1250/1250 [03:03<00:00,  6.83it/s]
Loader: train. Accuracy: 0.13552
100%|█████████████████████████████████████| 250/250 [00:19<00:00, 12.51it/s]
Loader: valid. Accuracy: 0.1225
hpc-unex commented 1 year ago

How are you running the pippy training example?

hakob-petro commented 1 year ago

Hey @hpc-unex, I am using torchrun, namely

# On the first node:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=MASTER_IP:MASTER_PORT pippy_resnet.py

# On the second node:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=MASTER_IP:MASTER_PORT pippy_resnet.py

P.S.: I have two machines, each with 4 NVIDIA T4 GPUs
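
In case it matters, this is how I understand the launch resolves on each process (a sketch of standard torchrun environment handling, not taken from pippy_resnet.py itself, so treat the details as my assumption):

import os

import torch
import torch.distributed as dist

# torchrun exports RANK, LOCAL_RANK and WORLD_SIZE for every process,
# so with --nnodes=2 --nproc_per_node=4 there are 8 ranks in total.
local_rank = int(os.environ["LOCAL_RANK"])   # 0..3 on each node
rank = int(os.environ["RANK"])               # 0..7 across both nodes
world_size = int(os.environ["WORLD_SIZE"])   # 8

torch.cuda.set_device(local_rank)            # one T4 per rank
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)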

hpc-unex commented 1 year ago

I'm running it as in the example, calling it with sbatch -> srun -> python, and I'm not hitting this issue: training on CPU works correctly. On the other hand, with GPU training I get RPC problems:

1: [W tensorpipe_agent.cpp:726] RPC agent for worker1 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
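
In case it helps narrow things down, my guess is that the GPU path goes through TensorPipe RPC and needs an explicit device map between workers; a minimal sketch of what I mean (worker names, ranks, and the device map are my assumptions, not code from the example):

import torch.distributed.rpc as rpc

# Sketch only: TensorPipe needs a device map so CUDA tensors can be sent
# between workers; without one, GPU runs can fail while CPU runs work.
options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=16)
options.set_device_map("worker1", {0: 0})  # this worker's cuda:0 -> worker1's cuda:0

rpc.init_rpc(
    name="worker0",
    rank=0,
    world_size=2,
    rpc_backend_options=options,
)
rpc.shutdown()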

hpc-unex commented 1 year ago

Also, I have the same problem as you for pippy training. Did you find a solution?

hakob-petro commented 1 year ago

> Also, I have the same problem as you for pippy training. Did you find a solution?

Hi @hpc-unex, unfortunately not yet.