hakob-petro opened this issue 1 year ago
How are you running the pippy training example?
Hey @hpc-unex, I am using `torchrun`, namely:

```shell
# On the first node:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=MASTER_IP:MASTER_PORT pippy_resnet.py
# On the second node:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=MASTER_IP:MASTER_PORT pippy_resnet.py
```
P.S.: I have two machines, each with 4 NVIDIA T4 GPUs.
I'm running it as in the example, launching via sbatch -> srun -> python. The launch setup itself doesn't seem to be the issue, since training on CPU works correctly. With GPU training, however, I get RPC problems:

```
1: [W tensorpipe_agent.cpp:726] RPC agent for worker1 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
```
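Not a fix confirmed in this thread, but since the error originates in a TensorPipe transport, a common first thing to try is pinning the network interface that TensorPipe (RPC) and Gloo should use, so they don't pick a wrong or unreachable one. The interface name `eth0` below is a placeholder; substitute the NIC that connects your two nodes (check with `ip addr`):

```shell
# Pin the NIC used by the TensorPipe RPC agent and by Gloo.
# "eth0" is a placeholder; replace with your cluster's interface.
export TP_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
```

With Slurm, these exports need to happen on every node, e.g. inside the script that `srun` executes rather than only in the `sbatch` wrapper.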
Also, I have the same problem as you for pippy training. Did you find a solution?
Hi @hpc-unex, unfortunately, not yet.
I'm experimenting with the pipelined training example of ResNet (`pippy_resnet.py`) from https://github.com/pytorch/PiPPy/tree/main/examples/resnet. Namely, I want to compare the loss when running locally on one GPU against running with `pippy`. I added some basic WandB monitoring. When running locally, everything is fine: the loss clearly drops and the accuracy increases with each new epoch. But as soon as I switch to the `pippy` example, everything changes dramatically. The model simply does not train: the loss does not fall, and the accuracy stays around 0.1-0.2. I would be very grateful if someone could explain why this is happening or what I'm doing wrong.
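One thing worth checking (an assumption on my side, not something established in this thread): in pipeline parallelism, only the last stage computes the final output and the loss, so metrics logged from rank 0 (or from every rank) can look like the model isn't training. A minimal sketch of guarding the logging, with `wandb.log` replaced by a stand-in `log_fn`; in the real script `rank` and `world_size` would come from `torch.distributed`:

```python
# Sketch: only the rank owning the last pipeline stage should log metrics.
# Assumes one stage per rank, so the last stage lives on rank world_size - 1.

def is_last_stage(rank: int, world_size: int) -> bool:
    """Return True for the rank that owns the final pipeline stage."""
    return rank == world_size - 1

def maybe_log(metrics: dict, rank: int, world_size: int, log_fn=print) -> bool:
    """Log metrics only on the last stage; return whether logging happened."""
    if is_last_stage(rank, world_size):
        log_fn(metrics)  # in the real script: wandb.log(metrics)
        return True
    return False

# Example: with 8 ranks (2 nodes x 4 GPUs), only rank 7 logs.
maybe_log({"loss": 2.3}, rank=7, world_size=8)  # logs
maybe_log({"loss": 2.3}, rank=0, world_size=8)  # skipped
```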
A piece of logs from the terminal from [local training]
A piece of logs from the terminal from [`pippy` training]