rednoah91 opened this issue 4 months ago
Hi, thanks for trying it out.
It seems the PyTorch install link you used is outdated; that's why you only got up to the dev20240605 version.
There are also some API changes between 0605 and our release version in torch 2.4.
Could you try this link instead if you want to use the nightly version?
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
Or you could install the release version if you want stability (remove the --pre flag from the command above).
Note: our library currently only supports CUDA devices, hence the cu121 in the link above. CPU support is not stable due to weak P2P support in Gloo.
cc: @wconstab @H-Huang
Thanks for the information.
I've tried pip install https://download.pytorch.org/whl/nightly/cpu/torch-2.4.0.dev20240612%2Bcpu-cp310-cp310-linux_x86_64.whl
on my Ubuntu machine and it works.
I ran pippy_gpt2.py and it passed, but pippy_bert.py failed with the following error (it looks like something is stuck until the timeout).
Maybe this echoes what you said about the Gloo support being weak(?). Do you have plans to complete the CPU support?
Pipeline stage 1 21M params
Pipeline stage 3 21M params
Pipeline stage 2 21M params
Pipeline stage 0 45M params
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 108, in <module>
[rank0]: run(args)
[rank0]: File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 72, in run
[rank0]: schedule.step(**inputs)
[rank0]: File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 326, in step
[rank0]: self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank0]: File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 384, in _step_microbatches
[rank0]: work.wait()
[rank0]: RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [127.0.1.1]:30124: Connection reset by peer
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 108, in <module>
[rank2]: run(args)
[rank2]: File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 74, in run
[rank2]: out = schedule.step()
[rank2]: File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 326, in step
[rank2]: self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank2]: File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 366, in _step_microbatches
[rank2]: work.wait()
[rank2]: RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.1.1]:30124
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 108, in <module>
[rank1]: run(args)
[rank1]: File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 74, in run
[rank1]: out = schedule.step()
[rank1]: File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 326, in step
[rank1]: self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank1]: File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 366, in _step_microbatches
[rank1]: work.wait()
[rank1]: RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 108, in <module>
[rank3]: run(args)
[rank3]: File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 74, in run
[rank3]: out = schedule.step()
[rank3]: File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 326, in step
[rank3]: self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank3]: File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 366, in _step_microbatches
[rank3]: work.wait()
[rank3]: RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:29850
Yeah, improving CPU support makes sense, though I believe there would be quite a bit of underlying work to do. If you could share your use case, maybe we can prioritize it better. Cc: @H-Huang @wconstab
Hi @kwen2501, sorry for the late reply. Our use case is running LLM inference across multiple CPU-based clusters. Could you tell me what is missing in Gloo? And how about MPI support for CPU?
Got it, thanks.
PyTorch c10d has a Python-level API, batch_isend_irecv, which executes multiple sends and recvs concurrently. Currently this API does not have a stable implementation on Gloo (it can hang). The pipelining library uses batch_isend_irecv as its underlying communication method.
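For context, batch_isend_irecv is used roughly like this: each rank queues P2POp entries and launches them as one batch. A self-contained two-rank sketch on the Gloo backend (the master address/port and tensor sizes here are arbitrary choices, not anything the library mandates); this ping-pong pattern is exactly the kind of exchange that can hang on Gloo:

```python
# Two-rank ping-pong via torch.distributed.batch_isend_irecv on Gloo.
# MASTER_ADDR/MASTER_PORT values and tensor shapes are arbitrary.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    peer = 1 - rank
    send_buf = torch.full((4,), float(rank))
    recv_buf = torch.empty(4)

    # Queue one send and one recv, then launch them as a single batch.
    ops = [
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ]
    for work in dist.batch_isend_irecv(ops):
        work.wait()

    print(f"rank {rank} received {recv_buf.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2, join=True)
```

With NCCL the batch is coalesced into one grouped launch; on Gloo the ops are issued individually, which is where the instability comes from.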
@kwen2501 Can I open another issue on the PyTorch GitHub to track the CPU Gloo hang?
If you can make a repro, go ahead and open the issue.
Hi,
I followed the instructions to install PyTorch for pipelining:
I have version 2.4.0.dev20240605+cpu installed. When I execute torchrun --nproc-per-node 4 pippy_gpt2.py, I get errors: it seems the arguments of pipeline are not matched. Did you ever encounter this? Which PyTorch version are you using?
Thanks,
Hong-Rong