pytorch / PiPPy

Pipeline Parallelism for PyTorch
BSD 3-Clause "New" or "Revised" License

`pipeline` arguments are not matched #1130

Open rednoah91 opened 4 months ago

rednoah91 commented 4 months ago

Hi,

I followed the instructions to install PyTorch for pipelining:

pip install -r requirements.txt --find-links https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html

I have version 2.4.0.dev20240605+cpu installed. When I run torchrun --nproc-per-node 4 pippy_gpt2.py, I get the following error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_gpt2.py", line 118, in <module>
[rank0]:     run(args)
[rank0]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_gpt2.py", line 50, in run
[rank0]:     pipe = pipeline(
[rank0]: TypeError: pipeline() got an unexpected keyword argument 'mb_args'

It seems the keyword arguments passed to pipeline do not match the installed API. Have you encountered this before? Which PyTorch version are you using?

Thanks, Hong-Rong
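
One way to confirm this kind of mismatch before running the example is to inspect the installed function's signature. The sketch below is only an illustration: `accepts_keyword` is a hypothetical helper, and `stand_in_old_api` with its `legacy_args` parameter is an invented stand-in, not the real signature of any PyTorch build.

```python
import inspect
from typing import Callable


def accepts_keyword(func: Callable, name: str) -> bool:
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params and params[name].kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all accepts any keyword name.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-in for an API that predates `mb_args`.
# The parameter names here are invented for illustration only.
def stand_in_old_api(module, legacy_args=()):
    return module


print(accepts_keyword(stand_in_old_api, "mb_args"))
print(accepts_keyword(stand_in_old_api, "legacy_args"))
```

With PyTorch installed, the same helper could presumably be pointed at `torch.distributed.pipelining.pipeline` to see whether the local build accepts `mb_args`.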

kwen2501 commented 4 months ago

Hi, thanks for trying it out. The PyTorch install link you used appears to be outdated, which is why you only got the dev20240605 build. There were some API changes between that build and the release version in torch 2.4. If you want to use a nightly version, could you try this link instead?

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121

Or, if you want a stable build, you can install the release version by removing the --pre flag from the command above.

Note: our library currently supports only CUDA devices, hence the cu121 in the link above. CPU support is not stable because of Gloo's weak P2P support.

cc: @wconstab @H-Huang

rednoah91 commented 4 months ago

Thanks for the information. I tried pip install https://download.pytorch.org/whl/nightly/cpu/torch-2.4.0.dev20240612%2Bcpu-cp310-cp310-linux_x86_64.whl on my Ubuntu machine and it works.
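
The two nightlies mentioned so far differ only in the date embedded in the version string, so a quick check of that date can flag a stale install. This is a rough sketch: `nightly_date` is a hypothetical helper, and `KNOWN_GOOD_NIGHTLY` is set to 20240612 only because that build is reported to work in this thread. The exact date of the API change is not stated here.

```python
import re
from typing import Optional


def nightly_date(version: str) -> Optional[int]:
    """Return the YYYYMMDD date of a nightly version string, or None otherwise."""
    match = re.search(r"\.dev(\d{8})", version)
    return int(match.group(1)) if match else None


# Assumption: taken from the build reported to work in this thread,
# not from any official statement about when the API changed.
KNOWN_GOOD_NIGHTLY = 20240612

for version in ("2.4.0.dev20240605+cpu", "2.4.0.dev20240612+cpu", "2.4.0"):
    date = nightly_date(version)
    if date is None:
        status = "not a nightly build"
    elif date < KNOWN_GOOD_NIGHTLY:
        status = "older than the nightly reported to work"
    else:
        status = "at least as new as the nightly reported to work"
    print(f"{version}: {status}")
```

In practice the string to check would come from `torch.__version__`.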

rednoah91 commented 4 months ago

> Note: our library currently only supports CUDA device, hence the cu121 in the above link. CPU support is not stable due to weak P2P support of Gloo.

pippy_gpt2.py ran successfully, but pippy_bert.py failed with the following error (it looks like something is stuck until it times out). Maybe this echoes what you said about Gloo support being weak? Do you have plans to improve CPU support?

Pipeline stage 1 21M params
Pipeline stage 3 21M params
Pipeline stage 2 21M params
Pipeline stage 0 45M params
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 108, in <module>
[rank0]:     run(args)
[rank0]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 72, in run
[rank0]:     schedule.step(**inputs)
[rank0]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 326, in step
[rank0]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank0]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 384, in _step_microbatches
[rank0]:     work.wait()
[rank0]: RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [127.0.1.1]:30124: Connection reset by peer
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 108, in <module>
[rank2]:     run(args)
[rank2]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 74, in run
[rank2]:     out = schedule.step()
[rank2]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 326, in step
[rank2]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank2]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 366, in _step_microbatches
[rank2]:     work.wait()
[rank2]: RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.1.1]:30124
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 108, in <module>
[rank1]:     run(args)
[rank1]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 74, in run
[rank1]:     out = schedule.step()
[rank1]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 326, in step
[rank1]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank1]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 366, in _step_microbatches
[rank1]:     work.wait()
[rank1]: RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 108, in <module>
[rank3]:     run(args)
[rank3]:   File "/home/hongrongh/PiPPy/examples/huggingface/pippy_bert.py", line 74, in run
[rank3]:     out = schedule.step()
[rank3]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 326, in step
[rank3]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank3]:   File "/home/hongrongh/PiPPy/myvenv/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 366, in _step_microbatches
[rank3]:     work.wait()
[rank3]: RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:29850

kwen2501 commented 4 months ago

Yeah, improving CPU support makes sense, though I believe it would take quite a bit of underlying work. If you could share your use case, that would help us prioritize it. Cc: @H-Huang @wconstab

rednoah91 commented 3 months ago

Hi @kwen2501, sorry for the late reply. Our use case is running LLM inference across multiple CPU-based clusters. Could you tell me what is missing in Gloo? And what about MPI support for CPU?

kwen2501 commented 3 months ago

Got it, thanks. PyTorch c10d has a Python-level API, batch_isend_irecv, which executes multiple sends and recvs concurrently. Currently this API does not have a stable implementation with Gloo (it can hang). The pipelining library uses batch_isend_irecv as its underlying communication method.
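
For readers unfamiliar with the call, here is a minimal sketch of the batch_isend_irecv pattern on the Gloo backend, with each rank sending to its next neighbor and receiving from its previous one. The ring layout, tensor sizes, and the helper names `ring_peers` and `exchange_with_neighbors` are assumptions made for illustration. This is not the pipelining library's actual code, and it may or may not reproduce the hang.

```python
import os


def ring_peers(rank: int, world_size: int) -> tuple[int, int]:
    """Return (next_rank, prev_rank) for a ring of `world_size` ranks."""
    return (rank + 1) % world_size, (rank - 1) % world_size


def exchange_with_neighbors() -> None:
    # Imported here so ring_peers stays usable without torch installed.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")  # CPU backend discussed in this thread
    rank, world_size = dist.get_rank(), dist.get_world_size()
    next_rank, prev_rank = ring_peers(rank, world_size)

    send_buf = torch.full((4,), float(rank))
    recv_buf = torch.empty(4)

    # Describe one send and one receive, then issue them as a single batch.
    ops = [
        dist.P2POp(dist.isend, send_buf, next_rank),
        dist.P2POp(dist.irecv, recv_buf, prev_rank),
    ]
    for work in dist.batch_isend_irecv(ops):
        work.wait()  # the tracebacks above fail inside a wait like this one

    print(f"rank {rank} received {recv_buf.tolist()} from rank {prev_rank}")
    dist.destroy_process_group()


# torchrun sets RANK, so the exchange only runs when launched that way.
if "RANK" in os.environ:
    exchange_with_neighbors()
```

Saved under a hypothetical name such as p2p_sketch.py, it could be launched with torchrun --nproc-per-node 4 p2p_sketch.py, mirroring the command used for the examples above.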

rednoah91 commented 3 months ago

@kwen2501 Can I open another issue in the PyTorch GitHub repository to track the CPU Gloo hang?

wconstab commented 3 months ago

If you can make a repro, go ahead and open the issue.