pytorch / PiPPy

Pipeline Parallelism for PyTorch
BSD 3-Clause "New" or "Revised" License

[Question] Is the current implementation efficient? #1144

Closed jq-wei closed 2 hours ago

jq-wei commented 2 hours ago

Hi,

I have a question about the order of cutting the model.

In pippy_llama.py, the model is first replicated in full on every device, and only then cut into stages. That doesn't really solve the problem of a model that cannot fit on a single device, right? A more memory-efficient approach would be to load the model onto, say, the CPU, partition it there, and then move only each partition to its device.
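To make the proposal concrete, here is a minimal sketch in plain PyTorch (not PiPPy's actual API — the model, the `local_stage` helper, and the even layer split are all illustrative assumptions): the full model lives only on the CPU, and each pipeline rank moves just its own contiguous slice of layers to its device.

```python
# Hypothetical sketch: CPU-side partitioning, then per-rank device placement.
# None of these names come from PiPPy; they only illustrate the idea.
import torch
import torch.nn as nn


def make_model() -> nn.Sequential:
    # Stand-in for a large model; in practice this would be the real
    # checkpoint loaded on CPU (or on the meta device).
    return nn.Sequential(*[nn.Linear(16, 16) for _ in range(8)])


def local_stage(model: nn.Sequential, rank: int, world_size: int) -> nn.Sequential:
    # Evenly split the layer list into `world_size` contiguous stages
    # and return only the slice owned by `rank`.
    layers = list(model.children())
    per_stage = (len(layers) + world_size - 1) // world_size
    start = rank * per_stage
    return nn.Sequential(*layers[start:start + per_stage])


if __name__ == "__main__":
    world_size = 4
    full = make_model()                 # full copy exists on CPU only
    rank = 1                            # this process's pipeline rank
    stage = local_stage(full, rank, world_size)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    stage.to(device)                    # only this rank's 2 of 8 layers move
    print(len(list(stage.children())))  # → 2
```

With this ordering, peak per-device memory is roughly one stage plus activations, instead of a full model copy per device.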

Let me know if my understanding is correct, and whether this is how it is implemented in the other examples.

Thanks!

jq-wei commented 2 hours ago

I found `cpu_init`.