pytorch / PiPPy

Pipeline Parallelism for PyTorch
BSD 3-Clause "New" or "Revised" License

[Question] Is the current implementation efficient? #1144

Closed jq-wei closed 2 hours ago

jq-wei commented 2 hours ago

Hi,

I have a question about the order of cutting the model.

In pippy_llama.py, the model is first replicated in full on every device, and only then cut into stages. That doesn't really solve the problem of a model that cannot fit on a single device, right? A more memory-efficient approach would be to load the model onto, say, the CPU, partition it there, and then move only each partition to its device.
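To make the proposal concrete, here is a minimal sketch in plain PyTorch (not PiPPy's actual API — the model, the `local_stage` helper, and the even layer split are all illustrative assumptions): the full model lives only on the CPU, and each pipeline rank moves just its own contiguous slice of layers to its device.

```python
# Hypothetical sketch: CPU-side partitioning, then per-rank device placement.
# None of these names come from PiPPy; they only illustrate the idea.
import torch
import torch.nn as nn


def make_model() -> nn.Sequential:
    # Stand-in for a large model; in practice this would be the real
    # checkpoint loaded on CPU (or on the meta device).
    return nn.Sequential(*[nn.Linear(16, 16) for _ in range(8)])


def local_stage(model: nn.Sequential, rank: int, world_size: int) -> nn.Sequential:
    # Evenly split the layer list into `world_size` contiguous stages
    # and return only the slice owned by `rank`.
    layers = list(model.children())
    per_stage = (len(layers) + world_size - 1) // world_size
    start = rank * per_stage
    return nn.Sequential(*layers[start:start + per_stage])


if __name__ == "__main__":
    world_size = 4
    full = make_model()                 # full copy exists on CPU only
    rank = 1                            # this process's pipeline rank
    stage = local_stage(full, rank, world_size)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    stage.to(device)                    # only this rank's 2 of 8 layers move
    print(len(list(stage.children())))  # → 2
```

With this ordering, peak per-device memory is roughly one stage plus activations, instead of a full model copy per device.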

Let me know if my understanding is correct, and whether this is how it is implemented in the other examples.

Thanks!

jq-wei commented 2 hours ago

I found `cpu_init`.