pytorch / torchtitan

A native PyTorch Library for large model training

Granular layer selection during Pipeline Parallelism #598

Open bhuvan777 opened 1 month ago

bhuvan777 commented 1 month ago

When configuring pipeline splitting by specifying exact layers in the config (--experimental.pipeline_parallel_split_points), we are unable to assign sub-layers (e.g., layer.4.attn.qvw) as split points. If we attempt to do so, all layers are allocated to device rank 0, leaving device rank 1 without any layers (assuming we are using 2 devices). To make splitting work, we have to specify a whole layer (e.g., layer.4).
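For reference, here is a minimal standalone sketch (not torchtitan code; the toy model and module names are made up) of the block-level granularity being discussed, expressed against the tracer-based frontend in torch.distributed.pipelining, which accepts the same kind of fully qualified module names as split points. The exact torchtitan plumbing and API surface may differ by PyTorch version.

```python
import torch
import torch.nn as nn
from torch.distributed.pipelining import SplitPoint, pipeline


class ToyModel(nn.Module):
    """Illustrative stand-in for a Transformer: a stack of "layers"."""

    def __init__(self, dim: int = 32, n_layers: int = 8):
        super().__init__()
        # FQNs like "layers.4" below refer to entries of this ModuleList.
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x


# Block-level split points: a new pipeline stage begins at each named module.
split_points = ["layers.4"]  # e.g., 2 stages across 2 devices
split_spec = {fqn: SplitPoint.BEGINNING for fqn in split_points}

model = ToyModel()
example_input = torch.randn(8, 32)
pipe = pipeline(model, mb_args=(example_input,), split_spec=split_spec)

print(pipe.num_stages)           # expected: 2
print(pipe.get_stage_module(0))  # expected: layers 0-3
print(pipe.get_stage_module(1))  # expected: layers 4-7
```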

Is there a specific reason for this limitation, or could we expect support for more granular, sub-layer assignment (e.g., layer.4.attn.qvw) in future updates?

tianyu-l commented 1 month ago

cc: @H-Huang @wconstab

H-Huang commented 1 month ago

Hi @bhuvan777, right, we currently only support splitting at the Transformer block level. Splitting at a more granular level is possible but would require more code and make the pipeline parallel logic more complex. When we split at the block level, it is simple to determine which input/output activations need to be exchanged via send/recv during pipeline parallelism.
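To make that concrete, here is a toy sketch (not torchtitan code; ToyBlock and StageModule are made-up names for illustration): when the split lands on a block boundary, each stage's forward is one hidden-states tensor in, one hidden-states tensor out, so only a single activation has to cross ranks.

```python
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Stand-in for a Transformer block: hidden states -> hidden states."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        n = self.norm1(h)
        a, _ = self.attn(n, n, n, need_weights=False)
        h = h + a
        return h + self.mlp(self.norm2(h))


class StageModule(nn.Module):
    """A pipeline stage owning a contiguous slice of blocks.

    Its forward signature is Tensor -> Tensor, so the only activation
    exchanged between ranks is the hidden states at a block boundary.
    """

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            h = block(h)
        return h


if __name__ == "__main__":
    dim, n_blocks = 64, 8
    blocks = nn.ModuleList(ToyBlock(dim) for _ in range(n_blocks))

    # Split at the boundary of block 4 (the analogue of a "layer.4" split
    # point): blocks 0-3 go to stage/rank 0, blocks 4-7 to stage/rank 1.
    stage0 = StageModule(blocks[:4])
    stage1 = StageModule(blocks[4:])

    x = torch.randn(2, 16, dim)
    h = stage0(x)     # on rank 0, `h` would be sent to rank 1
    out = stage1(h)   # on rank 1, `h` would be received from rank 0
    print(out.shape)  # torch.Size([2, 16, 64])

    # Splitting *inside* a block (e.g., between the q/k/v projections and the
    # rest of attention) would mean the stage boundary no longer carries a
    # single hidden-states tensor: the residual stream plus the attention
    # intermediates would all have to be sent/received, which is the extra
    # complexity mentioned above.
```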

Out of curiosity, what is your particular use case for splitting up a block?