Open bhuvan777 opened 1 month ago
cc: @H-Huang @wconstab
Hi @bhuvan777, right, we currently only support splitting at the Transformer block level. Splitting at a finer granularity is possible, but it would require more code and make the pipeline parallel implementation more complex. When we split at the block level, it is simpler to determine which input/output activations need to be sent and received between pipeline stages.
Curious, what is your particular use case for splitting up a block?
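To illustrate the point about activations, here is a minimal sketch in plain Python. It is a toy and not the actual pipeline code: `attention` and `mlp` are hypothetical scalar stand-ins, normalization is omitted, and the block is assumed to use the usual residual connections. At a block boundary only the hidden state crosses stages, while a cut inside the block also has to carry the pending residual input.

```python
# Toy stand-ins for the real sub-layers (hypothetical, scalar-valued).
def attention(x):
    return 0.5 * x


def mlp(x):
    return 2.0 * x


def block(h):
    """Whole block: a single value goes in and a single value comes out."""
    h = h + attention(h)  # residual connection around attention
    h = h + mlp(h)        # residual connection around the MLP
    return h


def block_first_half(h):
    """Cut after attention, before the residual add.

    Both the residual input and the attention output are still live,
    so two values would have to be sent to the next stage.
    """
    return h, attention(h)


def block_second_half(h, attn_out):
    """Finish the block on the next stage from the two received values."""
    h = h + attn_out
    return h + mlp(h)


whole = block(1.0)
split = block_second_half(*block_first_half(1.0))
```

Both paths compute the same result, but the mid-block cut changes the send/recv signature from one value to two, which is the kind of bookkeeping that block-level splitting avoids.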
When configuring pipeline splitting by specifying exact layers in the config (`--experimental.pipeline_parallel_split_points`), we are unable to assign sub-layers (e.g., `layer.4.attn.qvw`). If we attempt to do so, all layers are allocated to device rank 0, leaving device rank 1 without any layers (assuming we are using 2 devices). To avoid this, we need to specify the layer as a whole (e.g., `layer.4`). Is there a specific reason for this limitation, or could we expect support for more granular assignment (e.g., sub-layers like `layer.4.attn.qvw`) in future updates?
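For anyone trying to picture the behavior described above, here is a rough sketch of how block-level split points could partition a model. This is an assumption for illustration only, not the project's actual splitting code: `split_into_stages` is a hypothetical helper, and the module names other than `layer.4` and `layer.4.attn.qvw` are made up.

```python
def split_into_stages(module_names, split_points):
    """Partition an ordered list of top-level module names into stages.

    A new stage begins at every name listed in split_points. Names are
    compared against top-level modules only, so a sub-layer name never
    matches anything (an assumption of this sketch).
    """
    stages = [[]]
    for name in module_names:
        if name in split_points and stages[-1]:
            stages.append([])  # start the next pipeline stage here
        stages[-1].append(name)
    return stages


# Hypothetical top-level layout: embeddings, 8 blocks, final norm, output.
modules = ["tok_embeddings"] + [f"layer.{i}" for i in range(8)] + ["norm", "output"]

block_level = split_into_stages(modules, ["layer.4"])
sub_layer = split_into_stages(modules, ["layer.4.attn.qvw"])
```

Under these assumptions, `block_level` contains two stages with the second starting at `layer.4`, while `sub_layer` contains a single stage holding every module, which would match the symptom of rank 1 receiving no layers.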