Hi,
I was exploring using Tensor Parallel when training. I was wondering if you had any input on the correct use of RowParallelLinear when it comes to the feedforward out.
For example:
Column Parallel over q, k, v, and ff inner.
Row Parallel over attn out.
I am not 100% sure whether the feedforward out should be Row Parallel as well.
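As a point of reference, the standard Column Parallel, SwiGLU, Row Parallel layout can be checked in a single process. The sketch below is a minimal NumPy simulation under stated assumptions, not code from any tensor-parallel library: the names `w_value`, `w_gate`, `w_out`, and `world_size` are hypothetical, the sum over shards stands in for an all-reduce, and it covers only the feedforward branch. Whether the same holds for a fused attn/ff projection depends on how the fused weight is laid out, which this sketch does not address.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu(x_proj):
    # x_proj holds the [value, gate] halves along the last axis
    value, gate = np.split(x_proj, 2, axis=-1)
    return value * silu(gate)

rng = np.random.default_rng(0)
dim, ff_inner, world_size = 8, 16, 2  # hypothetical sizes

x = rng.standard_normal((4, dim))
w_value = rng.standard_normal((dim, ff_inner))
w_gate = rng.standard_normal((dim, ff_inner))
w_out = rng.standard_normal((ff_inner, dim))

# Unsharded reference: fused [value, gate] projection, SwiGLU, output projection
reference = swiglu(x @ np.concatenate([w_value, w_gate], axis=-1)) @ w_out

# Column parallel: shard the value and gate halves separately along the output
# axis, so that every rank holds matching value/gate columns.
value_shards = np.split(w_value, world_size, axis=1)
gate_shards = np.split(w_gate, world_size, axis=1)
# Row parallel: shard the output projection along its input axis.
out_shards = np.split(w_out, world_size, axis=0)

partials = []
for rank in range(world_size):
    w_in_local = np.concatenate([value_shards[rank], gate_shards[rank]], axis=-1)
    hidden_local = swiglu(x @ w_in_local)  # no communication needed here
    partials.append(hidden_local @ out_shards[rank])  # partial sum per rank

# Summing the partial outputs plays the role of the all-reduce.
tensor_parallel = sum(partials)
matches = np.allclose(reference, tensor_parallel)
```

In this simulation the value and gate weights are split per half rather than by chunking the concatenated weight, because chunking the concatenation would put all value columns on one rank and all gate columns on the other.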
Normally I would just do Column Parallel, SwiGLU, Row Parallel in a standard FeedForward, but it is not super clear to me how to handle this case when it comes to fused attn ff and ff tail.
Any input would be greatly appreciated.
Thank you,
Enrico