This PR contains various optimizations for TL/MLX5/a2a. In order of importance/relevance:
1) support rectangular blocks
2) other configurations in how we post the WQEs:
   - iterate across nodes before blocks when posting the WQEs
   - reuse dm chunks
   - send blocks in batches
3) knomial fan-in for the internode sync
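For readers unfamiliar with item 3, here is a minimal sketch of how a knomial fan-in tree can be laid out. It is an illustration only and not the code in this PR: the function names `knomial_parent` and `knomial_children` are hypothetical, the tree is assumed to be rooted at rank 0, and `radix` is assumed to be at least 2. Each rank waits for its children to signal, then signals its parent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: parent of `rank` in a knomial tree of the given
 * radix rooted at rank 0. Returns -1 for the root. */
static int knomial_parent(int rank, int radix)
{
    int dist;

    assert(radix >= 2);
    /* Find the lowest non-zero base-`radix` digit of `rank`; clearing
     * that digit gives the parent. */
    for (dist = 1; rank / dist > 0; dist *= radix) {
        int digit = (rank / dist) % radix;
        if (digit != 0) {
            return rank - digit * dist;
        }
    }
    return -1;
}

/* Hypothetical helper: returns the number of children of `rank` among
 * `size` ranks. If `children` is not NULL, at most `max` child ranks
 * are stored in it. */
static int knomial_children(int rank, int size, int radix,
                            int *children, int max)
{
    int n = 0, dist, j;

    assert(radix >= 2);
    for (dist = 1; dist < size; dist *= radix) {
        if ((rank / dist) % radix != 0) {
            break; /* at this distance the rank reports to its parent */
        }
        for (j = 1; j < radix; j++) {
            int child = rank + j * dist;
            if (child < size) {
                if (children != NULL && n < max) {
                    children[n] = child;
                }
                n++;
            }
        }
    }
    return n;
}
```

With 8 ranks and radix 2, rank 0 collects from ranks 1, 2 and 4, and rank 6 reports to rank 4, so the sync completes in 3 rounds instead of 7 sequential notifications.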
We might want to merge this PR as is, or split it into several smaller ones. Either way, this branch points to a working version that can be used as is for performance experiments.
TODO:
One important optimization that is not yet implemented is support for multiple NICs. So far, our algorithm uses only one NIC.
cc @lappazos @x41lakazam