TL/MLX5: various optimizations

What

This PR contains various optimizations for TL/MLX5/a2a. In order of importance/relevance:

1) support rectangular blocks
2) other configurations in how we post the WQEs:
   - iterate across nodes before blocks when posting the WQEs
   - reuse dm chunks
   - send blocks by batch
3) knomial fan-in for the internode sync
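
For readers unfamiliar with the knomial fan-in pattern, here is a minimal sketch of how parent and child ranks can be derived in a radix-k tree rooted at rank 0. It is not the code from this branch: `knomial_parent`, `knomial_children`, and their parameters are hypothetical names chosen for illustration, and the actual sync in TL/MLX5 may pick the root, radix, and rank numbering differently.

```c
#include <assert.h>

/* Illustrative only: hypothetical helpers, not the UCC implementation.
 * Knomial tree of the given radix over ranks [0, size), rooted at rank 0. */

/* Returns the parent of `rank`, or -1 if `rank` is the root. */
int knomial_parent(int rank, int size, int radix)
{
    assert(radix >= 2 && size > 0 && rank >= 0 && rank < size);
    for (long dist = 1; dist < size; dist *= radix) {
        int digit = (int)((rank / dist) % radix);
        if (digit != 0) {
            /* First non-zero base-radix digit: clearing it gives the parent. */
            return (int)(rank - digit * dist);
        }
    }
    return -1;
}

/* Writes the children of `rank` into `children` and returns their count. */
int knomial_children(int rank, int size, int radix,
                     int *children, int max_children)
{
    int n = 0;
    assert(radix >= 2 && size > 0 && rank >= 0 && rank < size);
    for (long dist = 1; dist < size; dist *= radix) {
        if ((rank / dist) % radix != 0) {
            break; /* rank reports to its parent at this level */
        }
        for (int i = 1; i < radix; i++) {
            long child = rank + i * dist;
            if (child < size) {
                assert(n < max_children);
                children[n++] = (int)child;
            }
        }
    }
    return n;
}
```

In a fan-in, each node would wait for a signal from every rank returned by `knomial_children` and then signal the rank returned by `knomial_parent`, so the root observes completion after roughly log_radix(size) levels instead of receiving one message per node.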

We might want to merge this PR as is, or split it into several smaller ones. Either way, this branch is a working version that can be used as is for performance experiments.

TODO:

One important optimization yet to be implemented is support for multiple NICs. So far, our algorithm uses only one NIC.

cc @lappazos @x41lakazam
