The accelerator collective module (which allocates host memory and moves the data onto the host in order to complete collective communications) has a higher priority than some collective modules that natively support CUDA/ROCm (such as UCC). This leads to terrible performance for most users unless they manually exclude the accelerator collective (via `--mca coll ^accelerator`).
This is definitely not user-friendly. We need to find a way to prevent the accelerator collective from getting in the way of collective components that handle accelerator buffers natively.
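For reference, a minimal sketch of the current workaround. It uses Open MPI's `OMPI_MCA_<param>` environment variable convention, which should be equivalent to passing `--mca coll ^accelerator` on the command line. The application name and rank count in the commented launch line are placeholders, not taken from this report.

```shell
# Exclude the accelerator collective component for every subsequent launch
# from this shell (same effect as `--mca coll ^accelerator`).
export OMPI_MCA_coll="^accelerator"

# Hypothetical launch line: ./my_app and -np 4 are placeholders.
# mpirun -np 4 ./my_app
```

Setting it in the environment avoids repeating the flag on every `mpirun` invocation, but it is still a per-user workaround rather than a fix for the priority ordering.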