openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License
177 stars 85 forks source link

TL/MLX5: add device mem mcast bcast #989

Open MamziB opened 2 weeks ago

MamziB commented 2 weeks ago

TL/MLX5: add device mem mcast bcast

MamziB commented 1 week ago
  • Please make sure building ucc is possible even without cuda support

Thanks! Can you please address the following comment?

1- Can you add some context an explanation about "what" "why" and "how" the pr is achieving? Also IMHO, using the term "cuda memory" would be more explicit.

If GPU direct RDMA is available, we can directly call ibv_post_send()/recv() using GPU buffers. Therefore, if this feature is enabled, we pre-post GPU buffers into our receive queue instead CPU buffers. Making it possible to receive MCAST packets into GPU directly without additional copies or stagings.

We have in ucc a component "mc" that is an interface with the different memory types, which provides alloc, free, memcpy, memset etc. We should use these instead, it provides better perf and cleaner code Please refer to https://github.com/openucx/ucc/pull/989#discussion_r1643299598

Please make sure building ucc is possible even without cuda support Sure will do

Please fix the CI issues Sure, will do

Can you add test for this feature? Sure, I will open new PR for it

MamziB commented 1 week ago

Hi @samnordmann Thanks for the constructive comments. I have added a new commit. Please let me know if you have more comments.

samnordmann commented 1 week ago

Thanks! Can you please address the following comment?

Can you add some context an explanation about "what" "why" and "how" the pr is achieving? Also IMHO, using the term "cuda memory" would be more explicit.

If GPU direct RDMA is available, we can directly call ibv_post_send()/recv() using GPU buffers. Therefore, if this feature is enabled, we pre-post GPU buffers into our receive queue instead CPU buffers. Making it possible to receive MCAST packets into GPU directly without additional copies or stagings.

So this adds the support for cuda memory type for user's buffer? I don't understand since cuda memory type is supposed to be supported already, as indicated here https://github.com/openucx/ucc/blob/b2d5a789e31339078578fff77ba451566fbebe8e/src/components/tl/mlx5/tl_mlx5_team.c#L299

We have in ucc a component "mc" that is an interface with the different memory types, which provides alloc, free, memcpy, memset etc. We should use these instead, it provides better perf and cleaner code

Please refer to #989 (comment)

I am sorry but I don't understand why you think it is better to not use mc component. This component is here exactly for this purpose and is used everywhere in the codebase. Using the component has many benefits (that I can list if needed), while I fail to see the concrete benefit of not using it.

Please make sure building ucc is possible even without cuda support

Sure will do

This comment is not addressed. Every cuda API call should be decorated with appropriate compilator guard. (another advantage of using mc). This is related to the error seen in the CI. The program should compile with a configure command of the type configure --with-tls=ucp,mlx5 --without-cuda, otherwise it will be rejected by the tests

Please fix the CI issues

Sure, will do

there are still the same issues

Can you add test for this feature?

Sure, I will open new PR for it

Ok, I think it is important to test this feature before merging it. This test should be triggered by the CI