Open MamziB opened 2 weeks ago
- Please make sure building ucc is possible even without cuda support
Thanks! Can you please address the following comment?
1- Can you add some context an explanation about "what" "why" and "how" the pr is achieving? Also IMHO, using the term "cuda memory" would be more explicit.
If GPU direct RDMA is available, we can directly call ibv_post_send()/recv() using GPU buffers. Therefore, if this feature is enabled, we pre-post GPU buffers into our receive queue instead CPU buffers. Making it possible to receive MCAST packets into GPU directly without additional copies or stagings.
We have in ucc a component "mc" that is an interface with the different memory types, which provides alloc, free, memcpy, memset etc. We should use these instead, it provides better perf and cleaner code Please refer to https://github.com/openucx/ucc/pull/989#discussion_r1643299598
Please make sure building ucc is possible even without cuda support Sure will do
Please fix the CI issues Sure, will do
Can you add test for this feature? Sure, I will open new PR for it
Hi @samnordmann Thanks for the constructive comments. I have added a new commit. Please let me know if you have more comments.
Thanks! Can you please address the following comment?
Can you add some context an explanation about "what" "why" and "how" the pr is achieving? Also IMHO, using the term "cuda memory" would be more explicit.
If GPU direct RDMA is available, we can directly call ibv_post_send()/recv() using GPU buffers. Therefore, if this feature is enabled, we pre-post GPU buffers into our receive queue instead CPU buffers. Making it possible to receive MCAST packets into GPU directly without additional copies or stagings.
So this adds the support for cuda memory type for user's buffer? I don't understand since cuda memory type is supposed to be supported already, as indicated here https://github.com/openucx/ucc/blob/b2d5a789e31339078578fff77ba451566fbebe8e/src/components/tl/mlx5/tl_mlx5_team.c#L299
We have in ucc a component "mc" that is an interface with the different memory types, which provides alloc, free, memcpy, memset etc. We should use these instead, it provides better perf and cleaner code
Please refer to #989 (comment)
I am sorry but I don't understand why you think it is better to not use mc component. This component is here exactly for this purpose and is used everywhere in the codebase. Using the component has many benefits (that I can list if needed), while I fail to see the concrete benefit of not using it.
Please make sure building ucc is possible even without cuda support
Sure will do
This comment is not addressed. Every cuda API call should be decorated with appropriate compilator guard. (another advantage of using mc). This is related to the error seen in the CI. The program should compile with a configure command of the type configure --with-tls=ucp,mlx5 --without-cuda
, otherwise it will be rejected by the tests
Please fix the CI issues
Sure, will do
there are still the same issues
Can you add test for this feature?
Sure, I will open new PR for it
Ok, I think it is important to test this feature before merging it. This test should be triggered by the CI
TL/MLX5: add device mem mcast bcast