Closed nsarka closed 3 weeks ago
Is it ready for review? If so, please add the label
It is ready for review. I’m OOO, so I can’t add the label. Please let me know your opinion, though, on my question about supporting non-rooted collectives.
- The CI error looks relevant
Thanks, I will take a look
- For my own curiosity: Is there a concrete motivation for this patch? I mean, is it a user's request, or do we only think it is a nice feature to have?
A hang was reported by the HPC SDK team (I think that's the name, I'm not sure, I heard it from Tommy) in the case of asymmetric memory in a rooted collective. Consider this case: non-root ranks are UCC_MEMORY_TYPE_HOST, root rank is src=HOST dst=CUDA. Before this patch, the root will exit because the memory types are asymmetric. However, non-roots do not know that the root has exited. So there is a hang.
- Can we hope (in the future) to have a more efficient way to handle asymmetric memory, or does it need to be performed, as here, by emulating symmetric memory with a scratch buffer and copying in/out to the user's buffer?
I believe ucp can handle mismatched memory without any problems, which is why I was wondering about implementing a copy-out mechanism like this for non-rooted collectives. I will test it and report back. For rooted collectives, I'm not sure yet if there's a better way.
Hi @Sergei-Lebedev, I have addressed all of your comments. For scatter and scatterv, a new src buf will be allocated and copied into. To support this as a persistent collective, I moved the copy-in to ucc_collective_post. Then I added a persistent gtest that will:
CI is failing with:
failed to register layer: write libcupti.so.11.3:
no space left on device
Error: Docker pull failed with exit code 1
ucc test had a jenkins issue according to Artem
bot:retest
@nsarka LGTM, please fix commit messages to pass codestyle check
The CI is failing on:
[2024-09-03T23:44:41.401Z] [----------] 24 tests from test_tl_mlx5_dm
[2024-09-03T23:44:41.401Z] [ RUN ] test_tl_mlx5_dm.MemcpyToDeviceMemory/0
[2024-09-03T23:44:41.401Z] [swx-clx01:452 :0:452] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[2024-09-03T23:44:41.401Z] ==== backtrace (tid: 452) ====
[2024-09-03T23:44:41.401Z] 0 /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f173e614564]
[2024-09-03T23:44:41.401Z] 1 /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x3375f) [0x7f173e61475f]
[2024-09-03T23:44:41.401Z] 2 /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x33a46) [0x7f173e614a46]
[2024-09-03T23:44:41.401Z] 3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f173dd06520]
[2024-09-03T23:44:41.401Z] 4 /usr/lib/x86_64-linux-gnu/libibverbs.so.1(ibv_dereg_mr+0x25) [0x7f173e59c115]
[2024-09-03T23:44:41.401Z] 5 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x19916c9) [0x55d6cea2e6c9]
[2024-09-03T23:44:41.401Z] 6 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x57ed91) [0x55d6cd61bd91]
[2024-09-03T23:44:41.401Z] 7 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5726e2) [0x55d6cd60f6e2]
[2024-09-03T23:44:41.401Z] 8 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x57296e) [0x55d6cd60f96e]
[2024-09-03T23:44:41.401Z] 9 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x573a49) [0x55d6cd610a49]
[2024-09-03T23:44:41.401Z] 10 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x573f48) [0x55d6cd610f48]
[2024-09-03T23:44:41.401Z] 11 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x537d02) [0x55d6cd5d4d02]
[2024-09-03T23:44:41.401Z] 12 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f173dcedd90]
[2024-09-03T23:44:41.401Z] 13 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f173dcede40]
[2024-09-03T23:44:41.401Z] 14 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x54ba95) [0x55d6cd5e8a95]
[2024-09-03T23:44:41.401Z] =================================
[2024-09-03T23:44:43.291Z] make[1]: *** [Makefile:2030: test] Segmentation fault (core dumped)
[2024-09-03T23:44:43.291Z] make[1]: Leaving directory '/opt/nvidia/src/ucc/build/test/gtest'
[2024-09-03T23:44:43.291Z] make: *** [Makefile:1001: gtest] Error 2
This seems unrelated to my changes
In our UCC call, we defined two types of "asymmetric memory":
Since strong asymmetric memory would require an allreduce for every `ucc_collective_init`, this PR focuses on weak asymmetric memory.
Previously, when the src and dst memory types mismatched, every rank could see the mismatch, so every rank errored out saying we don't support asymmetric memory. In a rooted collective the root does the same, but a non-root rank has only one src OR dst buffer, so it has no way to know that the root is going to error out. This caused a hang.
So, this PR enables asymmetric memory for rooted collectives. On the root, it assumes the src buffer is the "true" mem_type and makes a scratch allocation for the dst buffer of size `dst.info.count * dt_size(dst.info.mem_type)` with mem_type `src.info.mem_type`. After this, it runs the collective. Once the collective completes, it copies out from the scratch allocation into the old dst buffer.
@Sergei-Lebedev My only remaining question is what should we do with non-rooted collectives? Should they still error out? Or should this apply to them too?
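For readers following along, the root-side scratch flow described above can be sketched roughly as follows. This is a host-only illustration, not the code in this PR: `buf_info_t`, `fake_collective`, and `run_rooted_with_scratch` are hypothetical names, the scratch size is assumed to be element count times element size, and plain `malloc`/`memcpy` stand in for the memory-type-aware allocation and copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not the UCC types. */
typedef enum { MEM_HOST, MEM_CUDA } mem_type_t;

typedef struct {
    void      *buffer;
    size_t     count;
    size_t     elem_size;
    mem_type_t mem_type;
} buf_info_t;

typedef int (*coll_fn_t)(const buf_info_t *src, buf_info_t *dst);

/* Stand-in collective: refuses mismatched memory types, else copies. */
static int fake_collective(const buf_info_t *src, buf_info_t *dst)
{
    size_t src_bytes = src->count * src->elem_size;
    size_t dst_bytes = dst->count * dst->elem_size;
    if (src->mem_type != dst->mem_type) {
        return -1; /* asymmetric memory is not supported here */
    }
    memcpy(dst->buffer, src->buffer,
           src_bytes < dst_bytes ? src_bytes : dst_bytes);
    return 0;
}

/* Root side: emulate symmetric memory with a scratch dst buffer. */
static int run_rooted_with_scratch(const buf_info_t *src, buf_info_t *dst,
                                   coll_fn_t coll)
{
    size_t     bytes;
    buf_info_t scratch;
    int        status;

    assert(src != NULL && dst != NULL && coll != NULL);
    if (src->mem_type == dst->mem_type) {
        return coll(src, dst); /* already symmetric: no scratch needed */
    }
    bytes            = dst->count * dst->elem_size;
    scratch          = *dst;
    scratch.mem_type = src->mem_type; /* src is treated as the "true" type */
    scratch.buffer   = malloc(bytes); /* assumption: host malloc as allocator */
    if (scratch.buffer == NULL) {
        return -1;
    }
    status = coll(src, &scratch);
    if (status == 0) {
        /* Copy out; a real version would copy across memory types. */
        memcpy(dst->buffer, scratch.buffer, bytes);
    }
    free(scratch.buffer);
    return status;
}
```

The sketch allocates and frees the scratch buffer inside a single call for brevity; it does not model the init/post/finalize split that a persistent collective needs.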