openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License
195 stars 96 forks source link

Asymmetric memory #1000

Closed nsarka closed 3 weeks ago

nsarka commented 2 months ago

In our UCC call, we defined two types of "asymmetric memory":

Since strong asymmetric memory would require an allreduce for every ucc_collective_init, this PR focuses on weak asymmetric memory.

Previously, what would happen is when the src/dst mismatch we can see that they mismatch on every rank, so everybody errors out saying we don't support asymmetric memory. But on rooted collectives, the root will do the same, but on non-root ranks there is only one src OR dst buffer, so there's no way to know that the root is going to error out. This caused a hang.

So, this PR will enable asymmetric memory for rooted collectives. On the root, it assumes the src buffer is the "true" mem_type, and makes a scratch allocation for the dst buffer of size dst.info.count * dt_size(dst.info.mem_type) with mem_type src.info.mem_type. After this, it will run the collective. Once the collective completes, it will copy out from the scratch allocation into the old dst buffer.

@Sergei-Lebedev My only remaining question is what should we do with non-rooted collectives? Should they still error out? Or should this apply to them too?

samnordmann commented 2 months ago

Is it ready for review? If so, please add the label

nsarka commented 2 months ago

It is ready for review. I’m OOO, so I can’t add the label. Please let me know your opinion though on my question about supporting non-rooted collectivesOn Jul 18, 2024, at 7:27 AM, samnordmann @.***> wrote: Is it ready for review? If so, please add the label

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you authored the thread.Message ID: @.***>

nsarka commented 2 months ago
  1. The CI error looks relevant

Thanks, I will take a look

  1. For my own curiosity: Is there a concrete motivation for this patch? I mean, is it a user's request, or do we only think it is a nice feature to have?

A hang was reported by the HPC SDK team (I think that's the name, I'm not sure, I heard it from Tommy) in the case of asymmetric memory in a rooted collective. Consider this case: non-root ranks are UCC_MEMORY_TYPE_HOST, root rank is src=HOST dst=CUDA. Before this patch, the root will exit because the memory types are asymmetric. However, non-roots do not know that the root has exited. So there is a hang.

  1. Can we hope (in the future) to have a more efficient way to handle asymmetric memory, or does it need to be performed, as here, by emulating symmetric memory with a scratch buffer and copying in/out to the user's buffer?

I believe ucp can handle mismatched memory without any problems which is why I was wondering about implementing a copy-out mechanism like this for non-rooted collectives. I will test it and report back. For rooted collectives, I'm not sure yet if there's a better way.

nsarka commented 1 month ago

Hi @Sergei-Lebedev , I have addressed all of your comments. For scatter and scatterv, a new src buf will be allocated and copied into. To support this as a persistent collective I moved the copy-in to ucc_collective_post. Then I added a persistent gtest that will:

nsarka commented 1 month ago

CI is failing with:

failed to register layer: write libcupti.so.11.3:
no space left on device
Error: Docker pull failed with exit code 1

ucc test had a jenkins issue according to Artem

nsarka commented 1 month ago

bot:retest

Sergei-Lebedev commented 1 month ago

@nsarka LGTM, please fix commit messages to pass codestyle check

nsarka commented 3 weeks ago

The CI is failing on:

[2024-09-03T23:44:41.401Z] [----------] 24 tests from test_tl_mlx5_dm
[2024-09-03T23:44:41.401Z] [ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/0
[2024-09-03T23:44:41.401Z] [swx-clx01:452  :0:452] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[2024-09-03T23:44:41.401Z] ==== backtrace (tid:    452) ====
[2024-09-03T23:44:41.401Z]  0  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f173e614564]
[2024-09-03T23:44:41.401Z]  1  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x3375f) [0x7f173e61475f]
[2024-09-03T23:44:41.401Z]  2  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x33a46) [0x7f173e614a46]
[2024-09-03T23:44:41.401Z]  3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f173dd06520]
[2024-09-03T23:44:41.401Z]  4  /usr/lib/x86_64-linux-gnu/libibverbs.so.1(ibv_dereg_mr+0x25) [0x7f173e59c115]
[2024-09-03T23:44:41.401Z]  5  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x19916c9) [0x55d6cea2e6c9]
[2024-09-03T23:44:41.401Z]  6  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x57ed91) [0x55d6cd61bd91]
[2024-09-03T23:44:41.401Z]  7  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5726e2) [0x55d6cd60f6e2]
[2024-09-03T23:44:41.401Z]  8  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x57296e) [0x55d6cd60f96e]
[2024-09-03T23:44:41.401Z]  9  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x573a49) [0x55d6cd610a49]
[2024-09-03T23:44:41.401Z] 10  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x573f48) [0x55d6cd610f48]
[2024-09-03T23:44:41.401Z] 11  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x537d02) [0x55d6cd5d4d02]
[2024-09-03T23:44:41.401Z] 12  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f173dcedd90]
[2024-09-03T23:44:41.401Z] 13  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f173dcede40]
[2024-09-03T23:44:41.401Z] 14  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x54ba95) [0x55d6cd5e8a95]
[2024-09-03T23:44:41.401Z] =================================
[2024-09-03T23:44:43.291Z] make[1]: *** [Makefile:2030: test] Segmentation fault (core dumped)
[2024-09-03T23:44:43.291Z] make[1]: Leaving directory '/opt/nvidia/src/ucc/build/test/gtest'
[2024-09-03T23:44:43.291Z] make: *** [Makefile:1001: gtest] Error 2

This seems unrelated to my changes