Open nirandaperera opened 1 month ago
I am seeing several Message truncated errors in the UCC log.
```
6479650.125378] [XPS-15-9510:3236166:7] tl_ucp_coll.c:133 TL_UCP ERROR failure in recv completion Message truncated
[1716479650.126129] [XPS-15-9510:3236166:7] ucp_request.c:748 UCX DEBUG message truncated: recv_length 10 offset 0 buffer_size 2
[1716479650.126132] [XPS-15-9510:3236166:7] tl_ucp_coll.c:133 TL_UCP ERROR failure in recv completion Message truncated
[1716479650.126893] [XPS-15-9510:3236166:7] ucp_request.c:748 UCX DEBUG message truncated: recv_length 10 offset 0 buffer_size 2
[1716479650.126895] [XPS-15-9510:3236166:7] tl_ucp_coll.c:133 TL_UCP ERROR failure in recv completion Message truncated
[1716479650.127440] [XPS-15-9510:3236166:7] ec_cpu.c:186 cpu ec DEBUG executor finalize, eee: 0x584f0ee37600
[1716479650.127449] [XPS-15-9510:3236166:7] ucp_request.c:748 UCX DEBUG message truncated: recv_length 10 offset 0 buffer_size 2
```
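For readers decoding the UCX debug line: `recv_length 10 offset 0 buffer_size 2` means a 10-byte incoming message matched a receive that was posted with only a 2-byte buffer, so UCX reports truncation. A minimal sketch of that check (illustrative Python, not UCX source):

```python
def check_recv(recv_length: int, buffer_size: int) -> str:
    """Toy version of the check UCX performs on receive completion:
    if the matched message is larger than the posted buffer, the
    receive fails with 'message truncated'."""
    if recv_length > buffer_size:
        return "message truncated"
    return "ok"

# Values taken from the debug log above:
print(check_recv(recv_length=10, buffer_size=2))  # -> message truncated
```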
@nirandaperera Just going out on a limb and saying that UCX was also built with thread-multiple, correct?
@nirandaperera can you paste a reproducer?
@janjust I was using conda ucx 1.15.0
I'm assuming it's already built with --thread-multiple; can you check?
```
$ ucx_info -v
# Library version: 1.17.0
# Library path: /hpc/local/benchmarks/daily/next/2024-05-22/hpcx-gcc-redhat7/ucx/mt/lib/libucs.so.0
# API headers version: 1.17.0
# Git branch 'v1.17.x', revision b67bc34
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --with-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.4.0/redhat7 --with-gdrcopy --prefix=/build-result/hpcx-gcc-redhat7/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37
```
It should say `--enable-mt`.
@janjust yes it does
Do you have a simple reproducer handy? We'll take a look. We did something similar to what you're doing with AR, but changed to 1 ucc_context/thread due to performance, not correctness.
Just to add onto this: from what I'm hearing, we've never tried this model of multiple threads per team, and to me it seems to mimic multiple threads per mpi_comm, which would be against the spec.
What I suspect is happening is that the threads are progressing each other's callbacks because the context is shared, so each thread picks up tasks in no particular order (this is me guessing based on what we've observed in our case).
@nsarka Can you please look at this thread? Does what I'm saying make sense, and is this what you observed with sliding-window?
@janjust FWIW I am using tagged collectives
@janjust There's a separate thread that progresses the ctx. From what I understood from the source code, all teams will ultimately be enqueuing tasks to a single progress queue in the context, isn't that right?
yes, I believe that's correct
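A toy model of that design (plain Python, not the UCC API; names like `post_collective` and `progress` are made up for illustration): several worker threads enqueue tasks into one context-wide queue, and whichever thread happens to call progress pops the next task, so completions interleave across threads in no particular per-team order.

```python
import queue
import threading

# Toy model: one progress queue shared by every team in a context.
progress_queue = queue.Queue()
completed = []                       # (team, task) in completion order
completed_lock = threading.Lock()

def post_collective(team: int, task: int) -> None:
    # Every team's collective lands on the same context-wide queue.
    progress_queue.put((team, task))

def progress() -> None:
    # Any thread may progress the context: it pops whatever task is
    # next, regardless of which team or thread posted it.
    try:
        item = progress_queue.get_nowait()
    except queue.Empty:
        return
    with completed_lock:
        completed.append(item)

def worker(team: int) -> None:
    for task in range(3):
        post_collective(team, task)
        progress()                   # may complete ANOTHER team's task

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
while not progress_queue.empty():    # drain anything left over
    progress()

# Every task completes exactly once, but any thread may have been the
# one to complete it.
assert sorted(completed) == sorted((t, k) for t in range(4) for k in range(3))
```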
@janjust I was using v1.2. I just tried my code with the ucc master branch build and I don't see this error anymore. I think previously the allgather knomial algorithm was active. But FWIU, in the master branch there are two new algorithms,
```
3 : bruck : O(log(N)) Variation of Bruck algorithm
4 : sparbit : O(log(N)) SPARBIT algorithm
```
I think one of these is now running the operation. (I tried v1.3 as well, and it had the same issue.)
Sorry, I misspoke. I put some logs in the code and found out that it's the knomial algorithm in both cases.
There was a fix for the multithreaded case; it was merged into master but we didn't include it in the v1.3.x branch: https://github.com/openucx/ucc/pull/932
@Sergei-Lebedev it turned out it was the latest PR that fixed the problem. #926
I think this change might be the fix. https://github.com/openucx/ucc/pull/926/files#diff-cd0947a917169a44f349c3331aace588a31757bdcc4d555f717048318719a09aR376
hmm, odd, that pr had nothing to do with multi-threading
> @Sergei-Lebedev it turned out it was the latest PR that fixed the problem. #926
> I think this change might be the fix. https://github.com/openucx/ucc/pull/926/files#diff-cd0947a917169a44f349c3331aace588a31757bdcc4d555f717048318719a09aR376
yes, could be
@janjust I think the tags were not being propagated properly until this PR. :slightly_smiling_face:
@Sergei-Lebedev is there a way to get a patch release for v1.3?
> @Sergei-Lebedev is there a way to get a patch release for v1.3?
@manjugv wdyt?
@Sergei-Lebedev it might need a proper test case for this scenario anyway, I guess.
Hi, I have the following setup:

- UCC_THREAD_MULTIPLE
- UCC_CONTEXT_SHARED
- UCC_COLLECTIVE_INIT_AND_POST_UNORDERED
- UCC_NO_SYNC_COLLECTIVES

(IINM these configs are not used underneath.) I have multiple threads issuing a set of tagged allgather operations (with callbacks) to the team. I am keeping the send and receive buffers alive in the `data` object of the callback. It seems the receive buffers are receiving wrong data from the parallel allgather operations. It works fine for a single thread. Can someone please help me out here?

Is sharing a team among multiple threads an anti-pattern? I checked the allgather test code, and it seemed to me that there was a separate team for every test proc.
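Following the advice earlier in the thread (one ucc_context per thread), here is a toy sketch of that safer shape (plain Python, not UCC code; the per-thread queue merely stands in for a private context): each thread posts to and progresses only its own queue, so no thread ever drains or completes another thread's tasks.

```python
import queue
import threading

results = {}  # thread id -> tasks it completed (keys are distinct per thread)

def worker(tid: int) -> None:
    """Per-thread 'context': a private queue that only this thread
    posts to and progresses, so cross-thread matching cannot happen."""
    ctx = queue.Queue()              # stands in for a private ucc_context
    for task in range(3):
        ctx.put((tid, task))         # post a collective
    done = []
    while not ctx.empty():           # progress: only our own tasks appear
        done.append(ctx.get())
    results[tid] = done

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread completed exactly its own tasks, in the order it posted them.
assert all(results[t] == [(t, k) for k in range(3)] for t in range(4))
```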