openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

Wrong data received when sharing a team with multiple threads #979

Open nirandaperera opened 1 month ago

nirandaperera commented 1 month ago

Hi, I have the following setup.

I have multiple threads issuing a set of tagged allgather operations (with callbacks) to the team. I keep the send and receive buffers alive in the data object of the callback. It seems the receive buffers are getting wrong data from the parallel allgather operations; everything works fine with a single thread.

Can someone please help me out here?

Is sharing a team among multiple threads an anti-pattern? I checked the allgather test code, and it seemed to me that there was a separate team for every test process.
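
Roughly, the posting path looks like this (a minimal sketch, not the full reproducer; the buffer sizes, datatypes, and the persistent_t holder are illustrative, and error handling plus ucc_collective_finalize are omitted):

    /*
     * Minimal sketch: each thread calls post_allgather() on the same
     * shared team, with a distinct tag per in-flight operation.
     */
    #include <stdint.h>
    #include <stdlib.h>
    #include <ucc/api/ucc.h>

    typedef struct {
        void *sbuf;
        void *rbuf;
    } persistent_t; /* keeps buffers alive until the callback runs */

    static void done_cb(void *data, ucc_status_t status)
    {
        persistent_t *p = (persistent_t *)data;
        (void)status;
        /* consume p->rbuf here, then release everything */
        free(p->sbuf);
        free(p->rbuf);
        free(p);
    }

    static ucc_status_t post_allgather(ucc_team_h team, uint16_t tag,
                                       size_t count, size_t team_size)
    {
        persistent_t *p = malloc(sizeof(*p));
        p->sbuf = malloc(count * sizeof(int32_t));
        p->rbuf = malloc(count * team_size * sizeof(int32_t));

        ucc_coll_args_t args = {
            .mask      = UCC_COLL_ARGS_FIELD_TAG | UCC_COLL_ARGS_FIELD_CB,
            .coll_type = UCC_COLL_TYPE_ALLGATHER,
            .src.info  = {.buffer   = p->sbuf,
                          .count    = count,
                          .datatype = UCC_DT_INT32,
                          .mem_type = UCC_MEMORY_TYPE_HOST},
            .dst.info  = {.buffer   = p->rbuf,
                          .count    = count * team_size,
                          .datatype = UCC_DT_INT32,
                          .mem_type = UCC_MEMORY_TYPE_HOST},
            .tag       = tag, /* distinct tag per operation */
            .cb        = {.cb = done_cb, .data = p},
        };
        ucc_coll_req_h req;
        ucc_status_t   st = ucc_collective_init(&args, &req, team);

        return (st == UCC_OK) ? ucc_collective_post(req) : st;
    }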

nirandaperera commented 1 month ago

I am seeing several Message truncated errors in the UCC log.

[1716479650.125378] [XPS-15-9510:3236166:7]     tl_ucp_coll.c:133  TL_UCP ERROR failure in recv completion Message truncated
[1716479650.126129] [XPS-15-9510:3236166:7]     ucp_request.c:748  UCX  DEBUG message truncated: recv_length 10 offset 0 buffer_size 2
[1716479650.126132] [XPS-15-9510:3236166:7]     tl_ucp_coll.c:133  TL_UCP ERROR failure in recv completion Message truncated
[1716479650.126893] [XPS-15-9510:3236166:7]     ucp_request.c:748  UCX  DEBUG message truncated: recv_length 10 offset 0 buffer_size 2
[1716479650.126895] [XPS-15-9510:3236166:7]     tl_ucp_coll.c:133  TL_UCP ERROR failure in recv completion Message truncated
[1716479650.127440] [XPS-15-9510:3236166:7]          ec_cpu.c:186  cpu ec DEBUG executor finalize, eee: 0x584f0ee37600
[1716479650.127449] [XPS-15-9510:3236166:7]     ucp_request.c:748  UCX  DEBUG message truncated: recv_length 10 offset 0 buffer_size 2

janjust commented 1 month ago

@nirandaperera Just going out on a limb here: UCX was also built with thread-multiple, correct?

janjust commented 1 month ago

@nirandaperera can you paste a reproducer?

nirandaperera commented 1 month ago

@janjust I was using conda ucx 1.15.0

janjust commented 1 month ago

I'm assuming it's already built with --thread-multiple; can you check?

$ ucx_info -v
# Library version: 1.17.0
# Library path: /hpc/local/benchmarks/daily/next/2024-05-22/hpcx-gcc-redhat7/ucx/mt/lib/libucs.so.0
# API headers version: 1.17.0
# Git branch 'v1.17.x', revision b67bc34
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --with-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.4.0/redhat7 --with-gdrcopy --prefix=/build-result/hpcx-gcc-redhat7/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37

The output should include --enable-mt.

nirandaperera commented 1 month ago

@janjust yes it does

janjust commented 1 month ago

Do you have a simple reproducer handy? We'll take a look. We did something similar to what you're doing with AR, but changed to one ucc_context per thread due to performance, not correctness.
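
The per-thread context creation was essentially this (a rough sketch, assuming the usual ucc_context_config_read/ucc_context_create flow; OOB and team wiring omitted):

    /*
     * Rough sketch of the one-context-per-thread layout (assumes a
     * shared ucc_lib_h). Each thread creates and progresses only its
     * own context, so tasks never migrate across threads.
     */
    #include <stddef.h>
    #include <ucc/api/ucc.h>

    static ucc_context_h create_thread_context(ucc_lib_h lib)
    {
        ucc_context_params_t params = {
            .mask = UCC_CONTEXT_PARAM_FIELD_TYPE,
            .type = UCC_CONTEXT_EXCLUSIVE, /* used by one thread only */
        };
        ucc_context_config_h cfg;
        ucc_context_h        ctx = NULL;

        if (ucc_context_config_read(lib, NULL, &cfg) != UCC_OK) {
            return NULL;
        }
        if (ucc_context_create(lib, &params, cfg, &ctx) != UCC_OK) {
            ctx = NULL;
        }
        ucc_context_config_release(cfg);
        return ctx;
    }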

janjust commented 1 month ago

Just to add to this: from what I'm hearing, we've never tried this model of multiple threads per team, and to me it seems to mimic multiple threads per mpi_comm, which would be against the spec.

What I suspect is happening is that the threads are progressing each other's callbacks because the context is shared, so each thread picks up tasks in no particular order (this is me guessing based on what we've observed in our case).

janjust commented 1 month ago

@nsarka Can you please look at this thread? Does what I'm saying make sense, and is this what you observed with sliding-window?

nirandaperera commented 1 month ago

@janjust FWIW I am using tagged collectives

nirandaperera commented 1 month ago

@janjust There's a separate thread that progresses the ctx. From what I understood from the source code, all teams ultimately enqueue tasks to a single progress queue in the context, don't they?
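
For reference, the progress thread is essentially just this (a minimal sketch; ctx and the running flag are my names):

    /*
     * Dedicated progress thread: drives completions (and thus the
     * callbacks) for every team created from the shared context.
     */
    #include <stdatomic.h>
    #include <ucc/api/ucc.h>

    static _Atomic int running = 1;

    static void *progress_loop(void *arg)
    {
        ucc_context_h ctx = (ucc_context_h)arg;

        while (atomic_load(&running)) {
            ucc_context_progress(ctx);
        }
        return NULL;
    }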

janjust commented 1 month ago

yes, I believe that's correct

nirandaperera commented 1 month ago

@janjust I was using v1.2. I just tried my code with the ucc master branch build, and I don't see this error anymore. I think the knomial allgather algorithm was active previously. But FWIU, in the master branch there are two new algorithms,

    3 :            bruck : O(log(N)) Variation of Bruck algorithm
    4 :          sparbit : O(log(N)) SPARBIT algorithm

I think one of these is now running the operation. (I tried v1.3 as well & it had the same issue)

nirandaperera commented 1 month ago

Sorry, I misspoke. I put some logs in the code and found out that it's the knomial algorithm in both cases.
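
For anyone else checking this: instead of adding logs, the algorithm can presumably be pinned explicitly for an A/B test. A sketch, assuming the UCC_TL_UCP_TUNE selection syntax from the UCC user guide (where @<name> picks an algorithm by name):

    /*
     * Sketch: force the TL/UCP allgather algorithm. Must run before
     * UCC reads its configuration (i.e., before ucc_init /
     * ucc_context_create).
     */
    #include <stdlib.h>

    static void pin_allgather_alg(const char *tune_token)
    {
        /* e.g. "allgather:@knomial" or "allgather:@sparbit" */
        setenv("UCC_TL_UCP_TUNE", tune_token, 1);
    }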

Sergei-Lebedev commented 1 month ago

There was a fix for the multithreaded case; it was merged into master, but we didn't include it in the v1.3.x branch: https://github.com/openucx/ucc/pull/932

nirandaperera commented 1 month ago

@Sergei-Lebedev It turned out that the latest PR fixed the problem: #926

I think this change might be the fix: https://github.com/openucx/ucc/pull/926/files#diff-cd0947a917169a44f349c3331aace588a31757bdcc4d555f717048318719a09aR376

janjust commented 1 month ago

Hmm, odd; that PR had nothing to do with multi-threading.

Sergei-Lebedev commented 1 month ago

> @Sergei-Lebedev It turned out that the latest PR fixed the problem: #926
>
> I think this change might be the fix: https://github.com/openucx/ucc/pull/926/files#diff-cd0947a917169a44f349c3331aace588a31757bdcc4d555f717048318719a09aR376

yes, could be

nirandaperera commented 1 month ago

@janjust I think the tags were not propagated properly until this PR. :slightly_smiling_face:
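
Schematically, what I understand the fix to do is fold the user tag into the matching tag. An illustration only (NOT UCC's actual bit layout): once the user-visible collective tag is part of the tag that tl_ucp matches on, receives posted by one collective can no longer match sends from a concurrent one, which is exactly the "message truncated" failure mode above.

    #include <stdint.h>

    static inline uint64_t make_match_tag(uint16_t user_tag,
                                          uint16_t seq_id,
                                          uint32_t src_rank)
    {
        return ((uint64_t)user_tag << 48) | /* per-operation user tag */
               ((uint64_t)seq_id   << 32) | /* per-team sequence id   */
               (uint64_t)src_rank;          /* sender rank            */
    }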

nirandaperera commented 1 month ago

@Sergei-Lebedev is there a way to get a patch release for v1.3?

Sergei-Lebedev commented 1 month ago

> @Sergei-Lebedev is there a way to get a patch release for v1.3?

@manjugv wdyt?

nirandaperera commented 1 month ago

@Sergei-Lebedev It would probably need a proper test case for this scenario anyway, I guess.