openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Assertion `worker->inprogress++ == 0' failed #10039

Open pereverges opened 4 months ago

pereverges commented 4 months ago

Describe the bug

I compiled the code on my laptop, where it runs without problems. After porting it to a server, I sometimes hit the error below. It does not happen on every run, and I have not been able to determine what triggers it.

[gs07r1b29:3935050:2:3935480] ucp_worker.c:2990 Assertion `worker->inprogress++ == 0' failed
backtrace (tid:3935480) ====
0 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucs.so.0.0.0(ucs_handle_error+0x3f4) [0x7f5584b05704]
1 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucs.so.0.0.0(ucs_fatal_error_message+0xec) [0x7f5584b02b9c]
2 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucs.so.0.0.0(ucs_fatal_error_format+0x103) [0x7f5584b02aa3]
3 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucp.so.0.0.0(ucp_worker_progress+0x1a3) [0x7f5552cfb433]
4 [0x7f5515415e5b]

[gs07r1b29:3935050:1:3935478] ucp_worker.c:2995 Assertion `--worker->inprogress == 0' failed
backtrace (tid:3935478) ====
0 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucs.so.0.0.0(ucs_handle_error+0x3f4) [0x7f5584b05704]
1 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucs.so.0.0.0(ucs_fatal_error_message+0xec) [0x7f5584b02b9c]
2 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucs.so.0.0.0(ucs_fatal_error_format+0x103) [0x7f5584b02aa3]
3 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucp.so.0.0.0(ucp_worker_progress+0xd3) [0x7f5552cfb363]
4 [0x7f5515415e5b]

Steps to Reproduce

Run an application that uses stream send / stream receive through jucx and is structured similarly to the UCXBenchmark.

Setup and versions

yosefe commented 3 months ago

This looks like an issue with multi-threading support. If the application is multi-threaded, UCX has to be compiled with multi-thread support (--enable-mt), and ucp_worker_create has to be called with ucp_worker_params_t::thread_mode = UCS_THREAD_MODE_MULTI.
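
To illustrate why the two assertions in the log fire, here is a minimal, self-contained sketch in plain Java. It is a toy model and not UCX or jucx code: ToyWorker, progress(Runnable) and overlappingCallsFail are hypothetical names invented for this example, and the lock merely stands in for whatever a multi-thread worker mode does internally. The only part taken from the report is the shape of the check, an "inprogress" counter that must be 0 when a progress call starts.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

public class InProgressGuardSketch {

    /** Toy stand-in for a worker. Hypothetical; not the jucx UcpWorker class. */
    static final class ToyWorker {
        private final ReentrantLock lock; // null models a single-thread worker
        private int inprogress = 0;

        ToyWorker(boolean threadSafe) {
            this.lock = threadSafe ? new ReentrantLock() : null;
        }

        /** Runs body as one "progress" call, guarded like the assertion in the log. */
        void progress(Runnable body) {
            if (lock != null) {
                lock.lock();
            }
            try {
                if (inprogress++ != 0) { // another call is already inside
                    inprogress--;
                    throw new IllegalStateException("inprogress++ == 0 failed");
                }
                try {
                    body.run();
                } finally {
                    inprogress--;
                }
            } finally {
                if (lock != null) {
                    lock.unlock();
                }
            }
        }
    }

    /** Returns true if a second thread trips the guard while the first is inside. */
    static boolean overlappingCallsFail(boolean threadSafe) {
        ToyWorker worker = new ToyWorker(threadSafe);
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        AtomicBoolean failed = new AtomicBoolean(false);

        // Thread A enters progress() and stays inside until released.
        Thread a = new Thread(() -> worker.progress(() -> {
            entered.countDown();
            await(release);
        }));
        // Thread B calls progress() on the same worker while A is still inside.
        Thread b = new Thread(() -> {
            try {
                worker.progress(() -> { });
            } catch (IllegalStateException e) {
                failed.set(true);
            }
        });

        a.start();
        await(entered);
        b.start();
        if (!threadSafe) {
            join(b); // B fails immediately; with a lock it would block here instead
        }
        release.countDown();
        join(a);
        join(b);
        return failed.get();
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("single-thread worker fails: " + overlappingCallsFail(false));
        System.out.println("thread-safe worker fails:   " + overlappingCallsFail(true));
    }
}
```

In this model the unguarded worker fails as soon as two threads overlap inside progress(), which also explains why the error is intermittent in a real application: it depends on thread timing. Serializing the calls, or giving each thread its own worker, avoids the overlap.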

pereverges commented 3 months ago

I am using jucx, the Java binding. How do I set that in this case?


yosefe commented 3 months ago


See requestThreadSafety in https://github.com/openucx/ucx/blob/master/bindings/java/src/test/java/org/openucx/jucx/UcpWorkerTest.java#L41