openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Assertion `(arbiter)->guard == 0' failed during ucp Active Message processing #3977

Closed tjcw closed 5 years ago

tjcw commented 5 years ago

I am porting an application from IBM PAMI to UCX. The application uses ucp active messages. I am running on 1 node with 4 processes. I am seeing an assertion failure in UCX with the following trace:

[1,3]<stderr>:multicast_over_send():811 .. Making active message call dispatch=16303 destination=1
[1,3]<stderr>:[f8n02:141541:0:141541]     arbiter.c:213  Assertion `(arbiter)->guard == 0' failed: scheduling group from the arbiter callback
[1,3]<stderr>:==== backtrace ====
[1,3]<stderr>:    0  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libucs.so.0(ucs_fatal_error_message+0xa8) [0x200003108438]
[1,3]<stderr>:    1  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libucs.so.0(ucs_fatal_error_format+0xa4) [0x2000031085c4]
[1,3]<stderr>:    2  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libucs.so.0(ucs_arbiter_group_schedule_nonempty+0x114) [0x2000030ffb44]
[1,3]<stderr>:    3  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libuct.so.0(uct_mm_ep_pending_add+0x104) [0x2000030a8484]
[1,3]<stderr>:    4  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libucp.so.0(ucp_request_pending_add+0x70) [0x200003005d00]
[1,3]<stderr>:    5  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libucp.so.0(+0x14c20) [0x200002ff4c20]
[1,3]<stderr>:    6  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libucp.so.0(ucp_am_send_nb+0x6dc) [0x200002ff8d3c]
[1,3]<stderr>:    7  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/pami_port/libcollectives.so.3(_ZN7LibColl21UCXNativeInterfaceP2PILb1ELb0EE19multicast_over_sendEPNS1_20p2p_multicast_send_tEm+0x270) [0x200002d91570]
[1,3]<stderr>:    8  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/pami_port/libcollectives.so.3(_ZN7LibColl21UCXNativeInterfaceP2PILb1ELb0EE9multicastEPNS_14ni_multicast_tE+0x35c) [0x200002d91b7c]
[1,3]<stderr>:    9  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/pami_port/libcollectives.so.3(_ZN4CCMI8Executor9AllreduceINS_17ConnectionManager14CommSeqConnMgrEE7advanceEv+0x804) [0x200002d2b7b4]
[1,3]<stderr>:   10  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/pami_port/libcollectives.so.3(_ZN4CCMI8Executor9AllreduceINS_17ConnectionManager14CommSeqConnMgrEE20staticNotifySendDoneEPvS5_16libcoll_result_t+0xcc) [0x200002d2bd7c]
[1,3]<stderr>:   11  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/pami_port/libcollectives.so.3(_ZN7LibColl21UCXNativeInterfaceP2PILb1ELb0EE16UCXsendMcastDoneEPv12ucs_status_t+0x230) [0x200002cbd260]
[1,3]<stderr>:   12  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libucp.so.0(+0x14b34) [0x200002ff4b34]
[1,3]<stderr>:   13  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libuct.so.0(uct_mm_ep_process_pending+0x8c) [0x2000030a852c]
[1,3]<stderr>:   14  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0x174) [0x2000030ffcd4]
[1,3]<stderr>:   15  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libuct.so.0(uct_mm_iface_progress+0x15c) [0x2000030a581c]
[1,3]<stderr>:   16  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libucp.so.0(ucp_worker_progress+0xa4) [0x20000300e314]
[1,3]<stderr>:   17  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/pami_port/libcollectives.so.3(_ZN7LibColl7Adapter10UCXContext7advanceEv+0x48) [0x200002ca2aa8]
[1,3]<stderr>:   18  /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/pami_port/libcollectives.so.3(LIBCOLL_Advance+0x28) [0x200002c85638]
...

This is process 3 attempting to send an active message to process 1 during a ucp_worker_progress call. Process 1 gets a similar error; it is attempting to send an active message to process 3 during its own ucp_worker_progress call. I am using UCX 'master', downloaded last week. The active message size is 512kB + 32 bytes. Can I report anything more to help resolve this problem?

snyjm-18 commented 5 years ago

To make sure I understand: you have 3 outstanding ucp_am_send calls on the initiator and a call to ucp_worker_progress to ensure completion, and during this ucp_worker_progress call you get the error?

tjcw commented 5 years ago

I'm not sure how many outstanding ucp_am_send calls there will be; somewhere from 1 to 4 (if you count a possible self-send). The application is doing a 4-process 'allreduce', but I'm not sure whether it implements this as a 4-way all-to-all exchange or with some other algorithm. I get the error during ucp_worker_progress, when it calls the application to drive an AM completion upcall and that upcall makes a further AM call.

snyjm-18 commented 5 years ago

I am trying to reproduce the issue and have been unable to. I am using the following code. The variable "i" goes above 512kB.

        sstatus[0] = ucp_am_send_nb(receiver().ep(), am_id,
                                 buf.data(), 1, ucp_dt_make_contig(i),
                                 (ucp_send_callback_t) ucs_empty_function,
                                 send_reply);

        sstatus[1] = ucp_am_send_nb(receiver().ep(), am_id,
                                 buf.data(), 1, ucp_dt_make_contig(i),
                                 (ucp_send_callback_t) ucs_empty_function,
                                 send_reply);

        sstatus[2] = ucp_am_send_nb(receiver().ep(), am_id,
                                 buf.data(), 1, ucp_dt_make_contig(i),
                                 (ucp_send_callback_t) ucs_empty_function,
                                 send_reply);

        EXPECT_FALSE(UCS_PTR_IS_ERR(sstatus[0]));
        EXPECT_FALSE(UCS_PTR_IS_ERR(sstatus[1]));
        EXPECT_FALSE(UCS_PTR_IS_ERR(sstatus[2]));
        ucp_worker_progress(sender().worker());

        wait(sstatus[0]);
        wait(sstatus[1]);
        wait(sstatus[2]);

I am going to try not reusing the same buffer, but I doubt that'll change anything. Is there something that you are doing that I am not including?

snyjm-18 commented 5 years ago

Can I also see your sender completion callback?

snyjm-18 commented 5 years ago

Could you also set "export UCX_LOG_LEVEL=poll", run it, and attach the output, probably as a text file, because there will be a lot of it.

snyjm-18 commented 5 years ago

Based on the assertion and where it fires, it looks like the completion callback is calling something that it should not. For example, ucp_worker_progress() cannot be called from within ucp_worker_progress(), although I don't think that's the issue here because it gives a different error message. @yosefe may be able to help as well.

tjcw commented 5 years ago

I am making additional active message calls in the completion callback. Is this allowed? Take a look at the stack trace above to see what is going on. I'll also add the trace from process 3 (the target of the active message), which fails the same assertion at the same time. (I am not calling ucp_worker_progress() from within ucp_worker_progress().)
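
To make the pattern concrete, here is a minimal sketch of what the backtrace shows; the hop_ctx_t struct and every name in it are hypothetical, not the application's actual code:

    #include <stdint.h>
    #include <ucp/api/ucp.h>

    /* Hypothetical per-destination context. */
    typedef struct {
        ucp_ep_h  next_ep;  /* endpoint of the next destination        */
        uint16_t  am_id;    /* dispatch id (16303 in the trace)        */
        void     *buf;      /* payload, 512kB + 32 bytes in this case  */
        size_t    len;
    } hop_ctx_t;

    static hop_ctx_t g_hop;

    /* Send completion callback that immediately issues the next AM send
     * (frames 11 down to 6 in the backtrace). When the new send cannot
     * go out inline and is queued as pending, uct_mm_ep_pending_add ends
     * up scheduling an arbiter group from inside the arbiter dispatch,
     * which trips the `(arbiter)->guard == 0' assertion. */
    static void send_done_cb(void *request, ucs_status_t status)
    {
        ucs_status_ptr_t sp;

        sp = ucp_am_send_nb(g_hop.next_ep, g_hop.am_id, g_hop.buf, 1,
                            ucp_dt_make_contig(g_hop.len),
                            send_done_cb, 0);
        (void)sp;                  /* error handling elided */
        ucp_request_free(request);
    }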

tjcw commented 5 years ago

Process 3 trace: identical to the backtrace above (the same arbiter.c:213 assertion, with ucp_am_send_nb called from the send completion upcall inside ucp_worker_progress).

snyjm-18 commented 5 years ago

Yes, you should be able to do an active message send from a completion callback, although I'll have to think about it. Can you set "export UCX_LOG_LEVEL=poll" and send me the trace? That will help me see how the arbiter value is changing.

tjcw commented 5 years ago

I turned trace on. The test case ran cleanly 2 times, then trapped in the same way on the 3rd run. I have 69 MB of trace; how can I get it to you?

snyjm-18 commented 5 years ago

You can email it to me: jms285@duke.edu

tjcw commented 5 years ago

Here is the trace, compressed with 'zip': run.log.zip

tjcw commented 5 years ago

@snyjm-18 You asked for the completion callback, but it isn't very revealing unless I give you the whole application, and that isn't open source. So I think we'll have to try debugging this with whatever trace can be turned on, and by writing scaffolding code to mimic what the application does.

snyjm-18 commented 5 years ago

@tjcw I understand. I'm at a bit of a loss as to what could be causing this, so I may have to wait for input from @yosefe before I continue. Can you try using a different transport for intra-node communication and see if that helps? And just to double-check: you are not also mixing any uct calls with the ucp calls?

tjcw commented 5 years ago

How do I select a transport to use for intra-node communication? (This test case runs all 4 processes on 1 node.)

snyjm-18 commented 5 years ago

    export UCX_TLS=name_of_transport1,...,name_of_transportN

This should help as well: https://github.com/openucx/ucx/wiki/UCX-environment-parameters
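
For example, a plausible way to take the failing shared-memory (mm) path out of the picture on a single node is to force TCP plus loopback; the transport choice and application name here are illustrative, not taken from this thread:

    # hypothetical: use TCP + loopback instead of shared memory
    export UCX_TLS=tcp,self
    mpirun -np 4 ./app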

tjcw commented 5 years ago

No, I am not making any UCT calls.

yosefe commented 5 years ago

@tjcw it's not allowed to make additional send calls from a send completion callback. The intent was to allow it, but it's not really supported now.

snyjm-18 commented 5 years ago

Sorry, my mistake on that. However, you can make additional ucp_am_send_nb calls from active message handlers.
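
A minimal sketch of that allowed pattern, assuming the experimental AM API of that era (ucp_worker_set_am_handler with a ucp_am_callback_t); the id, buffer, and handler names are illustrative:

    #include <ucp/api/ucp.h>

    #define AM_ID 16303               /* dispatch id seen in the trace */

    static char g_reply_buf[64];      /* hypothetical app-owned payload */

    static void empty_send_cb(void *request, ucs_status_t status)
    {
        /* nothing to do on send completion */
    }

    /* AM receive handler; issuing a further AM send from here is fine.
     * reply_ep is only valid if the sender passed UCP_AM_SEND_REPLY. */
    static ucs_status_t am_recv_handler(void *arg, void *data, size_t length,
                                        ucp_ep_h reply_ep, unsigned flags)
    {
        ucs_status_ptr_t sp;

        sp = ucp_am_send_nb(reply_ep, AM_ID, g_reply_buf, 1,
                            ucp_dt_make_contig(sizeof(g_reply_buf)),
                            empty_send_cb, 0);
        (void)sp;      /* error handling elided */
        return UCS_OK; /* the incoming data is not retained */
    }

    /* Registration, given an existing ucp_worker_h worker:
     *   ucp_worker_set_am_handler(worker, AM_ID, am_recv_handler, NULL, 0);
     */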

tjcw commented 5 years ago

I have found a way of deferring the application's completion callback (and hence the additional active message calls) until after the ucp completion callback returns, and this makes my test case pass. A possible unfortunate side effect is that I now refer to the ucp request structure after making a ucp_request_free call for it; this may be OK, since I don't think the request structure can be reused until after my reference to it is complete. I will leave this issue open for the time being, for comments on the above and for a discussion of how hard it would be to support making active message calls from within a completion callback. I have not tried selecting a subset of transports, as I do not currently need to.
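
A minimal sketch of this kind of deferral, simplified to a single outstanding send (all names are hypothetical): the completion callback only records the completed request, and the progress loop runs the application upcall, including any further AM sends, after ucp_worker_progress has returned.

    #include <ucp/api/ucp.h>

    /* Hypothetical application hook; it may now safely call
     * ucp_am_send_nb because it runs outside any UCX callback. */
    void app_handle_completion(void *request);

    /* One deferred-work slot; a real application would keep a queue. */
    static struct {
        int   pending;
        void *request;
    } g_deferred;

    /* ucp send completion callback: record the work, send nothing here. */
    static void send_done_cb(void *request, ucs_status_t status)
    {
        g_deferred.request = request;
        g_deferred.pending = 1;
    }

    static void progress_and_drain(ucp_worker_h worker)
    {
        while (ucp_worker_progress(worker)) {
            /* drive UCX until no further progress is made */
        }
        if (g_deferred.pending) {
            g_deferred.pending = 0;
            app_handle_completion(g_deferred.request);
            ucp_request_free(g_deferred.request); /* freed after last use */
        }
    }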

shamisp commented 5 years ago

@yosefe In ucp.h we don't state anywhere that we have this constraint. It sounds like a temporary (?) issue, but it is something that has to be documented.

snyjm-18 commented 5 years ago

@tjcw if you do any ucp sends between the ucp_request_free call and the further referencing, the request may be reused. Even if you don't make any send calls, it is still not guaranteed to be safe to use. @yosefe correct me if I am wrong.

yosefe commented 5 years ago

@snyjm-18 you are right. According to the UCP API, the request must not be used after calling ucp_request_free.
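
For reference, a minimal sketch of the conforming order, using the standard request-status API (ucp_request_check_status and the UCS_PTR macros): finish every use of the request before freeing it.

    #include <ucp/api/ucp.h>

    /* Wait for a non-blocking operation to complete, then release the
     * request; the request must not be touched after ucp_request_free. */
    static ucs_status_t wait_and_free(ucp_worker_h worker, ucs_status_ptr_t sp)
    {
        ucs_status_t status;

        if (!UCS_PTR_IS_PTR(sp)) {
            return UCS_PTR_STATUS(sp); /* completed inline, or an error */
        }
        while ((status = ucp_request_check_status(sp)) == UCS_INPROGRESS) {
            ucp_worker_progress(worker);
        }
        ucp_request_free(sp);          /* last touch of the request */
        return status;
    }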

tjcw commented 5 years ago

OK, I have revised my code to defer freeing the request until after I have finished using it. So now I am in conformance with the UCP API.

tjcw commented 5 years ago

No longer a problem since I am deferring the active message call until after the callback completes. Closing.