ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

Crash on performing fi_cq_read after a de-registration operation (fi_close) with psm2 provider #4423

Closed nitbhat closed 3 years ago

nitbhat commented 6 years ago

I have an application that uses the Zerocopy API in Charm++ on top of libfabric (psm2 provider). When I de-register (fi_close) a buffer after an RDMA write operation (fi_write) completes, a subsequent fi_cq_read crashes with the following stack trace:

#0  psmx2_cq_poll_mq (cq=cq@entry=0xca14d0, trx_ctxt=0xbc0150, 
    event_in=event_in@entry=0x7fffffffb5f0, count=count@entry=8, 
    src_addr=src_addr@entry=0x0) at prov/psm2/src/psmx2_cq.c:576
#1  0x00002aaaadc5628e in psmx2_cq_readfrom (cq=0xca14d0, buf=0x7fffffffb5f0, 
    count=8, src_addr=0x0) at prov/psm2/src/psmx2_cq.c:738
#2  0x00000000007d3895 in fi_cq_read (cq=0xca14d0, buf=0x7fffffffb5f0, count=8)
    at /home/nbhat4/software/libfabric/build/include/rdma/fi_eq.h:385
#3  0x00000000007d8d07 in process_completion_queue () at machine.C:1181
#4  0x00000000007d8ea3 in LrtsAdvanceCommunication (whileidle=0)
    at machine.C:1317
#5  0x00000000007d3424 in AdvanceCommunication (whenidle=0)
    at machine-common-core.C:1621
#6  0x00000000007d368d in CmiGetNonLocal () at machine-common-core.C:1797
#7  0x00000000007dbf62 in CsdNextMessage (s=0x7fffffffb8d0) at convcore.c:1789
#8  0x00000000007dc0af in CsdScheduleForever () at convcore.c:1922
#9  0x00000000007dc009 in CsdScheduler (maxmsgs=-1) at convcore.c:1861
#10 0x00000000007d33ed in ConverseRunPE (everReturn=0)
    at machine-common-core.C:1601
#11 0x00000000007d32f7 in ConverseInit (argc=3, argv=0x7fffffffbb28, 
    fn=0x68c7b1 <_initCharm(int, char**)>, usched=0, initret=0)
    at machine-common-core.C:1494
#12 0x000000000068daba in charm_main (argc=3, argv=0x7fffffffbb28)
    at init.C:1704
#13 0x000000000068665c in main (argc=3, argv=0x7fffffffbb28) at main.C:5

I am running a non-threaded version of the application. The crash occurs when the program is run on 1 physical node with 2 or more processes. The application runs without problems when I do not perform the de-registration operation.

I was able to reproduce this error with libfabric 1.4, 1.5 & 1.6 on Bridges (at PSC) and Stampede (at TACC).

nitbhat commented 6 years ago

The above error seems to be an issue with the psm2 provider, as indicated by the stack trace.

I tried the same with the verbs provider and the program ran successfully.

j-xiong commented 6 years ago

Could you clarify what is passed to fi_close(), the memory region?

j-xiong commented 6 years ago

What is the version of libfabric used to generate the stack trace?

j-xiong commented 6 years ago

@nitbhat I got more information from @yburette in an offline discussion. I think the issue is that the MR is being closed too early. Could you try adding the FI_DELIVERY_COMPLETE flag to the write?
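
The thread does not show the actual change, so the following is only a sketch of the ordering this suggestion implies: post the write with a delivery-complete flag, reap that write's completion from the CQ, and only then close the MR. To keep it compilable without libfabric, the transport is hidden behind hypothetical hooks (`xfer_ops`, `write_then_deregister`, and the `SKETCH_*` constants are all invented for illustration). In a real build the hooks would presumably wrap fi_writemsg() (which takes a flags argument, unlike fi_write()), fi_cq_read(), and fi_close(), and the flag would be FI_DELIVERY_COMPLETE from the libfabric headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholders so the sketch compiles alone. Real code would use
 * FI_DELIVERY_COMPLETE and -FI_EAGAIN from the libfabric headers. */
#define SKETCH_DELIVERY_COMPLETE (1ULL << 0)
#define SKETCH_EAGAIN (-11)

/* Hypothetical transport hooks (not a libfabric API). */
struct xfer_ops {
    /* Post an RDMA write tagged with ctx; would wrap fi_writemsg(). */
    int (*post_write)(void *tr, void *ctx, uint64_t flags);
    /* Returns 1 and sets *done_ctx on a completion, SKETCH_EAGAIN if the
     * queue is empty, another negative value on error; wraps fi_cq_read(). */
    int (*poll_cq)(void *tr, void **done_ctx);
    /* Release the registration; would wrap fi_close() on the MR. */
    int (*close_mr)(void *tr, void *mr);
};

/* Post the write, wait for its own completion, then de-register. */
int write_then_deregister(const struct xfer_ops *ops, void *tr,
                          void *mr, void *ctx)
{
    int rc = ops->post_write(tr, ctx, SKETCH_DELIVERY_COMPLETE);
    if (rc != 0)
        return rc;

    for (;;) {
        void *done = NULL;
        rc = ops->poll_cq(tr, &done);
        if (rc == SKETCH_EAGAIN)
            continue;           /* nothing completed yet, keep polling */
        if (rc < 0)
            return rc;          /* CQ error: leave the MR registered */
        if (done == ctx)
            break;              /* our write has been delivered */
        /* A completion for another operation; a real application would
         * dispatch it to that operation's handler here. */
    }
    return ops->close_mr(tr, mr);
}

/* Mock transport, used only to exercise the ordering. */
struct mock {
    int polls_until_done;
    void *posted_ctx;
    uint64_t posted_flags;
    int completed;
    int closed_after_completion;
};

static int mock_post(void *tr, void *ctx, uint64_t flags)
{
    struct mock *m = tr;
    m->posted_ctx = ctx;
    m->posted_flags = flags;
    return 0;
}

static int mock_poll(void *tr, void **done_ctx)
{
    struct mock *m = tr;
    if (m->polls_until_done-- > 0)
        return SKETCH_EAGAIN;
    *done_ctx = m->posted_ctx;
    m->completed = 1;
    return 1;
}

static int mock_close(void *tr, void *mr)
{
    struct mock *m = tr;
    (void)mr;
    m->closed_after_completion = m->completed;
    return 0;
}

/* Returns 1 if the MR was closed only after the write's completion. */
int demo_write_then_deregister(void)
{
    struct mock m = { .polls_until_done = 3 };
    struct xfer_ops ops = { mock_post, mock_poll, mock_close };
    int ctx = 0, mr = 0;

    int rc = write_then_deregister(&ops, &m, &mr, &ctx);
    assert(rc == 0);
    return m.closed_after_completion &&
           (m.posted_flags & SKETCH_DELIVERY_COMPLETE) != 0;
}
```

The key design choice is that de-registration sits behind the completion check rather than directly after the post call, so the buffer stays registered for as long as the provider may still reference it.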

yburette commented 6 years ago

Adding the FI_DELIVERY_COMPLETE flag seems to solve the issue. Thank you @j-xiong!

@nitbhat, I'm sending you my changes to machine-onesided.c via email.

j-xiong commented 3 years ago

This is already solved; closing it now.