Closed: nitbhat closed this issue 3 years ago
The above error seems to be an issue with the psm2 provider (as indicated by the stack trace).
I tried the same with the verbs provider and the program ran successfully.
Could you clarify what is passed to fi_close()? Is it the memory region?
What is the version of libfabric used to generate the stack trace?
@nitbhat I got more information from @yburette in an offline discussion. I think the issue is that the MR is being closed too early. Could you try adding the FI_DELIVERY_COMPLETE flag to the write?
Adding the FI_DELIVERY_COMPLETE flag seems to solve the issue. Thank you @j-xiong!
@nitbhat, I'm sending you my changes to machine-onesided.c via email.
This is already solved, so I am closing it now.
I have an application that uses the Zerocopy API in Charm++ with libfabric (psm2 backend). I notice that when I de-register a buffer (with fi_close) after the completion of an RDMA write operation (fi_write), a subsequent fi_cq_read crashes with the following stack trace:
I am running a non-threaded version of the application. The crash is seen when the program is run on 1 physical node with 2 or more processes. The application runs without problems when I do not perform the de-registration.
I was able to reproduce this error with libfabric 1.4, 1.5 & 1.6 on Bridges (at PSC) and Stampede (at TACC).