philip-davis closed this 5 years ago
Gemini stores the pointer as well, so this is a problem on Titan too. Titan seems to handle memory allocation differently from Cori, so Cori has exposed other memory errors before. I will have to change every place where GNI_PostRdma is called with a pointer to a stack variable, and also add a free() call in __process_event.
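For anyone hitting the same thing, here is a minimal sketch of the fix pattern, not the actual DART code. It does not call GNI at all: mock_post and mock_get_completed are hypothetical stand-ins for GNI_PostRdma and GNI_GetCompleted, and struct mock_desc is a made-up descriptor type. The only assumption carried over from this thread is that the post call keeps the caller's pointer and the completion call hands the same pointer back.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a post descriptor; not the real GNI type. */
struct mock_desc {
    int  request_id;
    char payload[16];
};

/* Mock "NIC" state: it keeps the caller's pointer instead of copying the
 * descriptor, which is the behavior reported above for Aries and Gemini. */
static struct mock_desc *pending = NULL;

/* Stand-in for the post call: stores the pointer, copies nothing. */
static void mock_post(struct mock_desc *d)
{
    pending = d;
}

/* Stand-in for the completion query: returns the pointer that was posted. */
static struct mock_desc *mock_get_completed(void)
{
    struct mock_desc *d = pending;
    pending = NULL;
    return d;
}

/* Fixed pattern: the descriptor lives on the heap, so it stays valid after
 * this function returns. A stack variable here would leave the mock NIC
 * holding a dangling pointer. */
static int post_request(int request_id)
{
    struct mock_desc *d = calloc(1, sizeof *d);
    if (d == NULL)
        return -1;
    d->request_id = request_id;
    strncpy(d->payload, "hello", sizeof d->payload - 1);
    mock_post(d);
    return 0;
}

/* Plays the role of __process_event: read the completed descriptor, then
 * free it, since the poster no longer owns it. Returns the request id, or
 * -1 if nothing has completed. */
static int process_event(void)
{
    struct mock_desc *d = mock_get_completed();
    if (d == NULL)
        return -1;
    int id = d->request_id;
    free(d);
    return id;
}
```

Calling post_request(42) followed by process_event() hands the descriptor from the poster to the event handler, which becomes the single place responsible for freeing it.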
Implemented the fix in the debug branch (for DART, not DIMES; I'm about 85% sure that DIMES does not have this problem, but I need to trace the RPC creation code backwards to be sure). ORNL is checking whether this resolves their issue.
Resolved!
See, for example, rpc_fetch_request and rpc_post_request in dart_rpc_gni.c. The pointer passed in is stored (at least on Aries) rather than used to copy the data, so problems arise once the variable it points to goes out of scope. I am checking on Titan to see whether Gemini behaves the same way. This appears to be the cause of the processing errors in __process_event when running GNI_GetCompleted.