ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/psm3: Sometimes doesn't clean up stale shm regions #9893

Open zachdworkin opened 4 months ago

zachdworkin commented 4 months ago

PSM3 sometimes doesn't clean up stale shm regions. After a test is done, you can still see some psm3.xxx.xxx shm regions in /dev/shm.

Run a test several times, then check /dev/shm and notice there are open shm regions that aren't cleaned up.
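A quick way to check for leftover regions after a run (the glob assumes the `psm3` name prefix mentioned above; actual segment names vary):

```shell
# List any PSM3 shared-memory segments left behind in /dev/shm.
# Prints a note instead of an error when the directory is clean.
ls -l /dev/shm/psm3* 2>/dev/null || echo "no stale psm3 regions found"
```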

It is expected that tests clean up all of their resources on completion.

Fabtests don't appear to report any errors; they just leak the stale shm regions.

Environment: Rhel 8.6, 8.7, and 9.3, psm3

raffenet commented 4 months ago

We have noticed this in our internal MPICH tests. /dev/shm fills up with psm3 segments and makes shared memory unusable on the system without manual cleanup.

ToddRimmer commented 4 months ago

If a job fails or is killed, the provider has no chance to clean up shared memory. I am not aware of any issues with cleanup for normally completed jobs. For failed jobs, cleanup such as the following is required and is typically integrated into the job scheduler's post-job cleanup: rm -rf /dev/shm/psm3_shm. /dev/shm/sem.psm3_nic_affinity /dev/shm/psm3_nic_affinity*

If you are seeing a lack of cleanup for jobs which completed normally and fully shut down all OFI interfaces, please provide more details. I do not see such issues with OpenMPI or Intel MPI.

zachdworkin commented 4 months ago

@ToddRimmer in prov/shm there is a signal handler that cleans up the /dev/shm region in the event of a failure signal or abrupt failure: https://github.com/ofiwg/libfabric/blob/main/prov/shm/src/smr_signal.h. We don't do this manual cleanup for psm3, and it looks like mpich/argonne doesn't either. From what I can tell, the "normal completion" path is okay. Can a signal handler like the one in prov/shm be implemented to solve this failure/interrupt path issue?

ToddRimmer commented 4 months ago

In our experience, a common job failure mode is for middleware or some other layer to call abort(), which results in a segfault signal and no guaranteed clean way for the provider to clean up, since the process memory state is suspect. So our recommendation has been to place the cleanup commands in the job scheduler's epilog script; the commands are provided in a script documented in our Intel Ethernet Fabric Suite package. This approach has worked well in the field for over a decade with PSM3 and its predecessors.

We can explore a signal handler for some other signals, but given the need to address the common abort() case, the epilog approach is simple and covers all cases.

zachdworkin commented 4 months ago

@ToddRimmer: Does abort() send a SIGSEGV (segmentation fault signal)? I thought it sends SIGABRT (abort signal) to the running process. In either case, a signal handler would be able to catch both of these signals (or any signal except SIGKILL) and clean up the /dev/shm/psm3* files before completing the signal. This is already done and working for the shm provider, so maybe we can add it as an enhancement for psm3? I think having to add extra cleanup outside of the code, for something the code can and should handle, doesn't make sense in the long run.

I can add your workaround ("rm -rf /dev/shm/psm3_shm. /dev/shm/sem.psm3_nic_affinity /dev/shm/psm3_nic_affinity*") to our Slurm epilog to handle this cleanup until the enhancement is complete.
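A sketch of what such an epilog fragment might look like, following the paths Todd listed. The `SHM_DIR` variable is an addition for testability outside /dev/shm; a real epilog would operate on /dev/shm directly:

```shell
#!/bin/sh
# Hypothetical post-job cleanup of leftover PSM3 shared-memory files.
# SHM_DIR defaults to /dev/shm; override it when testing in a sandbox.
SHM_DIR="${SHM_DIR:-/dev/shm}"

rm -rf "$SHM_DIR"/psm3_shm* \
       "$SHM_DIR"/sem.psm3_nic_affinity \
       "$SHM_DIR"/psm3_nic_affinity*
```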

ToddRimmer commented 4 months ago

Exploring. We do have an existing signal handler; can you try an experiment with PSM3_BACKTRACE=1? In that case our signal handler gets installed, and it will call exit() for SIGTERM and SIGINT. We also have an atexit() function installed to do the cleanup, so I suspect that in your case exit() is not being called. Granted, exit() is defined as "not safe" for use in a signal handler, but for SIGINT and SIGTERM it is typically safe enough. If this addresses the issue, you have a workaround, and if I can be confident that some layer above won't override our signal handler, I could enable it in general (I believe in the past we had such situations, so we were only able to do the atexit() approach).