hzhou opened 2 years ago:
Separate from opx, the user also reported this leak:
==24466== 32 bytes in 1 blocks are still reachable in loss record 7 of 10
==24466== at 0x4C31B25: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466== by 0x864D7E4: _dlerror_run (dlerror.c:140)
==24466== by 0x864D050: dlopen@@GLIBC_2.2.5 (dlopen.c:87)
==24466== by 0x71F92F9: ofi_load_dl_prov (fabric.c:692)
==24466== by 0x71F92F9: fi_ini (fabric.c:841)
==24466== by 0x71FA1CA: fi_getinfo (fabric.c:1094)
==24466== by 0x53259F2: find_provider (init_provider.c:115)
==24466== by 0x53259F2: MPIDI_OFI_find_provider (init_provider.c:71)
==24466== by 0x5303935: MPIDI_OFI_init_local (ofi_init.c:564)
==24466== by 0x52AD058: MPID_Init (ch4_init.c:508)
==24466== by 0x521709B: MPII_Init_thread (mpir_init.c:230)
==24466== by 0x5217864: MPIR_Init_impl (mpir_init.c:102)
==24466== by 0x4F66814: internal_Init (init.c:53)
==24466== by 0x4F66814: PMPI_Init (init.c:105)
==24466== by 0x1088BA: main (test_mpiio.c:5)
==24466==
==24466== 61 bytes in 1 blocks are still reachable in loss record 8 of 10
==24466== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466== by 0x4017880: _dl_exception_create (dl-exception.c:77)
==24466== by 0x7BFD250: _dl_signal_error (dl-error-skeleton.c:117)
==24466== by 0x4009812: _dl_map_object (dl-load.c:2384)
==24466== by 0x4014EE3: dl_open_worker (dl-open.c:235)
==24466== by 0x7BFD2DE: _dl_catch_exception (dl-error-skeleton.c:196)
==24466== by 0x40147C9: _dl_open (dl-open.c:605)
==24466== by 0x864CF95: dlopen_doit (dlopen.c:66)
==24466== by 0x7BFD2DE: _dl_catch_exception (dl-error-skeleton.c:196)
==24466== by 0x7BFD36E: _dl_catch_error (dl-error-skeleton.c:215)
==24466== by 0x864D734: _dlerror_run (dlerror.c:162)
==24466== by 0x864D050: dlopen@@GLIBC_2.2.5 (dlopen.c:87)
I haven't tried to reproduce it yet. Does anyone have any clues?
Thank you for reporting this; we have opened an internal Jira to investigate and fix the OPX memory leaks.
Hey @hzhou! We have fixed these memory leaks internally and intend to upstream the fixes in the near future.
Hello again @hzhou! We (Cornelis) patched this issue with the 1.17.1 release and, using valgrind, we are no longer able to reproduce the opx-specific memory leaks reported above.
Would you be willing to confirm that you are no longer seeing the opx-specific leaks (using either main or 1.17.1)?
@hzhou reminder
Sorry about that. Let me test it.
@hzhou Thanks for getting back to me and for retesting! t.log shows we've resolved the originally reported memory leaks, but sadly it seems we (opx) have introduced some new leaks. We'll take care of these new leaks and then ping here again.
We are still seeing the leak using main. Log attached: t.log
Hi @hzhou. Can you share more on how you generated this report file? Thanks.
@hzhou We upstreamed all of our changes to address these memory leaks in December. We are not seeing any remaining OPX memory leaks, especially with Intel MPI or MPICH; we are still working on some OpenMPI leaks. Could you confirm which leaks you are seeing on your end?
Describe the bug
We are seeing memory leaks with valgrind. Looking at the code, it appears cleanup is missing for the various default attrs: https://github.com/ofiwg/libfabric/blob/a2120d6fc633abb76f158ca4fce79a0a80a62a70/prov/opx/src/fi_opx_init.c#L481-L494
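For illustration, here is a minimal C sketch of the kind of teardown that seems to be missing. The struct and field names (opx_global_state, default_domain_attr, and so on) are placeholders of my own, not names from the actual opx source; the point is only that default attrs allocated at provider init need a matching free in the provider's cleanup path.

```c
#include <stdlib.h>
#include <rdma/fabric.h>

/* Hypothetical global state holding default attrs allocated during init. */
struct opx_global_state {
	struct fi_domain_attr *default_domain_attr;
	struct fi_ep_attr     *default_ep_attr;
	struct fi_tx_attr     *default_tx_attr;
	struct fi_rx_attr     *default_rx_attr;
};

static struct opx_global_state opx_global;

/* Would be called from the provider's fini hook so valgrind no longer
 * reports these blocks at exit. Note that any strings nested inside the
 * attrs would need to be freed as well. */
static void opx_free_default_attrs(void)
{
	free(opx_global.default_domain_attr);
	free(opx_global.default_ep_attr);
	free(opx_global.default_tx_attr);
	free(opx_global.default_rx_attr);

	opx_global.default_domain_attr = NULL;
	opx_global.default_ep_attr = NULL;
	opx_global.default_tx_attr = NULL;
	opx_global.default_rx_attr = NULL;
}
```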
To Reproduce
I believe this can be reproduced by running valgrind on fi_info (a standalone reproducer is sketched below as well).

Environment: Linux
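For reference, a standalone reproducer along those lines might look like the sketch below; it is an assumption on my part, not a file from this report. It exercises the same fi_getinfo() path shown in the traces above. Build with something like `gcc repro.c -o repro -lfabric`, run it under `valgrind --leak-check=full --show-leak-kinds=all ./repro`, and adjust the requested API version to match your install.

```c
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>

int main(void)
{
	struct fi_info *hints = fi_allocinfo();
	struct fi_info *info = NULL;
	int ret;

	if (!hints)
		return 1;

	/* Optional: restrict the query to the opx provider. */
	hints->fabric_attr->prov_name = strdup("opx");

	/* Triggers provider loading (fi_ini / ofi_load_dl_prov) as in the traces. */
	ret = fi_getinfo(FI_VERSION(1, 17), NULL, NULL, 0, hints, &info);
	if (ret)
		fprintf(stderr, "fi_getinfo failed: %d\n", ret);
	else
		fi_freeinfo(info);

	fi_freeinfo(hints);
	return ret ? 1 : 0;
}
```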