ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/opx: memory leak #8091

Open hzhou opened 2 years ago

hzhou commented 2 years ago

Describe the bug: We are seeing memory leaks with valgrind:

$ mpirun -n 1 valgrind --leak-check=full --show-leak-kinds=all ./cpi
==726034== Memcheck, a memory error detector
==726034== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==726034== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==726034== Command: ./cpi
==726034==
Process 0 of 1 is on tiger
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.004254
==726034==
==726034== HEAP SUMMARY:
==726034==     in use at exit: 437 bytes in 5 blocks
==726034==   total heap usage: 10,479 allocs, 10,474 frees, 12,864,283 bytes allocated
==726034==
==726034== 5 bytes in 1 blocks are still reachable in loss record 1 of 5
==726034==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==726034==    by 0x741538E: strdup (strdup.c:42)
==726034==    by 0x6CBEE37: fi_opx_alloc_default_domain_attr (fi_opx_domain.c:121)
==726034==    by 0x6CE91DC: fi_opx_ini (fi_opx_init.c:467)
==726034==    by 0x6B7B077: fi_ini (fabric.c:857)
==726034==    by 0x6B7B97B: fi_getinfo (fabric.c:1094)
==726034==    by 0x4B68248: find_provider (init_provider.c:115)
==726034==    by 0x4B68248: MPIDI_OFI_find_provider (init_provider.c:71)
==726034==    by 0x4B7E549: MPIDI_OFI_init_local (ofi_init.c:564)
==726034==    by 0x4BA6B6E: MPID_Init (ch4_init.c:508)
==726034==    by 0x4B0DD29: MPII_Init_thread (mpir_init.c:230)
==726034==    by 0x4B0E729: MPIR_Init_impl (mpir_init.c:102)
==726034==    by 0x499EE71: internal_Init (c_binding.c:45877)
==726034==    by 0x499EE71: PMPI_Init (c_binding.c:45929)
==726034==
==726034== 64 bytes in 1 blocks are still reachable in loss record 2 of 5
==726034==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==726034==    by 0x6CE4175: fi_opx_alloc_default_rx_attr (fi_opx_ep.c:1418)
==726034==    by 0x6CE922A: fi_opx_ini (fi_opx_init.c:478)
==726034==    by 0x6B7B077: fi_ini (fabric.c:857)
==726034==    by 0x6B7B97B: fi_getinfo (fabric.c:1094)
==726034==    by 0x4B68248: find_provider (init_provider.c:115)
==726034==    by 0x4B68248: MPIDI_OFI_find_provider (init_provider.c:71)
==726034==    by 0x4B7E549: MPIDI_OFI_init_local (ofi_init.c:564)
==726034==    by 0x4BA6B6E: MPID_Init (ch4_init.c:508)
==726034==    by 0x4B0DD29: MPII_Init_thread (mpir_init.c:230)
==726034==    by 0x4B0E729: MPIR_Init_impl (mpir_init.c:102)
==726034==    by 0x499EE71: internal_Init (c_binding.c:45877)
==726034==    by 0x499EE71: PMPI_Init (c_binding.c:45929)
==726034==    by 0x10933C: main (in /home/hzhou/work/pull_requests/2210_hydra_pg/cpi)
==726034==
==726034== 80 bytes in 1 blocks are still reachable in loss record 3 of 5
==726034==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==726034==    by 0x6CE4312: fi_opx_alloc_default_tx_attr (fi_opx_ep.c:1459)
==726034==    by 0x6CE9210: fi_opx_ini (fi_opx_init.c:475)
==726034==    by 0x6B7B077: fi_ini (fabric.c:857)
==726034==    by 0x6B7B97B: fi_getinfo (fabric.c:1094)
==726034==    by 0x4B68248: find_provider (init_provider.c:115)
==726034==    by 0x4B68248: MPIDI_OFI_find_provider (init_provider.c:71)
==726034==    by 0x4B7E549: MPIDI_OFI_init_local (ofi_init.c:564)
==726034==    by 0x4BA6B6E: MPID_Init (ch4_init.c:508)
==726034==    by 0x4B0DD29: MPII_Init_thread (mpir_init.c:230)
==726034==    by 0x4B0E729: MPIR_Init_impl (mpir_init.c:102)
==726034==    by 0x499EE71: internal_Init (c_binding.c:45877)
==726034==    by 0x499EE71: PMPI_Init (c_binding.c:45929)
==726034==    by 0x10933C: main (in /home/hzhou/work/pull_requests/2210_hydra_pg/cpi)
==726034==
==726034== 96 bytes in 1 blocks are still reachable in loss record 4 of 5
==726034==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==726034==    by 0x6CE4551: fi_opx_alloc_default_ep_attr (fi_opx_ep.c:1508)
==726034==    by 0x6CE91F6: fi_opx_ini (fi_opx_init.c:471)
==726034==    by 0x6B7B077: fi_ini (fabric.c:857)
==726034==    by 0x6B7B97B: fi_getinfo (fabric.c:1094)
==726034==    by 0x4B68248: find_provider (init_provider.c:115)
==726034==    by 0x4B68248: MPIDI_OFI_find_provider (init_provider.c:71)
==726034==    by 0x4B7E549: MPIDI_OFI_init_local (ofi_init.c:564)
==726034==    by 0x4BA6B6E: MPID_Init (ch4_init.c:508)
==726034==    by 0x4B0DD29: MPII_Init_thread (mpir_init.c:230)
==726034==    by 0x4B0E729: MPIR_Init_impl (mpir_init.c:102)
==726034==    by 0x499EE71: internal_Init (c_binding.c:45877)
==726034==    by 0x499EE71: PMPI_Init (c_binding.c:45929)
==726034==    by 0x10933C: main (in /home/hzhou/work/pull_requests/2210_hydra_pg/cpi)
==726034==
==726034== 192 bytes in 1 blocks are still reachable in loss record 5 of 5
==726034==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==726034==    by 0x6CBEDEA: fi_opx_alloc_default_domain_attr (fi_opx_domain.c:112)
==726034==    by 0x6CE91DC: fi_opx_ini (fi_opx_init.c:467)
==726034==    by 0x6B7B077: fi_ini (fabric.c:857)
==726034==    by 0x6B7B97B: fi_getinfo (fabric.c:1094)
==726034==    by 0x4B68248: find_provider (init_provider.c:115)
==726034==    by 0x4B68248: MPIDI_OFI_find_provider (init_provider.c:71)
==726034==    by 0x4B7E549: MPIDI_OFI_init_local (ofi_init.c:564)
==726034==    by 0x4BA6B6E: MPID_Init (ch4_init.c:508)
==726034==    by 0x4B0DD29: MPII_Init_thread (mpir_init.c:230)
==726034==    by 0x4B0E729: MPIR_Init_impl (mpir_init.c:102)
==726034==    by 0x499EE71: internal_Init (c_binding.c:45877)
==726034==    by 0x499EE71: PMPI_Init (c_binding.c:45929)
==726034==    by 0x10933C: main (in /home/hzhou/work/pull_requests/2210_hydra_pg/cpi)
==726034==
==726034== LEAK SUMMARY:
==726034==    definitely lost: 0 bytes in 0 blocks
==726034==    indirectly lost: 0 bytes in 0 blocks
==726034==      possibly lost: 0 bytes in 0 blocks
==726034==    still reachable: 437 bytes in 5 blocks
==726034==         suppressed: 0 bytes in 0 blocks
==726034==
==726034== For lists of detected and suppressed errors, rerun with: -s
==726034== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Looking at the code, it appears to be missing cleanup for the various default attrs allocated in fi_opx_ini - https://github.com/ofiwg/libfabric/blob/a2120d6fc633abb76f158ca4fce79a0a80a62a70/prov/opx/src/fi_opx_init.c#L481-L494
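A minimal sketch of the kind of cleanup that seems to be missing, assuming the defaults are allocated once at init and should be released at finalization. All names here (`demo_ini`, `demo_fini`, `struct demo_defaults`) are hypothetical stand-ins, not the real OPX types or functions, and the allocation sizes are placeholders:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for a provider's default attribute structs.
 * These are NOT the real OPX types or function names. */
struct demo_domain_attr {
    char *name;  /* separately allocated string, like the strdup() in the trace */
};

struct demo_defaults {
    struct demo_domain_attr *domain_attr;
    void *ep_attr;
    void *tx_attr;
    void *rx_attr;
};

static struct demo_defaults g_defaults;

/* Release everything demo_ini() allocated. Safe after a partial init
 * and safe to call more than once, since free(NULL) is a no-op. */
void demo_fini(void)
{
    if (g_defaults.domain_attr) {
        free(g_defaults.domain_attr->name);
        free(g_defaults.domain_attr);
    }
    free(g_defaults.ep_attr);
    free(g_defaults.tx_attr);
    free(g_defaults.rx_attr);
    g_defaults = (struct demo_defaults){0};
}

/* Allocate the default attributes; returns 0 on success, -1 on failure. */
int demo_ini(void)
{
    static const char name[] = "demo";

    g_defaults.domain_attr = calloc(1, sizeof(*g_defaults.domain_attr));
    if (!g_defaults.domain_attr)
        goto err;

    g_defaults.domain_attr->name = malloc(sizeof(name));
    if (!g_defaults.domain_attr->name)
        goto err;
    memcpy(g_defaults.domain_attr->name, name, sizeof(name));

    /* Placeholder sizes; the real attribute structs differ. */
    g_defaults.ep_attr = calloc(1, 96);
    g_defaults.tx_attr = calloc(1, 80);
    g_defaults.rx_attr = calloc(1, 64);
    if (!g_defaults.ep_attr || !g_defaults.tx_attr || !g_defaults.rx_attr)
        goto err;

    return 0;

err:
    demo_fini();
    return -1;
}

/* Helper for checking state in a test or debugger. */
int demo_is_initialized(void)
{
    return g_defaults.domain_attr != NULL;
}
```

Calling `demo_ini()` once at startup and `demo_fini()` from the finalization path would leave none of these blocks "still reachable" at exit.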

To Reproduce: I believe this can be reproduced by running valgrind on fi_info.

Environment: Linux

hzhou commented 2 years ago

Separately from opx, a user also reported this leak:

==24466== 32 bytes in 1 blocks are still reachable in loss record 7 of 10
==24466==    at 0x4C31B25: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466==    by 0x864D7E4: _dlerror_run (dlerror.c:140)
==24466==    by 0x864D050: dlopen@@GLIBC_2.2.5 (dlopen.c:87)
==24466==    by 0x71F92F9: ofi_load_dl_prov (fabric.c:692)
==24466==    by 0x71F92F9: fi_ini (fabric.c:841)
==24466==    by 0x71FA1CA: fi_getinfo (fabric.c:1094)
==24466==    by 0x53259F2: find_provider (init_provider.c:115)
==24466==    by 0x53259F2: MPIDI_OFI_find_provider (init_provider.c:71)
==24466==    by 0x5303935: MPIDI_OFI_init_local (ofi_init.c:564)
==24466==    by 0x52AD058: MPID_Init (ch4_init.c:508)
==24466==    by 0x521709B: MPII_Init_thread (mpir_init.c:230)
==24466==    by 0x5217864: MPIR_Init_impl (mpir_init.c:102)
==24466==    by 0x4F66814: internal_Init (init.c:53)
==24466==    by 0x4F66814: PMPI_Init (init.c:105)
==24466==    by 0x1088BA: main (test_mpiio.c:5)
==24466== 
==24466== 61 bytes in 1 blocks are still reachable in loss record 8 of 10
==24466==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466==    by 0x4017880: _dl_exception_create (dl-exception.c:77)
==24466==    by 0x7BFD250: _dl_signal_error (dl-error-skeleton.c:117)
==24466==    by 0x4009812: _dl_map_object (dl-load.c:2384)
==24466==    by 0x4014EE3: dl_open_worker (dl-open.c:235)
==24466==    by 0x7BFD2DE: _dl_catch_exception (dl-error-skeleton.c:196)
==24466==    by 0x40147C9: _dl_open (dl-open.c:605)
==24466==    by 0x864CF95: dlopen_doit (dlopen.c:66)
==24466==    by 0x7BFD2DE: _dl_catch_exception (dl-error-skeleton.c:196)
==24466==    by 0x7BFD36E: _dl_catch_error (dl-error-skeleton.c:215)
==24466==    by 0x864D734: _dlerror_run (dlerror.c:162)
==24466==    by 0x864D050: dlopen@@GLIBC_2.2.5 (dlopen.c:87)

I haven't tried to reproduce it yet. Does anyone have any clues?

belynam commented 2 years ago

Thank you for reporting this. We have opened an internal Jira ticket to investigate and fix the OPX memory leaks.

tmh97 commented 1 year ago

Hey @hzhou ! We have fixed these memory leaks internally and intend to upstream these fixes in the near future.

tmh97 commented 1 year ago

Hello again @hzhou! We (Cornelis) patched this issue with the release of 1.17.1 and, using valgrind, we are no longer able to reproduce the opx-specific memory leaks reported above.

Would you be willing to confirm that you are no longer seeing the opx-specific leaks (using either main or 1.17.1)?

tmh97 commented 1 year ago

@hzhou reminder

hzhou commented 1 year ago

Sorry about that. Let me test it.

hzhou commented 1 year ago

We are still seeing the leak using main. Log attached. t.log

tmh97 commented 12 months ago

@hzhou Thanks for getting back to me and for retesting! t.log shows we've resolved the originally reported memory leaks, but sadly, it seems we (opx) have introduced some new leaks. We'll take care of these new leaks and then ping here again.

eliekozah commented 10 months ago

Quoting @hzhou: "We are still seeing the leak using main. Log attached. t.log"

Hi @hzhou. Can you share more on how you generated this report file?

Thanks.

eliekozah commented 10 months ago

@hzhou We upstreamed all of our changes addressing these memory leaks in December, and we are not seeing any remaining OPX memory leaks, in particular with Intel MPI or MPICH. We are still working on some Open MPI leaks. Could you confirm which leaks you are seeing on your end?