open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Memory leaks when calling mpi_file_write_all #12677

Open andymwood opened 1 month ago

andymwood commented 1 month ago

I'm getting memory leaks when calling mpi_file_write_all, using Open MPI 4.1.5 built from source downloaded from here. Eventually the process runs out of memory and is killed by the operating system.

Valgrind also reports memory leaks from mpi_init and mpi_finalize, but I suspect it's the ones in mpi_file_write_all that are causing the major problems.

I built Open MPI with Intel 2020.2, like so:

$ export FC=ifort CC=icc
$ ./configure --prefix=/home/andrew/install/Debug/openmpi-4.1.5 --with-psm2 --disable-psm2-version-check --with-slurm --with-pmi=/mnt/beegfs/software/slurm/19.05.2/../prod --enable-mem-debug --enable-debug

System


Details of the problem

The following code, test-leak.f90, reproduces the problem:

program test_leak
   use mpi
   implicit none

   integer, parameter :: nx=10,ng=1,ncopies=4
   integer :: ierr,comm,nproc,rank,fh,sizes(1),subsizes(1),starts(1),subarray,distribution
   integer(kind=mpi_offset_kind) :: loc
   double precision :: array(1-ng:nx+ng)
   character(len=13) :: filename = 'test-leak.dat'
   logical :: exists

   comm = mpi_comm_world

   call mpi_init(ierr)
   call mpi_comm_size(comm,nproc,ierr)
   call mpi_comm_rank(comm,rank,ierr)

   inquire(file=filename,exist=exists)
   if(exists) call mpi_file_delete(filename,mpi_info_null,ierr)
   call mpi_file_open(comm,filename,mpi_mode_create+mpi_mode_wronly,mpi_info_null,fh,ierr)

   sizes = [nx + 2*ng]
   subsizes = [nx]
   starts = [ng]
   call mpi_type_create_subarray(1,sizes,subsizes,starts,mpi_order_fortran,mpi_double_precision,subarray,ierr)
   call mpi_type_commit(subarray,ierr)

   sizes = [nproc*nx]
   starts = [nx*rank]
   call mpi_type_create_subarray(1,sizes,subsizes,starts,mpi_order_fortran,mpi_double_precision,distribution,ierr)
   call mpi_type_commit(distribution,ierr)

   loc = 0
   do loc = 0,(ncopies-1)*nproc*nx*8,ncopies
      call mpi_file_set_view(fh,loc,mpi_byte,distribution,'native',mpi_info_null,ierr)
      call mpi_file_write_all(fh,array,1,subarray,mpi_status_ignore,ierr)
   enddo

   call mpi_type_free(distribution,ierr)
   call mpi_type_free(subarray,ierr)

   call mpi_file_close(fh,ierr)

   call mpi_finalize(ierr)
end program test_leak

I'm using the following CMakeLists.txt to build the executable:

set(CMAKE_BUILD_TYPE Debug CACHE STRING "" FORCE)
set(CMAKE_Fortran_COMPILER ifort CACHE STRING "" FORCE)
set(CMAKE_C_COMPILER icc CACHE STRING "" FORCE)
set(CMAKE_CXX_COMPILER icpc CACHE STRING "" FORCE)
cmake_minimum_required(VERSION 3.24)
project(testompi Fortran)
find_package(MPI REQUIRED)
include_directories(${MPI_Fortran_INCLUDE_PATH})
add_executable(test_ompi test-leak.f90)
target_link_libraries(test_ompi ${MPI_Fortran_LIBRARIES})

Compile and run like so:

$ cmake -B build
$ cmake --build build
$ mpirun --oversubscribe --mca btl ^openib,ofi -n 4 valgrind --num-callers=100 --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --suppressions=/home/andrew/install/Debug/openmpi-4.1.5/share/openmpi/openmpi-valgrind.supp --log-file=valgrind-out-%p.txt ./build/test_ompi

Here's some sample output from valgrind:

==27384== 960 bytes in 240 blocks are definitely lost in loss record 53 of 69
==27384==    at 0x4C29E63: malloc (vg_replace_malloc.c:309)
==27384==    by 0x6C61EA9: opal_malloc (malloc.c:101)
==27384==    by 0x1194353C: ???
==27384==    by 0x180B6883: ???
==27384==    by 0x180AED65: ???
==27384==    by 0x11952030: ??? 
==27384==    by 0x10A130E8: ???
==27384==    by 0x565855A: PMPI_File_write_all (pfile_write_all.c:74)
==27384==    by 0x52F57D6: mpi_file_write_all (pfile_write_all_f.c:82)
==27384==    by 0x409B85: MAIN__ (test-leak.f90:36)
==27384==    by 0x409411: main (in /home/andrew/open-mpi-bugs/memory-leaks/build/test_ompi)
...
==27384== 23,136 bytes in 241 blocks are indirectly lost in loss record 64 of 69
==27384==    at 0x4C29E63: malloc (vg_replace_malloc.c:309)
==27384==    by 0x6C61EA9: opal_malloc (malloc.c:101)
==27384==    by 0x6BF9EC4: opal_datatype_optimize_short (opal_datatype_optimize.c:51)
==27384==    by 0x6BF9EC4: opal_datatype_commit (opal_datatype_optimize.c:310)
==27384==    by 0x5741866: ompi_datatype_commit (ompi_datatype.h:171)
==27384==    by 0x5741866: ompi_coll_base_allgatherv_intra_basic_default (coll_base_allgatherv.c:639)
==27384==    by 0x14D1536A: ???
==27384==    by 0x14D0BE08: ???
==27384==    by 0x180AFA4D: ??? 
==27384==    by 0x11952030: ???
==27384==    by 0x10A130E8: ???
==27384==    by 0x565855A: PMPI_File_write_all (pfile_write_all.c:74)
==27384==    by 0x52F57D6: mpi_file_write_all (pfile_write_all_f.c:82)
==27384==    by 0x409B85: MAIN__ (test-leak.f90:36)
==27384==    by 0x409411: main (in /home/andrew/open-mpi-bugs/memory-leaks/build/test_ompi)
...
==27384== LEAK SUMMARY:
==27384==    definitely lost: 202,280 bytes in 764 blocks
==27384==    indirectly lost: 183,408 bytes in 968 blocks
==27384==      possibly lost: 1,920 bytes in 2 blocks
==27384==    still reachable: 30,864 bytes in 107 blocks
==27384==         suppressed: 0 bytes in 0 blocks
==27384==    
==27384== ERROR SUMMARY: 59 errors from 59 contexts (suppressed: 0 from 0)

edgargabriel commented 1 month ago

@andymwood thank you for the report, I will have a look over the weekend

wenduwan commented 3 weeks ago

@andymwood The backtrace shows ompi_coll_base_allgatherv_intra_basic_default which creates new datatypes on the fly. This might be a concern.

If you have a chance, could you try the ring allgatherv algorithm which does not involve temporary datatypes with:

export OMPI_MCA_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_allgatherv_algorithm=3

Please let us know if this works.
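
For reference, the same selection can also be made directly on the mpirun command line (this assumes the standard coll_tuned_* MCA parameter names):

$ mpirun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_allgatherv_algorithm 3 ...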

bosilca commented 2 weeks ago

We talked about this today on slack and have some leads. First, the datatype built in the allgatherv is contiguous as indicated by the dump below:

Datatype 0x124e14380[] id -1 size 64 align 8 opal_id 0 length 13 used 1
true_lb 0 true_ub 64 (true_extent 64) lb 0 ub 64 (extent 64)
nbElems 8 loops 0 flags 134 (committed contiguous )-cC-----D--[---][---]
   contain OPAL_LONG:* 
--C---P-DH-[---][---]      OPAL_LONG count 1 disp 0x0 (0) blen 8 extent 64 (size 64)
-------G---[---][---]    OPAL_LOOP_E prev 1 elements first elem displacement 0 size of data 64
Optimized description 
-cC---P-DB-[---][---]      OPAL_LONG count 1 disp 0x0 (0) blen 8 extent 64 (size 64)
-------G---[---][---]    OPAL_LOOP_E prev 1 elements first elem displacement 0 size of data 64

Tracking the code by hand also indicates that the datatype is correctly released in the allgatherv. However, as we only release data based on the object refcount, it is possible that the PML/MTL still holds a reference to the datatype (which would be normal for pending send/recv communications), in which case the datatype will not be released there, but later, once the communications using it complete. @wenduwan recalled that not long ago we had a bug in CM where the refcount of the datatype was not released properly (https://github.com/open-mpi/ompi/pull/12544). That PR is missing from the 4.x branch, so it could be the cause of the memory leaks.
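
To illustrate the lifetime rule described above, here is a minimal stand-alone C sketch of the refcount pattern (not the actual OMPI object code, which uses OBJ_RETAIN/OBJ_RELEASE on the real datatype objects; all names below are made up for illustration):

#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-in for a reference-counted OMPI object. */
typedef struct {
    int  refcount;
    char payload[64];
} fake_datatype_t;

fake_datatype_t *dtype_create(void)
{
    fake_datatype_t *d = malloc(sizeof(*d));
    d->refcount = 1;                  /* the creator holds the first reference */
    return d;
}

void dtype_retain(fake_datatype_t *d)
{
    d->refcount++;                    /* e.g. the PML keeps the datatype alive
                                         for a pending send/recv */
}

void dtype_release(fake_datatype_t *d)
{
    if (0 == --d->refcount) {         /* memory is returned only when the last
                                         reference is dropped */
        free(d);
        printf("datatype actually freed\n");
    } else {
        printf("release deferred, %d reference(s) left\n", d->refcount);
    }
}

int main(void)
{
    fake_datatype_t *d = dtype_create();

    dtype_retain(d);   /* the PML takes a reference for an in-flight message  */
    dtype_release(d);  /* the collective "frees" its temporary datatype: the
                          object survives because the PML still references it */
    dtype_release(d);  /* the communication completes: now it is really freed */
    return 0;
}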

@andymwood can you try to run with the OB1 PML, please? (add --mca pml ob1 to your mpirun)

wenduwan commented 2 weeks ago

I can confirm that applying #12544 on v4.1.x mitigates this leak:

==826106== 23,136 bytes in 241 blocks are indirectly lost in loss record 105 of 115
==826106==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==826106==    by 0x502881C: opal_datatype_optimize_short (opal_datatype_optimize.c:51)
==826106==    by 0x5029C30: opal_datatype_commit (opal_datatype_optimize.c:310)
==826106==    by 0x4EEB2D8: ompi_datatype_commit (ompi_datatype.h:171)
==826106==    by 0x4EEC7CA: ompi_coll_base_allgatherv_intra_basic_default (coll_base_allgatherv.c:639)
==826106==    by 0x7F7DE63: ???
==826106==    by 0x7F78680: ???
==826106==    by 0x14134BB1: ???
==826106==    by 0x7EB2D48: ???
==826106==    by 0x7E9BC97: ???
==826106==    by 0x4E7ED96: PMPI_File_write_all (pfile_write_all.c:74)
==826106==    by 0x48A8B03: mpi_file_write_all (pfile_write_all_f.c:82)
==826106==    by 0x109623: MAIN__ (test_leak.f90:36)
==826106==    by 0x1096DB: main (test_leak.f90:2)

With --mca pml ob1 I do not see additional warnings inside ompi_coll_base_allgatherv_intra_basic_default

wenduwan commented 2 weeks ago

Backport is approved https://github.com/open-mpi/ompi/pull/12772

This should be included in the 4.1.7 release.

edgargabriel commented 2 weeks ago

@wenduwan was that fix back ported to the 5.0 branch as well?

wenduwan commented 2 weeks ago

@edgargabriel Yes both main and v5.0.x have the fix

andymwood commented 2 weeks ago

@andymwood The backtrace shows ompi_coll_base_allgatherv_intra_basic_default which creates new datatypes on the fly. This might be a concern.

If you have a chance, could you try the ring allgatherv algorithm which does not involve temporary datatypes with:

export OMPI_MCA_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_allgatherv_algorithm=3

Please let us know if this works.

Thanks for the responses.

Setting these environment variables has no effect on the test problem I supplied.

andymwood commented 2 weeks ago

@andymwood can you try to run with the OB1 PML, please? (add --mca pml ob1 to your mpirun)

This option reduces the number of leaks from mpi_file_write_all in the short test case to just one:

==11830== 960 bytes in 240 blocks are definitely lost in loss record 44 of 54
==11830==    at 0x4C29E63: malloc (vg_replace_malloc.c:309)
==11830==    by 0x15F61CCD: ???
==11830==    by 0x169EC5FE: ???
==11830==    by 0x169EAE0D: ???
==11830==    by 0x15F697A5: ???
==11830==    by 0x15D5695E: ???
==11830==    by 0x5564388: PMPI_File_write_all (in /home/andrew/install/Release/openmpi-4.1.5/lib/libmpi.so.40.30.5)
==11830==    by 0x52CF270: PMPI_FILE_WRITE_ALL (in /home/andrew/install/Release/openmpi-5.1.5/lib/libmpi_mpifh.so.40.30.0)
==11830==    by 0x409C05: MAIN__ (test-leak.f90:36)
==11830==    by 0x409491: main (in /home/andrew/open-mpi-bugs/memory-leaks/build/test_ompi)
...
==11830== LEAK SUMMARY:
==11830==    definitely lost: 23,863 bytes in 290 blocks
==11830==    indirectly lost: 560 bytes in 4 blocks
==11830==      possibly lost: 1,920 bytes in 2 blocks
==11830==    still reachable: 30,512 bytes in 107 blocks
==11830==         suppressed: 0 bytes in 0 block

The full problem that I'm running takes much longer. I'll report back later.

andymwood commented 2 weeks ago

Adding --mca pml ob1 to the mpirun command line doesn't help with my main problem. It still crashes at the same place leaving an out-of-memory error in the system logs.

ggouaillardet commented 1 week ago

@andymwood

Can you please give the inline patch below a try? Another option could be to force using ROMIO instead of OMPIO (`mpirun --mca io ^ompio ...`).

diff --git a/ompi/mca/common/ompio/common_ompio_aggregators.c b/ompi/mca/common/ompio/common_ompio_aggregators.c
index a6448f97e9..c31cbf5412 100644
--- a/ompi/mca/common/ompio/common_ompio_aggregators.c
+++ b/ompi/mca/common/ompio/common_ompio_aggregators.c
@@ -529,7 +529,7 @@ int mca_common_ompio_set_aggregator_props (struct ompio_file_t *fh,
         }

         fh->f_num_aggrs = fh->f_init_num_aggrs;
-        fh->f_aggr_list = (int*) malloc ( fh->f_num_aggrs * sizeof(int));
+        fh->f_aggr_list = (int*) realloc( fh->f_aggr_list, fh->f_num_aggrs * sizeof(int));
         if (NULL == fh->f_aggr_list ) {
             opal_output (1, "OUT OF MEMORY\n");
             return OMPI_ERR_OUT_OF_RESOURCE;

@edgargabriel I noted mca_common_ompio_set_aggregator_props() is invoked multiple times by MPI_File_write_all(), but at this stage fh->f_aggr_list is never freed, hence the memory leaks. realloc() does the trick, but at first glance I am not convinced this is the right fix.
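
To make the pattern concrete, here is a stripped-down, stand-alone C sketch of what is being described (hypothetical names, not the actual OMPIO structures): every call overwrites the previously malloc'ed pointer, so the old allocation leaks unless it is freed first or realloc'ed as in the patch.

#include <stdlib.h>

/* Hypothetical stand-in for ompio_file_t: a handle caching an
 * aggregator list that is recomputed on every collective write. */
struct file_handle {
    int  num_aggrs;
    int *aggr_list;
};

/* Leaky pattern: the previous allocation is overwritten and lost. */
int set_aggr_props_leaky(struct file_handle *fh, int n)
{
    fh->num_aggrs = n;
    fh->aggr_list = (int *) malloc(n * sizeof(int));   /* old pointer dropped */
    return (NULL == fh->aggr_list) ? -1 : 0;
}

/* Patched pattern: realloc() reuses (or frees) the previous allocation. */
int set_aggr_props_fixed(struct file_handle *fh, int n)
{
    fh->num_aggrs = n;
    fh->aggr_list = (int *) realloc(fh->aggr_list, n * sizeof(int));
    return (NULL == fh->aggr_list) ? -1 : 0;
}

int main(void)
{
    struct file_handle fh = { 0, NULL };

    /* With the leaky version, 999 of the 1000 arrays would never be freed;
     * with the fixed version, at most one allocation is live at a time. */
    for (int i = 0; i < 1000; i++) {
        set_aggr_props_fixed(&fh, 4);
    }
    free(fh.aggr_list);
    return 0;
}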

edgargabriel commented 1 week ago

@ggouaillardet Thank you, I think you are on the right track. mca_common_ompio_set_aggregator_props() is only invoked once per MPI_File_write_all() per process (or at least it should be). However, I think you are correct that we are not freeing that array at the end of the operation, so if the code is calling File_write_all() multiple times, we will lose memory there. I will look into it in more detail tomorrow.

Thanks again!

andymwood commented 1 week ago

I've created another suppression file, ompi-supp.txt, to suppress what I think are false positives, or leaks that don't matter.

@andymwood

Another option could be to force using ROMIO instead of OMPIO (`mpirun --mca io ^ompio ...`).

With the attached suppression file and this option, the leaks are reduced to this:

==7355== 8 bytes in 1 blocks are definitely lost in loss record 7 of 59
==7355==    at 0x4C29E63: malloc (vg_replace_malloc.c:309)
==7355==    by 0x6C61EA9: opal_malloc (malloc.c:101)
==7355==    by 0x10A14CF8: ??? 
==7355==    by 0x5783B29: query_2_0_0 (io_base_file_select.c:415)
==7355==    by 0x5783B29: query (io_base_file_select.c:394)
==7355==    by 0x5783B29: check_one_component (io_base_file_select.c:354)
==7355==    by 0x5783B29: check_components (io_base_file_select.c:321)
==7355==    by 0x5781B7F: mca_io_base_file_select (io_base_file_select.c:155)
==7355==    by 0x55B0E53: ompi_file_open (file.c:135)
==7355==    by 0x564D087: PMPI_File_open (pfile_open.c:109)
==7355==    by 0x52F41AE: mpi_file_open (pfile_open_f.c:88)
==7355==    by 0x4095A4: MAIN__ (test-leak.f90:20)
==7355==    by 0x409411: main (in /home/andrew/open-mpi-bugs/memory-leaks/build/test_ompi)
==7355== 
==7355== LEAK SUMMARY:
==7355==    definitely lost: 8 bytes in 1 blocks
==7355==    indirectly lost: 0 bytes in 0 blocks
==7355==      possibly lost: 0 bytes in 0 blocks
==7355==    still reachable: 0 bytes in 0 blocks
==7355==         suppressed: 75,766 bytes in 145 blocks
==7355== 
==7355== For counts of detected and suppressed errors, rerun with: -v
==7355== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 56 from 56) 

I'm currently running the full problem to see if that fixes the out-of-memory error.

I'll see what effect the suggested patch has later.

andymwood commented 1 week ago

@andymwood

Can you please give the inline patch below a try? Another option could be to force using ROMIO instead of OMPIO (`mpirun --mca io ^ompio ...`).

diff --git a/ompi/mca/common/ompio/common_ompio_aggregators.c b/ompi/mca/common/ompio/common_ompio_aggregators.c
index a6448f97e9..c31cbf5412 100644
--- a/ompi/mca/common/ompio/common_ompio_aggregators.c
+++ b/ompi/mca/common/ompio/common_ompio_aggregators.c
@@ -529,7 +529,7 @@ int mca_common_ompio_set_aggregator_props (struct ompio_file_t *fh,
         }

         fh->f_num_aggrs = fh->f_init_num_aggrs;
-        fh->f_aggr_list = (int*) malloc ( fh->f_num_aggrs * sizeof(int));
+        fh->f_aggr_list = (int*) realloc( fh->f_aggr_list, fh->f_num_aggrs * sizeof(int));
         if (NULL == fh->f_aggr_list ) {
             opal_output (1, "OUT OF MEMORY\n");
             return OMPI_ERR_OUT_OF_RESOURCE;

@edgargabriel I noted mca_common_ompio_set_aggregator_props() is invoked multiple times by MPI_File_write_all(), but at this stage fh->f_aggr_list is never freed, hence the memory leaks. realloc() does the trick, but at first glance I am not convinced this is the right fix.

I've now tried this patch and it doesn't have any effect on the test case I provided.

ggouaillardet commented 1 week ago

@andymwood this is odd.

With the patch and the program you initially provided:

mpirun --mca pml ob1 --mca btl self,vader -np 4 ~/local/valgrind-3.23.0/bin/valgrind --leak-check=full --show-leak-kinds=all ./leak 2>&1 | grep 'definitely lost:'
==2456453==    definitely lost: 187 bytes in 5 blocks
==2456452==    definitely lost: 161 bytes in 3 blocks
==2456455==    definitely lost: 187 bytes in 5 blocks
==2456454==    definitely lost: 187 bytes in 5 blocks

Without the patch:

mpirun --mca pml ob1 --mca btl self,vader -np 4 ~/local/valgrind-3.23.0/bin/valgrind --leak-check=full --show-leak-kinds=all ./leak 2>&1 | grep 'definitely lost:'
==2923170==    definitely lost: 1,121 bytes in 243 blocks
==2923173==    definitely lost: 1,147 bytes in 245 blocks
==2923171==    definitely lost: 1,147 bytes in 245 blocks
==2923172==    definitely lost: 1,147 bytes in 245 blocks

FWIW, I run on a Lustre filesystem.

Note the leak is in io/ompio, so you can `mpirun --mca io ompio ...` to make sure it is used instead of ROMIO.

You can configure with --enable-debug --disable-dlopen ... so Valgrind will display the full stack pointing to the memory leak.
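
For example, combining the configure line from the original report with --disable-dlopen would look like:

$ ./configure --prefix=/home/andrew/install/Debug/openmpi-4.1.5 --with-psm2 --disable-psm2-version-check --with-slurm --with-pmi=/mnt/beegfs/software/slurm/19.05.2/../prod --enable-mem-debug --enable-debug --disable-dlopen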

ggouaillardet commented 1 week ago

@edgargabriel I ran the reproducer under the debugger, and indeed mca_common_ompio_set_aggregator_props() is invoked once per call to MPI_File_write_all(). Since fh->f_aggr_list is only free'd in MPI_File_close(), it gets overwritten on every call, hence the memory leak. One way to plug the leak would be to free it when leaving MPI_File_write_all(), but this is suboptimal since MPI_File_write_all() is generally called more than once per file; realloc() does not look that bad performance-wise.

My understanding is that, for a given file, the outcome of mca_common_ompio_set_aggregator_props() depends only on the size of the data to be written. My gut feeling is that it is common to invoke MPI_File_write_all() multiple times with the same data size (e.g. count * datatype size), so a possibly better option would be to cache the previous data size and skip the call to mca_common_ompio_set_aggregator_props() when writing the same data size again.
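
A rough stand-alone sketch of that caching idea, with hypothetical names (it does not account for the file view being changed between calls):

#include <stddef.h>
#include <stdlib.h>

/* Hypothetical file handle remembering the data size used on the previous
 * collective write so the aggregator layout can be reused. */
struct file_handle {
    size_t cached_bytes;   /* 0 means "not computed yet" */
    int    num_aggrs;
    int   *aggr_list;
};

/* Placeholder for the (comparatively expensive) aggregator selection. */
void recompute_aggregators(struct file_handle *fh, size_t bytes)
{
    (void) bytes;          /* a real implementation would use the size */
    fh->num_aggrs = 4;     /* stand-in value */
    fh->aggr_list = (int *) realloc(fh->aggr_list, fh->num_aggrs * sizeof(int));
}

/* Skip the recomputation when the per-call data size is unchanged. */
void maybe_set_aggregator_props(struct file_handle *fh, size_t bytes)
{
    if (bytes == fh->cached_bytes && NULL != fh->aggr_list) {
        return;            /* same size as last time: reuse the layout */
    }
    recompute_aggregators(fh, bytes);
    fh->cached_bytes = bytes;
}

int main(void)
{
    struct file_handle fh = { 0, 0, NULL };

    /* 100 identical writes: the aggregators are only computed once. */
    for (int i = 0; i < 100; i++) {
        maybe_set_aggregator_props(&fh, 10 * sizeof(double));
    }
    free(fh.aggr_list);
    return 0;
}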

edgargabriel commented 1 week ago

@ggouaillardet you are technically correct, and I am fundamentally ok with the realloc() solution that you suggest. Note that, e.g., a similar parameter on the file handle (fh->procs_in_group) is freed at the end of the write_all routine (see https://github.com/open-mpi/ompi/blob/main/ompi/mca/fcoll/vulcan/fcoll_vulcan_file_write_all.c#L738), so not freeing fh->f_aggr_list is really just an oversight.

I would however refrain from trying to cache the values from one call to another, since the user is technically allowed to set a different file view between subsequent calls. We can investigate whether something along those lines could be done, but I would suggest doing that separately, outside of this patch. I would like to make sure that we do not have any unintended side effects.

edgargabriel commented 1 day ago

@ggouaillardet would you like to file a PR with the fix, or shall I go ahead and do that?

ggouaillardet commented 1 day ago

@edgargabriel please go ahead, thanks!