sstsimulator / sst-macro

SST Macro Element Library
http://sst-simulator.org/

Memory leak occurs when sending eager0 messages in pt2pt #702

Open qizhi45 opened 8 months ago

qizhi45 commented 8 months ago

Memory leak occurs when sending eager0 messages in pt2pt in the sstmacro standalone mode

1 - Detailed description of problem or enhancement

When MPI_Send and MPI_Recv are called to send and receive messages point-to-point, and the eager0 protocol is selected for the send, a memory leak occurs. My preliminary analysis of the cause: when a message is sent, Eager0::start() allocates the smsgbuffer with new by calling fillSendBuffer(). When NetworkMessage::putOnWire() executes, it allocates a new wirebuffer and copies smsgbuffer into it, but the code does not free smsgbuffer; it simply sets the pointer to nullptr. At that point the earlier allocation has no pointer managing it, so it leaks. The memory released in the NetworkMessage destructor is actually the wirebuffer.
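To make the suspected pattern concrete, here is a minimal, self-contained sketch of the ownership flow as I understand it. The class and member names below are stand-ins I wrote to mirror fillSendBuffer()/putOnWire(); they are not the real sstmacro code:

#include <cstddef>
#include <cstring>

// Stand-in for NetworkMessage: names and structure are approximate,
// only the ownership pattern is meant to match.
struct FakeMessage {
  void* smsg_buffer_ = nullptr;   // filled by fillSendBuffer() via new
  void* wire_buffer_ = nullptr;
  std::size_t bytes_ = 0;

  void fillSendBuffer(const void* src, std::size_t n) {
    smsg_buffer_ = new char[n];                  // allocation #1 (eager0 send buffer)
    std::memcpy(smsg_buffer_, src, n);
    bytes_ = n;
  }

  void putOnWire() {
    wire_buffer_ = new char[bytes_];             // allocation #2 (wire copy)
    std::memcpy(wire_buffer_, smsg_buffer_, bytes_);
    // Missing here: delete[] static_cast<char*>(smsg_buffer_);
    smsg_buffer_ = nullptr;                      // allocation #1 now has no owner -> leak
  }

  ~FakeMessage() {
    delete[] static_cast<char*>(wire_buffer_);   // only allocation #2 is ever freed
  }
};

int main() {
  char payload[8191] = {};
  for (int i = 0; i < 100000; ++i) {             // mirrors the modified osu_latency loop count
    FakeMessage m;
    m.fillSendBuffer(payload, sizeof(payload));
    m.putOnWire();
  }                                              // leaks ~819 MB (100000 x 8191 bytes)
  return 0;
}

If this reading is right, freeing smsg_buffer_ inside putOnWire() before it is set to nullptr (or handing the buffer over instead of copying it) should stop the growth.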

In addition, the memory management in this area (network_message.cc, eager0.cc) is a bit confusing to me. There are so many raw new and delete calls that ownership of the allocations gets lost. Could this part be restructured to make ownership explicit? (A sketch of one possible direction follows the heaptrack output below.) Some of the heaptrack analysis output is as follows:

PEAK MEMORY CONSUMERS
1.64G peak memory consumed over 200000 calls from
sumi::MpiProtocol::fillSendBuffer(int, void*, sumi::MpiType*)
  at ../../sumi-mpi/mpi_protocol/mpi_protocol.cc:62
  in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
819.10M consumed over 100000 calls from:
    sumi::Eager0::start(void*, int, int, int, int, sumi::MpiType*, int, long, int, sumi::MpiRequest*)
      at ../../sumi-mpi/mpi_protocol/eager0.cc:71
      in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
    sumi::MpiQueue::send(sumi::MpiRequest*, int, unsigned short, int, int, sumi::MpiComm*, void*)
      at ../../sumi-mpi/mpi_queue/mpi_queue.cc:197
      in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
    sumi::MpiApi::send(void const*, int, unsigned short, int, int, long)
      at ../../sumi-mpi/mpi_api_send_recv.cc:81
      in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
    sstmac_send
      at ../../sumi-mpi/sstmac_mpi.cc:89
      in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
    userSkeletonMain(int, char**)
      in ./osu_latency
    sstmac::sw::App::run()
      at ../../../sstmac/software/process/app.cc:539
      in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
    sstmac::sw::Thread::runRoutine(void*)
      at ../../../sstmac/software/process/thread.cc:141
      in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
    sstmac_make_fcontext
      at ../../../sstmac/software/threading/asm/make_x86_64_sysv_elf_gas.S:49
      in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
819.10M consumed over 100000 calls from:
    sumi::Eager0::start(void*, int, int, int, int, sumi::MpiType*, int, long, int, sumi::MpiRequest*)
      at ../../sumi-mpi/mpi_protocol/eager0.cc:71
      in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
    sumi::MpiQueue::send(sumi::MpiRequest*, int, unsigned short, int, int, sumi::MpiComm*, void*)
      at ../../sumi-mpi/mpi_queue/mpi_queue.cc:197
      in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
    sumi::MpiApi::send(void const*, int, unsigned short, int, int, long)
      at ../../sumi-mpi/mpi_api_send_recv.cc:81
      in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
    sstmac_send
      at ../../sumi-mpi/sstmac_mpi.cc:89
      in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
    userSkeletonMain(int, char**)
      in ./osu_latency
    sstmac::sw::App::run()
      at ../../../sstmac/software/process/app.cc:539
      in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
    sstmac::sw::Thread::runRoutine(void*)
      at ../../../sstmac/software/process/thread.cc:141
      in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
    sstmac_make_fcontext
      at ../../../sstmac/software/threading/asm/make_x86_64_sysv_elf_gas.S:49
      in /home/WorkSpace/test/SSTMacroBuild/lib/libsstmac.so.12
total runtime: 20.91s.
calls to allocation functions: 9589794 (458710/s)
temporary memory allocations: 6111 (292/s)
peak heap memory consumption: 1.68G
peak RSS (including heaptrack overhead): 1.68G
total memory leaked: 1.64G
suppressed leaks: 10.97K
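As one possible direction for the cleanup asked about above: if the message owned its buffers through std::unique_ptr, a hand-off would be an explicit std::move and a dropped buffer would be freed automatically. This is only an illustrative sketch with made-up names, not a patch against network_message.cc:

#include <cstddef>
#include <cstring>
#include <memory>

// Illustrative ownership scheme: the message owns both buffers,
// so nothing has to remember who is responsible for delete.
class OwningMessage {
  std::unique_ptr<char[]> smsg_buffer_;
  std::unique_ptr<char[]> wire_buffer_;
  std::size_t bytes_ = 0;

 public:
  void fillSendBuffer(const void* src, std::size_t n) {
    smsg_buffer_ = std::make_unique<char[]>(n);
    std::memcpy(smsg_buffer_.get(), src, n);
    bytes_ = n;
  }

  void putOnWire() {
    wire_buffer_ = std::make_unique<char[]>(bytes_);
    std::memcpy(wire_buffer_.get(), smsg_buffer_.get(), bytes_);
    smsg_buffer_.reset();   // old buffer is released here instead of leaking
  }
  // No manual delete in the destructor: both unique_ptrs clean up themselves.
};

int main() {
  char payload[8191] = {};
  OwningMessage m;
  m.fillSendBuffer(payload, sizeof(payload));
  m.putOnWire();            // send buffer freed, wire buffer still owned
  return 0;                 // wire buffer freed when m goes out of scope
}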

2 - Describe how to reproduce

git clone https://github.com/sstsimulator/sst-macro.git
cd sst-macro
./configure --prefix=/home/WorkSpace/test/SSTMacroBuild CFLAGS="-fPIC" CXXFLAGS="-fPIC"
make && make install
cd /home/WorkSpace/test/sst-macro/skeletons/osu-micro-benchmarks-5.3.2
cd mpi/pt2pt
vim osu_latency.c
modify line 87 : for(size = 8191; size > 0; size = 0) {
modify line 99 : for(i = 0; i < 100000; i++) {
modify line 110 : for(i = 0; i < 100000; i++) {
/home/WorkSpace/test/SSTMacroBuild/bin/sst++ -o osu_latency osu_latency.c osu_pt2pt.c -I.
heaptrack /home/WorkSpace/test/SSTMacroBuild/bin/sstmac -f parameter.ini
heaptrack --analyze

If heaptrack is not available, you can use another memory leak detection tool, or observe the sstmac memory usage growing in top/htop.
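For reference, this is roughly the traffic pattern the modified benchmark produces. The snippet below is my own minimal reduction (not the OSU source): two ranks ping-pong an 8191-byte message 100000 times, which keeps every send on the eager0 path described above. 100000 sends per rank x 8191 bytes is about 819 MB per rank, matching the two 819.10M entries in the heaptrack output (about 1.64 GB in total).

// Minimal reduction of the modified osu_latency (my own sketch, not the OSU
// source). Two ranks ping-pong an 8191-byte message 100000 times; with the
// parameter.ini below, these sends go through the eager0 protocol.
#include <mpi.h>
#include <string.h>

#define MSG_SIZE 8191
#define ITERS    100000

int main(int argc, char** argv) {
  char s_buf[MSG_SIZE];
  char r_buf[MSG_SIZE];
  int myid = 0;
  MPI_Status reqstat;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  memset(s_buf, 'a', MSG_SIZE);

  for (int i = 0; i < ITERS; i++) {
    if (myid == 0) {
      MPI_Send(s_buf, MSG_SIZE, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
      MPI_Recv(r_buf, MSG_SIZE, MPI_CHAR, 1, 1, MPI_COMM_WORLD, &reqstat);
    } else if (myid == 1) {
      MPI_Recv(r_buf, MSG_SIZE, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &reqstat);
      MPI_Send(s_buf, MSG_SIZE, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
    }
  }
  // 100000 sends per rank x 8191 bytes ~= 819 MB per rank, matching the two
  // 819.10M entries in the heaptrack output above (~1.64 GB total).
  MPI_Finalize();
  return 0;
}

It can be compiled the same way as the benchmark, e.g. sst++ -o eager0_leak eager0_leak.cc (the file name is just an example).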

parameter.ini is the following:

node {
  name = simple
  app1 {
    launch_cmd = aprun -n 2 -N 1
    exe = ./osu_latency
    mpi {
      max_vshort_msg_size = 16384
      max_eager_msg_size  = 16384
      post_header_delay   = 0.25us
      post_rdma_delay     = 0.3us
      rdma_pin_latency    = 0.3us
      rdma_page_delay     = 0.03ns
    }
  }
  proc {
    frequency = 2.6GHz
    ncores = 60
    parallelism = 16
  }
  memory {
    name = pisces
    total_bandwidth = 51.2GB/s
    nchannels = 6
    latency = 12.5ns
    arbitrator = cut_through
    max_single_bandwidth = 51.2GB/s
  }
  nic {
    name = pisces
    negligible_size = 0
    injection {
      mtu = 4096
      arbitrator = cut_through
      bandwidth = 100Gb/s
      latency = 300ns
      credits = 64KB
    }
    ejection {
      mtu = 4096
      arbitrator = cut_through
      bandwidth = 100Gb/s
      latency = 300ns
      credits = 64KB
    }
  }
}
switch {
 router {
   name = fat_tree
 }
 name = pisces
 arbitrator = cut_through
 mtu = 512
 link {
  bandwidth = 1Gb/s
  latency = 130ns
  credits = 64KB
 }
 xbar {
  bandwidth = 16Tb/s
 }
 logp {
  bandwidth = 1Gb/s
  hop_latency = 100ns
  out_in_latency = 60ns
 }
}
topology {
 name = fat_tree 
 concentration = 8
 num_core_switches = 64
 down_ports_per_core_switch = 16
 num_agg_subtrees = 16
 agg_switches_per_subtree = 8
 up_ports_per_agg_switch = 8
 down_ports_per_agg_switch = 8
 leaf_switches_per_subtree = 8
 up_ports_per_leaf_switch = 8
}
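For context on why this configuration hits the leak: as far as I understand the protocol selection, messages no larger than max_vshort_msg_size are sent with the eager0 protocol (larger ones up to max_eager_msg_size use eager1, and anything bigger uses rendezvous), so the 8191-byte payload stays on eager0 for every iteration. An annotated copy of the relevant block follows; the comments are mine:

mpi {
  max_vshort_msg_size = 16384   # 8191-byte messages fall below this threshold,
  max_eager_msg_size  = 16384   # so (as I understand it) every send uses eager0
}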

3 - What Operating system(s) and versions

lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 8.5.2111
Release:        8.5.2111
Codename:       n/a

4 - What version of external libraries (Boost, MPI)

g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --disable-libmpx --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.5.0 20210514 (Red Hat 8.5.0-4) (GCC)

5 - Provide sha1 of all relevant sst repositories (sst-core, sst-elements, etc)

SSTMAC repo: c30a5ce6e03220aa586ca8da5e137ac23dda0bef

6 - Fill out Labels, Milestones, and Assignee fields as best possible