sstsimulator / sst-macro

SST Macro Element Library
http://sst-simulator.org/

Segmentation fault during MPI_Scatter with large buffer #619

Closed afranques closed 3 years ago

afranques commented 3 years ago

Hello! I was trying to characterize the performance of MPI_Scatter over different message sizes using SST-Macro with the integrated SST-Core, and I realized that the simulator crashes when the send buffer is larger than about 120 KB (roughly 30,000 MPI_INT elements of 4 bytes each).

The application I am running is the following (which I call main.cc), and it consists of a simple MPI_Scatter call with a few bells and whistles around it:

#include <mpi.h>
#include <stdio.h>

// We have 1536 ranks (24 cores per node, 4x4x4 nodes)
#define NRANKS 1536
#define NELEMS_PER_RANK 20
#define NELEMS (NRANKS*NELEMS_PER_RANK)
#define ROOT 0

int main(int argc, char** argv) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int nelems = NELEMS; // total number of elements (each of 4 bytes) we have in the send buffer
  int sendbuf[NELEMS];
  int recvbuf[NELEMS_PER_RANK];

  // The root fills the send buffer with the values we want to scatter
  if (rank == ROOT){
    for (int i=0; i < nelems; ++i){
      // This is done so that all elements sent to each rank have value equal to its rank number
      // Example: if NELEMS_PER_RANK = 4 --> sendbuf = [0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 ...]             
      sendbuf[i] = i/NELEMS_PER_RANK;
    }

    printf("Size of 1 element: %zu bytes\n", sizeof(sendbuf[0]));
    printf("Number of elements sent per rank: %d (%zu bytes)\n", NELEMS_PER_RANK, NELEMS_PER_RANK*sizeof(sendbuf[0]));
    printf("Size of sendbuf: %zu bytes (%d elements)\n", sizeof(sendbuf), NELEMS);
  }

  // We scatter NELEMS_PER_RANK to each rank (of MPI_INT type each element) 
  // from ROOT's sendbuf to each rank's recvbuf
  MPI_Scatter(sendbuf, NELEMS_PER_RANK, MPI_INT, recvbuf, NELEMS_PER_RANK, MPI_INT, ROOT, MPI_COMM_WORLD);

  // All ranks check they received the data they were supposed to
  for (int i=0; i < NELEMS_PER_RANK; i++) {
    if (rank != recvbuf[i]) {
      printf("Rank %d has NOT received the expected data (instead received %d)\n", rank, recvbuf[i]);
    }
  }

  if (rank == ROOT){
    printf("Rank %d finished at t = %8.4f ms\n", ROOT, MPI_Wtime()*1e3);
  }

  MPI_Finalize();

  if (rank == ROOT){
    printf("Rank %d passed MPI_Finalize\n", ROOT);
  }

  return 0;
}

This is the Makefile I use:

TARGET := runscatter
SRC := main.cc

CXX :=    sst++
CC :=     sstcc
CXXFLAGS := -fPIC
CPPFLAGS := -I.
LIBDIR :=  
PREFIX := 
LDFLAGS :=  -Wl,-rpath,$(PREFIX)/lib

OBJ := $(SRC:.cc=.o) 
OBJ := $(OBJ:.cpp=.o)
OBJ := $(OBJ:.c=.o)

.PHONY: clean install 

all: $(TARGET)

$(TARGET): $(OBJ) 
    $(CXX) -o $@ $+ $(LDFLAGS) $(LIBS)  $(CXXFLAGS)

%.o: %.cc 
    $(CXX) $(CPPFLAGS) $(CXXFLAGS) -c $< -o $@

%.o: %.c
    $(CC) $(CPPFLAGS) $(CFLAGS) -c $< -o $@

clean: 
    rm -f $(TARGET) $(OBJ) 

install: $(TARGET)
    cp $< $(PREFIX)/bin

This is the parameters file I use (which I name parameters.ini):

include small_torus.ini

node {
    app1 {
        # 24 cores per node, 4x4x4 nodes in total
        launch_cmd = aprun -n 1536 -N 24 
        exe=./runscatter
    }
}

It uses the small_torus.ini configuration that comes with the simulator.
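For reference, a torus topology block in SST-Macro's .ini syntax looks roughly like the sketch below. This is just an illustration of the parameter format rather than the actual contents of small_torus.ini, so treat the geometry value as an assumption and refer to the file shipped with the simulator for the real configuration.

# Illustrative sketch only; not the real small_torus.ini that ships with SST-Macro.
# A 4x4x4 torus matching the 4x4x4 node count used in the launch_cmd above:
topology {
    name = torus
    geometry = 4 4 4
}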

I have all these files in the same folder, so I just compile by typing:

$> make

and I run from inside this same folder with:

$> pysstmac -f parameters.ini

The output I get is:

/.../local/sstmacro-11.0.0/bin/pysstmac: line 7: 1653328 Segmentation fault      /.../local/sstcore-10.0.0/bin/sst /.../local/sstmacro-11.0.0/include/python/default.py --model-options="$options"

As mentioned at the beginning, this segmentation fault happens when NELEMS_PER_RANK is above 20. With 1536 ranks in total and 4 bytes per element, that means it happens once sendbuf grows beyond roughly 1536 x 20 x 4 = 122,880 bytes (about 120 KB). When NELEMS_PER_RANK is 20 or lower, the simulation finishes successfully.

Any help will be highly appreciated!

Thanks, Antonio

sknigh commented 3 years ago

@afranques, could you tell me which branch you are using?

afranques commented 3 years ago

Hello @sknigh! I am using the master branch.

Thanks, Antonio

afranques commented 3 years ago

@sknigh, I tried allocating the array dynamically (using malloc) instead of statically, and it now works! I still don't know why large statically-allocated arrays cause the segmentation fault in the simulator but not when the program runs natively on the machine, but at least it works (for now) when the array is dynamically allocated. I am attaching the new code here in case it helps anyone in the future ;-)

Thanks to anyone who took a look at my post!

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h> // for malloc, free, exit

// We have 1536 ranks (24 cores per node, 4x4x4 nodes)
#define NRANKS 1536
#define NELEMS_PER_RANK 20
#define NELEMS (NRANKS*NELEMS_PER_RANK)
#define ROOT 0

int main(int argc, char** argv) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int nelems = NELEMS; // total number of elements (each of 4 bytes) we have in the send buffer
  int* sendbuf;
  int* recvbuf;

  // Only the root needs a properly sized send buffer
  if (rank == ROOT) {
    sendbuf = (int*) malloc(NELEMS * sizeof(int));
  }
  // All other ranks simply set up a dummy send buffer with one element (which won't be used)
  else {
    sendbuf = (int*) malloc(1 * sizeof(int));
  }

  recvbuf = (int*) malloc(NELEMS_PER_RANK * sizeof(int));

  // Check if the memory has been successfully allocated by malloc or not
  if (sendbuf == NULL || recvbuf == NULL) {
      printf("Rank %d: Memory not allocated.\n", rank);
      exit(0);
  }

  // The root fills the send buffer with the values we want to scatter
  if (rank == ROOT){
    for (int i=0; i < nelems; ++i){
      // If NELEMS_PER_RANK = 4 --> sendbuf = [0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3]
      // This is done so that each rank will send a group of elements all with same value (so receiver can check after scatter)            
      sendbuf[i] = i/NELEMS_PER_RANK;
      // printf("Root: sendbuf[%d]=%d\n", i, sendbuf[i]);
    }

    printf("Size of 1 element: %zu bytes\n", sizeof(sendbuf[0]));
    printf("Number of elements sent per rank: %d (%zu bytes)\n", NELEMS_PER_RANK, NELEMS_PER_RANK*sizeof(sendbuf[0]));
    printf("Size of sendbuf: %zu bytes (%d elements)\n", NELEMS*sizeof(sendbuf[0]), NELEMS);
  }

  // Call is same for root and others, so all ranks call the same function
  // We scatter NELEMS_PER_RANK to each rank (of MPI_INT type each element) from ROOT's sendbuf to each rank's recvbuf
  MPI_Scatter(sendbuf, NELEMS_PER_RANK, MPI_INT, recvbuf, NELEMS_PER_RANK, MPI_INT, ROOT, MPI_COMM_WORLD);

  // All ranks check they received the data they were supposed to
  for (int i=0; i < NELEMS_PER_RANK; i++) {
    if (rank != recvbuf[i]) {
      printf("Rank %d has NOT received the expected data (instead received recvbuf[%d]=%d)\n", rank, i, recvbuf[i]);
    }
  }

  if (rank == ROOT){
    printf("Rank %d finished at t = %8.4f ms\n", ROOT, MPI_Wtime()*1e3);
  }

  MPI_Finalize();

  if (rank == ROOT){
    printf("Rank %d passed MPI_Finalize\n", ROOT);
  }

  free(sendbuf);
  free(recvbuf);
  return 0;
}

jpkenny commented 3 years ago

Yes, I'm sure you were overrunning the stack size in the user-space threads that macro uses to model application tasks. The stack size can be increased, but malloc is probably the better way to go. Sorry we weren't able to get to the issue sooner, people are spread pretty thin at the moment...
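If you'd rather keep the statically-allocated buffers, something along these lines in the parameters file should raise the user-space thread stack size. I'm writing the parameter name and its placement under node.os from memory, so treat both as assumptions and double-check against the manual for your SST-Macro version:

include small_torus.ini

node {
    os {
        # Assumed parameter name/placement; verify against your SST-Macro version.
        # Gives each simulated rank a larger stack so big automatic arrays fit.
        stack_size = 1MB
    }
    app1 {
        launch_cmd = aprun -n 1536 -N 24
        exe = ./runscatter
    }
}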

afranques commented 3 years ago

Ahh, I see, that makes sense. Thank you very much for your reply, @jpkenny! And please don't be sorry at all, we are lucky to have you even look at these things! I should have tried the malloc earlier; bugs always seem so obvious in retrospect haha