upperwal / EntangledMPI

Fault Tolerance framework for High Performance Computing [Supports ULFM, replication and checkpointing]
MIT License
2 stars 1 forks source link
checkpoint checkpoint-restart fault-injection fault-tolerance mpi replication ulfm

Fault Tolerance Framework for High Performance Computing

Build Status

This project is a fault tolerance framework for parallel applications. Below is a list of supported features by this framework.

This framework is still under development phase and not production ready. So please use it carefully.

# Features Status
1 ULFM Enabled master (OK)
2 Process Replication master (OK)
3 Full-context Application Level Checkpointing master (OK)
4 Fault Injector master (OK)
5 Process Manager master (OK)
6 Custom Compiler (around mpicc) wip
7 User Level Checkpointing future

MPI Functions Supported

# MPI Function Status
1 MPI_Init master (OK)
2 MPI_Finalize master (OK)
3 MPI_Barrier master (OK)
4 MPI_Comm_rank master (OK)
5 MPI_Comm_size master (OK)
6 MPI_Send master (OK)
7 MPI_Recv master (OK)
8 MPI_Scatter master (OK)
9 MPI_Gather master (OK)
10 MPI_Bcast master (OK)
11 MPI_Allgather master (OK)
12 MPI_Reduce master (OK)
13 MPI_Allreduce master (OK)
14 MPI_Isend master (OK)
15 MPI_Irecv master (OK)
16 MPI_Wait master (OK)
17 MPI_* (other async calls) future

API

MPI program's are supported by default but you need to link your program using libreplication.so after installing EntangledMPI.

Addition to this malloc and free wrappers are avalaible which should be used instead of malloc and free and a function to copy addresses from one pointer to another.

Example

int *ptr, *dup_ptr;
rep_malloc(&ptr, sizeof(int));

*ptr = 10;

// Source -> Dest
rep_assign_malloc_context(&ptr, &dup_ptr);

rep_free(&ptr);
ptr = NULL;
dup_ptr = NULL;

Building EntangledMPI

Requirement

  1. Autoconf >= 2.52
  2. Automake >= 1.6.0
  3. Libtool >= 2.4.2
  4. MPI Compiler (mpicc) with ULFM support
./autogen.sh
mkdir build && cd build
../configure --disable-stack-protector [CC=mpicc] [--prefix=<directory>]
# CC=mpicc is optional but might be required in some situations
make [-j N]
# use an integer value of N for parallel builds
make install

Note: $GOPATH should be set to '<ROOT>/EntangledMPI'

Usage

This framework will need a "replication.map" file which will contain the rank mappings. This could be generated using manager command given with this framework.

Using manager

# use 'manager --help' for options
manager -j 2 -r 3

Above command will generate the below file. This command should be executed in the directory where your executable reside..

3   2
1   0   1   0
1   1   2   2   1
[first line: <TOTAL_CORES><TAB><NO_OF_JOBS>] 
[followed by (in each line): <UPDATE_BIT><TAB><JOB_ID><TAB><NO_OF_WORKERS><TAB><ORIGINAL_RANK_1><TAB><ORIGINAL_RANK_2><TAB>...]
<JOB_ID> starts with 0
<ORIGINAL_RANK_*> starts with 0

Fault Injector

fault_injector command can be used to induce fault at different time intervals governed by different distributions. fault_injector will require replication.map generated by manager and network.stat generated by EntangledMPI during MPI_Init.

network.stat is used to identify process ID and hostname.

Checkpointing

You need to create a directory named "ckpt" in the directory where your executable reside to store checkpointing files. This will be taken care in the future by the process manager.

Checkpoint is automatically created when manager updates replication.map file. EntangledMPI automatically check for ckpt/rank-*.ckpt file. If it exist it loads the checkpoint into memory and starts the program from that point.

So you can run your program normally as you did without checkpoint.

Examples and Tutorials

This framework is compatible with all MPI programs you only need to replace all your malloc calls to rep_malloc and free to rep_free and link your program with this library.

You can also view the test directory for some example MPI programs.

IMPORTANT NOTE:

This framework will only run if the program is compiled dynamically i.e. using "-dynamic"flag, with stack protection disabled i.e. using "-fno-stack-protector" flag and disabled ASLR (Address space layout randomization) using "sudo sysctl kernel.randomize_va_space=0" to disable ASLR temporarily.

Stackoverflow help