upperwal / EntangledMPI

Fault Tolerance framework for High Performance Computing [Supports ULFM, replication and checkpointing]
MIT License

Pointer passing in functions will fail replication and checkpoint #5

Closed: upperwal closed this issue 6 years ago

upperwal commented 6 years ago

Pointers passed to functions are stored on the stack. When the stack is copied to a replica, the values of these pointers (i.e. the memory addresses returned by malloc) are copied as well, but those addresses are not necessarily allocated in the replica.

Dereferencing such a pointer in the replica will cause a segmentation fault.
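The problem can be illustrated with a minimal sketch inside a single process. The `struct frame` type and the function names are hypothetical stand-ins for a real stack frame and are not part of EntangledMPI; the point is only that a byte-wise copy carries the raw heap address along without creating any heap block for the copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a stack frame that holds a heap pointer. */
struct frame {
    int *buf;    /* address returned by malloc in the original process */
    int  count;
};

/* Byte-wise copy, as a stack replication step would do. Only the pointer
   value is copied; no heap block is created for the destination. */
void copy_frame(struct frame *dst, const struct frame *src) {
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof *dst);
}

/* Returns 1 if the copied frame carries the same raw heap address,
   -1 if the allocation failed. */
int replica_shares_address(void) {
    struct frame original = { malloc(sizeof(int)), 1 };
    if (original.buf == NULL) return -1;
    original.buf[0] = 9;

    struct frame replica;
    copy_frame(&replica, &original);

    /* In a separate replica process this address need not be mapped at all,
       so dereferencing replica.buf there is what would fault. */
    int same = (replica.buf == original.buf);

    free(original.buf);
    return same;
}
```

Within one process the copied pointer is still valid, so nothing crashes here; in a real replica the same address may belong to no allocation at all.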

Possible Solution: Use mmap to allocate heap memory at exactly the same addresses as in the original process.

upperwal commented 6 years ago

Tried different implementations of malloc, but none of them worked. There is no solution to this problem right now.

mmap does not work either, as it could not be relied on to allocate the exact same memory address.
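One way this can happen is sketched below, assuming a Linux or similar POSIX system where `MAP_ANONYMOUS` is available. The function names are hypothetical, and this is an illustration of the general behaviour rather than a reproduction of what was actually tried: without `MAP_FIXED`, the address passed to `mmap` is only a hint, so if that range is already occupied in the replica, a different address comes back.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one anonymous page, passing `hint` as the preferred address.
   Without MAP_FIXED the kernel treats `hint` only as a suggestion.
   Returns NULL on failure. */
void *map_page_with_hint(void *hint) {
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Returns 1 if asking for an already-occupied address yields a different
   address, 0 if it yields the same one, -1 if a mapping failed. */
int hint_ignored_when_occupied(void) {
    size_t len = (size_t)sysconf(_SC_PAGESIZE);

    void *first = map_page_with_hint(NULL);
    if (first == NULL) return -1;

    /* Request the address that `first` already occupies. */
    void *second = map_page_with_hint(first);
    if (second == NULL) {
        munmap(first, len);
        return -1;
    }

    int differs = (second != first);
    assert(differs);  /* two live mappings cannot share an address */

    munmap(second, len);
    munmap(first, len);
    return differs;
}
```

`MAP_FIXED` would force the requested address, but it silently replaces whatever is already mapped there, which is unsafe if the replica's allocator is already using that range.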

One solution would be to write our own implementation of malloc, but that won't be as efficient as the existing one. We could also fork the existing malloc (ptmalloc2) and change it accordingly, but even this isn't easy.

upperwal commented 6 years ago

Feasible Solution: Pass the address of the container (the pointer variable) instead of the pointer value (the address returned by malloc), and use * to get back the address returned by malloc.

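int *buf;                         // container: the pointer variable itself
rep_malloc(&buf, sizeof(int));    // the allocator receives the container's address
buf[0] = 9;

// Pass the address of the container, not the address returned by malloc.
MPI_Send(&buf, ...);
// instead of MPI_Send(buf, ...);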
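
Why the extra level of indirection helps can be sketched as follows. All function names are hypothetical and only mimic the shape of `rep_malloc(&buf, size)`; the sketch also assumes that, after replication, only the content of the container is rewritten to point at the newly allocated block. Because the wrapper dereferences the container at call time, it always finds the current heap block even if its address has changed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator in the style of rep_malloc(&buf, size):
   stores the malloc result in the caller's pointer variable. */
int container_malloc(void **container, size_t size) {
    assert(container != NULL);
    *container = malloc(size);
    return *container != NULL ? 0 : -1;
}

/* Hypothetical stand-in for a send wrapper: it receives the container's
   address and dereferences it at call time to find the current block. */
void send_from_container(void **container, void *wire, size_t size) {
    assert(container != NULL && *container != NULL);
    memcpy(wire, *container, size);
}

/* Simulates a replica that re-creates the block at a different address,
   then sends through the same container. Returns the value sent,
   or -1 if an allocation failed. */
int send_after_reallocation(void) {
    void *buf = NULL;
    if (container_malloc(&buf, sizeof(int)) != 0) return -1;
    ((int *)buf)[0] = 9;

    /* The block moves; only the container's content is updated. */
    void *moved = malloc(sizeof(int));
    if (moved == NULL) { free(buf); return -1; }
    memcpy(moved, buf, sizeof(int));
    free(buf);
    buf = moved;

    int wire = 0;
    send_from_container(&buf, &wire, sizeof wire);

    free(buf);
    return wire;
}
```

A raw pointer captured before the move would have been left dangling, whereas the container address stays valid because the container lives on the stack that is copied.
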
upperwal commented 6 years ago

Done. Refer to commit f6014034d3c6a4db735bdc90a5478444a2684a61

Added two variables that the user should set (or that the custom compiler will set automatically) to tell EntangledMPI whether a container address or a normal buffer address is passed to an MPI_* function.

One variable describes the sender buffer and the other describes the receiver buffer.

https://github.com/upperwal/EntangledMPI/blob/188004cb4f1bf91c3a292d2f5dda9be23c0d5000/src/mpi/init.c#L47-L53

These can be used as shown below:

For a normal buffer: https://github.com/upperwal/EntangledMPI/blob/188004cb4f1bf91c3a292d2f5dda9be23c0d5000/test/rep_collective_test.c#L114-L118

To send via a pointer to the variable (i.e. pass the address of the pointer instead of the buffer address): https://github.com/upperwal/EntangledMPI/blob/188004cb4f1bf91c3a292d2f5dda9be23c0d5000/test/rep_collective_test.c#L132-L135
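
The idea of the two variables can be sketched roughly as below. The names `send_is_container`, `recv_is_container`, `resolve` and `transfer` are hypothetical and are not taken from the linked source; `transfer` is a local stand-in for a send/receive pair rather than a real MPI call. Each flag tells the wrapper whether the corresponding argument is a plain buffer or the address of a pointer that must be dereferenced first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flags; the real variables are defined in the linked init.c
   and their names are not reproduced here. */
int send_is_container = 0;
int recv_is_container = 0;

/* Resolve the buffer actually used: either the argument itself or, when
   flagged as a container, the pointer stored at that address. */
void *resolve(void *arg, int is_container) {
    return is_container ? *(void **)arg : arg;
}

/* Hypothetical local stand-in for a send/receive pair. */
void transfer(void *sendarg, void *recvarg, size_t size) {
    assert(sendarg != NULL && recvarg != NULL);
    memcpy(resolve(recvarg, recv_is_container),
           resolve(sendarg, send_is_container), size);
}

/* Normal buffers on both sides: the arguments are used as given. */
int demo_normal_buffer(void) {
    int value = 7, out = 0;
    send_is_container = 0;
    recv_is_container = 0;
    transfer(&value, &out, sizeof out);
    return out;
}

/* Sender passes the address of a pointer, so the send flag is set. */
int demo_container_send(void) {
    int value = 9, out = 0;
    void *container = &value;
    send_is_container = 1;
    recv_is_container = 0;
    transfer(&container, &out, sizeof out);
    send_is_container = 0;  /* restore the default for later calls */
    return out;
}
```

Keeping separate flags for the two sides lets one call mix a container on the sending side with a normal buffer on the receiving side, or the reverse.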