Hi Lenny,
Thanks for your interest in FTI. Yes, your description of the four checkpoint levels of FTI is accurate. One thing to add: the local checkpoint does not need to be written to SSDs. It can be stored on NVM, or even in memory (for instance using /dev/shm). This has been tested on multiple supercomputers around the world, including MIRA at ANL, which did not have SSDs, so we performed the checkpoints in memory. Hope this helps, Leo
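To illustrate the in-memory option mentioned above, here is a minimal Python sketch of writing a local checkpoint to a memory-backed directory. It is not FTI's API: the helper names (`local_checkpoint_dir`, `write_checkpoint`, `read_checkpoint`) are hypothetical, and it assumes a Linux-style tmpfs mount at /dev/shm, falling back to an ordinary temporary directory when that path is not available.

```python
import os
import pickle
import tempfile


def local_checkpoint_dir(preferred="/dev/shm"):
    """Create a checkpoint directory, memory-backed if `preferred` is usable.

    Assumption: `preferred` is a tmpfs mount (as /dev/shm usually is on Linux).
    Falls back to the default temp location otherwise.
    """
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        return tempfile.mkdtemp(prefix="ckpt_", dir=preferred)
    return tempfile.mkdtemp(prefix="ckpt_")


def write_checkpoint(state, rank, directory):
    """Serialize one process's state to a per-rank file and return its path."""
    path = os.path.join(directory, f"rank{rank}.ckpt")
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path


def read_checkpoint(path):
    """Load a previously written checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Because the checkpoint goes through ordinary file I/O, the same code works whether the directory sits on an SSD, NVM, or in RAM; only the path changes. Data in a memory-backed directory is lost when the node goes down, which is the case the higher levels are meant to cover.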
Hey Leo,
I am using FTI in a project and am writing a project report. There is one thing we are not sure about: can L1 checkpointing be considered in-memory checkpointing? If not, does FTI support in-memory checkpointing?
Furthermore, can I rely on the following description of FTI L1, L2, L3, and L4 checkpointing as accurate? If not, could you please give me a current description? Thank you very much!
L1
L1 denotes the first safety level in the multilevel checkpointing strategy of FTI. The checkpoint of each process is written to the local SSD of the respective node. This is fast but has a drawback: if checkpoint data is lost or corrupted on even a single node, the execution cannot be successfully restarted.
L2
L2 denotes the second safety level of checkpointing. On initialisation, FTI creates a virtual ring for each group of nodes with a user-defined size (see group_size). The first step of L2 is just an L1 checkpoint. In the second step, the checkpoints are duplicated and the copies are stored on the neighbouring node in the group. That means that, in case of a failure and data loss on some nodes, the execution can still be successfully restarted, as long as the data loss does not happen on two neighbouring nodes at the same time.
L3
L3 denotes the third safety level of checkpointing. In this level, the checkpoint data chunks from each node are encoded with the Reed-Solomon (RS) erasure code. The implementation in FTI can tolerate the breakdown of, and data loss in, half of the nodes. In contrast to safety level L2, in level L3 it is irrelevant which of the nodes encounter the failure. The missing data can be reconstructed from the remaining RS-encoded data files.
L4
L4 denotes the fourth safety level of checkpointing. All the checkpoint files are flushed to the parallel file system (PFS).
Stay well, Lenny
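The recovery conditions stated for L2 and L3 in the description above can be sketched as a small Python model. This is an illustration only, not FTI code: the function names are hypothetical, nodes in a group are numbered 0 to group_size - 1, and the sketch assumes that the L2 copy of node n is stored on node (n + 1) mod group_size.

```python
def l2_recoverable(group_size, failed):
    """L2 (partner copy): restart is possible unless two ring neighbours fail.

    Assumption: node n's copy lives on node (n + 1) % group_size, so node n's
    checkpoint is lost only if both n and its successor are lost.
    """
    failed = set(failed)
    return not any((n + 1) % group_size in failed for n in failed)


def l3_recoverable(group_size, failed):
    """L3 (Reed-Solomon): restart is possible if at most half the group fails,
    regardless of which nodes they are (as stated in the description)."""
    return len(set(failed)) <= group_size // 2


if __name__ == "__main__":
    # Group of 4 nodes arranged in a ring: 0 -> 1 -> 2 -> 3 -> 0
    for failed in ([0, 2], [1, 2], [3, 0], [0, 1, 2]):
        print(failed, l2_recoverable(4, failed), l3_recoverable(4, failed))
```

In this model, losing nodes 1 and 2 of a 4-node group defeats L2 because they are neighbours, while L3 still recovers because only half of the group is lost.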