Open dimakuv opened 2 years ago
Important features of the current system:
Modular: it's easy to define checkpoint/restore procedures for a new object type.
Handles arbitrary graphs: the system can handle objects referring to other objects, even if an object is referred to multiple times, or if there are cycles.
Better API: a big pain point with the current system is that it's heavily obfuscated by macros (for function names, argument lists, individual operations, and even returning in case of errors). That should not be necessary if we define the right set of operations.
Explicit data handling: there should be no implicitly copied fields, or "`memcpy` and fixup". A checkpoint procedure should do something with every field: either copy the data directly (makes sense for scalars), recursively invoke checkpointing for another object (makes sense for pointers to other objects), or do something custom.
No linker hacks: We define a `.migratable` section to copy some data directly (but it's almost unused and should be easy to remove). Also, we create separate sections for checkpoint/restore procedures so that we don't have to define any global array for them in code (instead we use an index into the section). I think it would be better to accept code duplication, and make things simpler under the hood.
Stream-based: operations should not operate on a copied blob of data, but read/write data to a stream (we now do that with VMA). This solves a number of problems that we have with the blob-based design (difficulty implementing `realloc` and `free`, size limits, memory that's impossible to deallocate).
Customizable by filesystems: right now there is still some filesystem-specific logic in the checkpoint procedures for dentry and handle. It would be cleaner to define this custom logic in filesystem callbacks. We have `icheckpoint` and `irestore` for inodes, but they're deliberately kept separate from the checkpoint/restore system (since it's hard to use "from outside"). That should change once the system has a better API.
Stop other threads while checkpointing: currently, other threads might change data while we checkpoint it, possibly resulting in an inconsistent "snapshot". Ideally, we should stop other threads while checkpointing. We might get away with stopping just Gramine (prevent other threads from entering "kernel mode" by taking an exclusive lock), but keep in mind that we also checkpoint user memory. This is made more difficult by the fact that some I/O operations in Gramine are currently not interruptible.
Remember that PAL has its own checkpointing system for handles; it doesn't work very well for more complicated cases (like protected files). If we cannot remove it, it should be as simple as possible (with LibOS doing most of the work).
Maybe it should be possible to use the system in other contexts? We said before that it could be useful for synchronizing data between processes. But it might not be easy to extend the system to update existing data instead of copying it.
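To make the points above about graphs, explicit field handling, and streams more concrete, here is a minimal sketch of what such a procedure could look like. All names (`cp_stream`, `cp_ctx`, `checkpoint_node`, `restore_node`) are hypothetical and are not Gramine's actual API; the fixed-size object table is only an assumption to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable in-memory stream (not Gramine's real API). */
struct cp_stream {
    uint8_t* buf;
    size_t len, cap, pos; /* `pos` is the read cursor used during restore */
};

static int cp_write(struct cp_stream* s, const void* data, size_t size) {
    assert(s && data);
    if (s->len + size > s->cap) {
        size_t new_cap = s->cap ? s->cap * 2 : 64;
        while (new_cap < s->len + size)
            new_cap *= 2;
        uint8_t* new_buf = realloc(s->buf, new_cap);
        if (!new_buf)
            return -1;
        s->buf = new_buf;
        s->cap = new_cap;
    }
    memcpy(s->buf + s->len, data, size);
    s->len += size;
    return 0;
}

static int cp_read(struct cp_stream* s, void* data, size_t size) {
    if (s->pos + size > s->len)
        return -1;
    memcpy(data, s->buf + s->pos, size);
    s->pos += size;
    return 0;
}

/* Example object type; `next` may point to a shared object or form a cycle. */
struct node {
    int value;
    struct node* next;
};

#define CP_MAX_OBJS 64
#define CP_NULL_ID  UINT32_MAX

/* Objects seen so far; the table index doubles as the object's ID in the stream. */
struct cp_ctx {
    struct node* objs[CP_MAX_OBJS];
    uint32_t count;
};

static int checkpoint_node(struct cp_ctx* ctx, struct cp_stream* s, const struct node* n) {
    uint32_t id = CP_NULL_ID;
    if (!n)
        return cp_write(s, &id, sizeof(id));
    for (id = 0; id < ctx->count; id++) {
        if (ctx->objs[id] == n)
            return cp_write(s, &id, sizeof(id)); /* seen before: emit a back-reference */
    }
    if (ctx->count == CP_MAX_OBJS)
        return -1;
    ctx->objs[ctx->count++] = (struct node*)n; /* register before recursing (handles cycles) */
    if (cp_write(s, &id, sizeof(id)) < 0)      /* here `id` equals the newly assigned ID */
        return -1;
    if (cp_write(s, &n->value, sizeof(n->value)) < 0) /* scalar field: copy directly */
        return -1;
    return checkpoint_node(ctx, s, n->next);          /* pointer field: recurse */
}

static int restore_node(struct cp_ctx* ctx, struct cp_stream* s, struct node** out) {
    uint32_t id;
    if (cp_read(s, &id, sizeof(id)) < 0)
        return -1;
    if (id == CP_NULL_ID) {
        *out = NULL;
        return 0;
    }
    if (id < ctx->count) {
        *out = ctx->objs[id]; /* back-reference to an already restored object */
        return 0;
    }
    if (id != ctx->count || id >= CP_MAX_OBJS)
        return -1; /* IDs must appear in the order they were assigned */
    struct node* n = calloc(1, sizeof(*n));
    if (!n)
        return -1;
    ctx->objs[ctx->count++] = n;
    *out = n;
    if (cp_read(s, &n->value, sizeof(n->value)) < 0)
        return -1;
    return restore_node(ctx, s, &n->next);
}
```

In this sketch every field is handled explicitly, and an object is registered in the table before its pointers are followed, so a cycle such as `a -> b -> a` is written as two objects plus one back-reference. Cleanup on error paths and a real lookup structure (instead of the linear scan) are left out for brevity.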
This describes the current state of the checkpoint-restore refactoring project.
Legend:
:heavy_check_mark: Done (merged to master)
:construction: In progress (usually has a PR open)
:star: Next (usually will be unlocked by current "in progress")
Bug fixes and new features
Gramine leaks FDs of named pipes
Gramine can leak FDs (i.e., have holes in the FD map) if the user app never actually opens the named pipe.
The `pipe/fs.c` file contains the implementation of named pipes (`fifo`s). We create two temporary handles for the read and write ends of the pipe, with corresponding PAL handles, and put them in the process FD table. It would be better to store the temporary handles directly, without allocating FDs for them. However, using FDs makes it easier to checkpoint a named pipe. So the checkpoint rewrite should consider this corner case.
Support for SCM_RIGHTS
In Linux, it is possible to send/receive FDs via `SCM_RIGHTS` on a UNIX domain socket. In Gramine, this would require a way to checkpoint one particular FD and its associated shim handle & PAL handle. See comments in https://github.com/gramineproject/graphene/pull/1511.
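
For reference, this is roughly what the application side looks like on native Linux. It is only a sketch of the standard `sendmsg`/`recvmsg` usage that Gramine would have to emulate; the helper names `send_fd` and `recv_fd` are made up for this example and nothing here is Gramine code.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send `fd` over the connected UNIX domain socket `sock`. Returns 0 or -1. */
static int send_fd(int sock, int fd) {
    assert(sock >= 0 && fd >= 0);
    char byte = 0; /* on stream sockets, ancillary data needs at least one byte of real data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align; /* ensures correct alignment of the control buffer */
    } ctrl;
    memset(&ctrl, 0, sizeof(ctrl));

    struct msghdr msg = {0};
    msg.msg_iov        = &iov;
    msg.msg_iovlen     = 1;
    msg.msg_control    = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    struct cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one FD from `sock`. Returns the new FD in this process, or -1. */
static int recv_fd(int sock) {
    assert(sock >= 0);
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    memset(&ctrl, 0, sizeof(ctrl));

    struct msghdr msg = {0};
    msg.msg_iov        = &iov;
    msg.msg_iovlen     = 1;
    msg.msg_control    = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS
            || cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received FD is a new descriptor that refers to the same open file description as the one that was sent, which is why an implementation in Gramine would need to transfer exactly one handle (and its PAL handle) rather than a whole process state.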