Open pwmarcz opened 2 years ago
I was looking at the possibility of removing the `libos_handle::dentry` field. Unfortunately, this is still far away from being possible.
There are two Gramine problems that make removing `dentry` from the handle object (and using inodes instead) complex:
`handle->dentry` in favor of `handle->inode`.
The design problem is hard to fix, as Pawel explained in this issue. Fixing it would also carry a high performance overhead if child processes had to constantly check for updates on the main process (or vice versa, if the main process had to broadcast updates to children).
On the good side, I think the only problematic places in Gramine currently are:
- `vma->file->dentry` -- probably easy to fix (move to inodes)
- `g_process.exec->dentry` -- probably easy to fix (move to inodes)
This is a collection of notes about what I've learned working on Gramine's FS code: I'm leaving active Gramine development, so hopefully this will be useful for others.
Goals
Some plausible scenarios that I'm assuming we might have for synchronization:
- `O_APPEND` host file: multiple processes writing to a file, e.g. a log file. This is not possible now, because all file accesses are absolute, so processes will overwrite each other's data.
- Shared encrypted files: writing to an encrypted (protected) file from multiple processes, e.g. an SQLite database. Currently, we assume that only one process opens a file at a time, so if two processes write to it, they're likely going to corrupt the file.
- Shared tmpfs files: same, but for an in-memory system. These are currently separate for each process.
- File locks (`fcntl`, `flock`): We currently implement `fcntl` locks, but they're non-interruptible, which limits their usefulness.

~Sync engine~
Before (see https://github.com/gramineproject/graphene/issues/2158), I proposed and started to implement a "sync engine", a module based on the idea of synchronizing arbitrary data between processes. My thinking was that we could optimize the uncontested case (i.e. a single process doing most of the work) by keeping track of which process has the latest version at the moment.
I no longer think this is a good idea: the implementation ended up extremely over-engineered, with a complicated flow of messages being passed around, even before I got to more advanced features like exchanging non-trivial data, more complicated wait conditions, or interruptible waits.
I believe that good solutions for Gramine will be:
The idea that I think is worth keeping is relying on the process leader as the "server" that keeps all data.
Remember less data
We used to have a problem that when a (host) file got added or removed by another process, Gramine did not notice that. That was because we kept the files in dentry cache and relied on that data.
The (easy!) solution turned out to be: do not rely on the cache so much, but update the data every time. For instance, each `listdir` operation actually calls the host to list the directory again. If a new file appeared, we fill a dentry; if it disappeared, we clear a dentry.
This might be applicable in other situations as well: when in doubt, load the data from the host.
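The "list again and reconcile" idea can be sketched in a few lines of C. This is a standalone illustration, not Gramine's code: `struct child_cache`, `reconcile_children` and the fixed-size arrays are hypothetical names and simplifications, and the fresh host listing is passed in as a plain array of names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHILDREN 16
#define MAX_NAME 32

/* Hypothetical stand-in for the cached children of one directory. */
struct child_cache {
    char names[MAX_CHILDREN][MAX_NAME];
    bool present[MAX_CHILDREN]; /* false = slot is free / entry cleared */
};

/* Returns the slot index of `name` in the cache, or -1 if not cached. */
static int find_child(const struct child_cache* c, const char* name) {
    for (int i = 0; i < MAX_CHILDREN; i++)
        if (c->present[i] && strcmp(c->names[i], name) == 0)
            return i;
    return -1;
}

static bool in_listing(const char* name, const char* const* host, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(host[i], name) == 0)
            return true;
    return false;
}

/* Make the cache match a fresh host listing.
 * Returns the number of entries filled or cleared, or -1 if a name does
 * not fit or the cache is full. */
int reconcile_children(struct child_cache* c, const char* const* host, size_t n) {
    assert(c != NULL && (host != NULL || n == 0));
    int changes = 0;

    /* Clear cached entries that disappeared on the host. */
    for (int i = 0; i < MAX_CHILDREN; i++) {
        if (c->present[i] && !in_listing(c->names[i], host, n)) {
            c->present[i] = false;
            changes++;
        }
    }

    /* Fill entries for names that appeared on the host. */
    for (size_t j = 0; j < n; j++) {
        if (find_child(c, host[j]) >= 0)
            continue;
        if (strlen(host[j]) >= MAX_NAME)
            return -1;
        int slot = -1;
        for (int i = 0; i < MAX_CHILDREN; i++) {
            if (!c->present[i]) {
                slot = i;
                break;
            }
        }
        if (slot < 0)
            return -1;
        strcpy(c->names[slot], host[j]);
        c->present[slot] = true;
        changes++;
    }
    return changes;
}
```

The point of the sketch is that the host listing is treated as the source of truth on every call; the cache only mirrors it, so a file created or removed by another process is picked up the next time the directory is listed.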
Use Linux sources for inspiration
Actually, the easy solution described above was made possible by introducing inodes (#279). Before, we couldn't just clear a dentry so easily, because it represented a possibly open file.
More generally, I learned a lot by studying real sources of filesystem code in Linux: how dentries and inodes work, what kind of mutexes it uses and in what order, how `fcntl` locks are implemented, what callbacks it uses for the filesystem (e.g. position-independent `read`).
(I also looked at older, simpler versions of Linux, and at FreeBSD).
I'm not saying to blindly follow Linux: Gramine solves a different problem, and can implement many things in a simpler way. But it's a good starting point. Things are done in Linux this way for a good reason.
Support append mode on host?
Is writing to a (non-encrypted) host file a common use case? For instance, multiple processes logging to a file, probably opened with `O_APPEND`.
If so, then I think the best course of action is to implement real append mode in PAL, i.e. allow opening files in append mode. We haven't done it so far, I think because stateless operations (write at offset) are more "pure" and deterministic. However, this is a good place to compromise on that principle: append mode is a much better, simpler solution than any kind of synchronization between processes.
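To make the difference concrete, here is a small sketch using plain POSIX calls on the host (not Gramine's PAL API; the function names are made up for this example). Two file descriptors stand in for two processes. With "write at offset", each writer uses its own idea of the file position and the second write lands on top of the first; with `O_APPEND`, the kernel places every write at the current end of file.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static off_t file_size(const char* path) {
    struct stat st;
    return stat(path, &st) == 0 ? st.st_size : -1;
}

static void close_both(int a, int b) {
    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
}

/* Stateless "write at offset": each writer tracks its own position,
 * so both start at 0. Returns the resulting file size, or -1 on error. */
off_t write_with_offsets(const char* path) {
    assert(path != NULL);
    int a = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    int b = open(path, O_WRONLY);
    if (a < 0 || b < 0) {
        close_both(a, b);
        return -1;
    }
    off_t pos_a = 0, pos_b = 0; /* each writer's private file position */
    ssize_t ra = pwrite(a, "first\n", 6, pos_a);
    ssize_t rb = pwrite(b, "second\n", 7, pos_b); /* overwrites "first\n" */
    close_both(a, b);
    return (ra == 6 && rb == 7) ? file_size(path) : -1;
}

/* Append mode: the kernel positions each write at the end of the file.
 * Returns the resulting file size, or -1 on error. */
off_t write_with_append(const char* path) {
    assert(path != NULL);
    int a = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0600);
    int b = open(path, O_WRONLY | O_APPEND);
    if (a < 0 || b < 0) {
        close_both(a, b);
        return -1;
    }
    ssize_t ra = write(a, "first\n", 6);
    ssize_t rb = write(b, "second\n", 7); /* lands after "first\n" */
    close_both(a, b);
    return (ra == 6 && rb == 7) ? file_size(path) : -1;
}
```

On a local Linux filesystem, the first function leaves a 7-byte file (only the second record survives), while the second leaves 13 bytes with both records intact, and no coordination between the writers is needed.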
Serve files from process leader?
For shared encrypted files, or shared tmpfs files, I think it's worth investigating a client-server model: the "server", i.e. the host process, would make these files available to other processes over IPC.
I admit I haven't thought that through in detail; it's possible that this is also too complicated to consider. I would probably start by examining "prior art": NFS, FUSE, and the 9P protocol which promises to be simple.
`fcntl` locks
I implemented `fcntl` locks (https://github.com/gramineproject/graphene/pull/2481) in this client-server model: the process leader keeps information about the locks, and other processes use IPC for locking and unlocking. I think that might be a good starting point for further work on synchronization, but there are some problems that came up.
Identifying a file: how do I tell the process leader which file to operate on? The current implementation uses absolute paths (like `/foo/bar`) and thus stores information in dentries, not inodes. That's perhaps good enough, but it means corner cases around deleting or renaming a file are not handled correctly.
A perhaps related problem is to have consistent inode numbers between processes. Right now, inode numbers are derived from absolute paths using a deterministic function. That mostly works, but it gives no guarantee that there won't be a collision, and renaming a file changes its inode number.
Interruptible operations: The current implementation uses a "send IPC message and wait for response" primitive, but this is wrong: there is no easy way to interrupt waiting, so taking a lock can actually hang forever in Gramine. The primitive in question was not even meant for such cases.
To support interruptible requests like this, you probably need separate operations like "make a request", "wait for response", and "cancel my request". See #12 for discussion (and https://github.com/gramineproject/graphene/pull/2522 for a failed attempt at a fix).
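One possible shape for such a split is sketched below. It is a single-threaded model of the bookkeeping only: the names (`req_submit`, `req_complete`, `req_try_wait`, `req_cancel`) and the fixed-size table are hypothetical, no IPC message is actually sent, and a real implementation would need locking and a blocking wait that can be woken up by a signal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REQUESTS 8

struct request {
    unsigned long id; /* 0 = free slot */
    bool done;
    int result;
};

static struct request g_requests[MAX_REQUESTS];
static unsigned long g_next_id = 1; /* ids are never reused */

static struct request* find_request(unsigned long id) {
    if (id == 0)
        return NULL;
    for (int i = 0; i < MAX_REQUESTS; i++)
        if (g_requests[i].id == id)
            return &g_requests[i];
    return NULL;
}

/* "Make a request": register a pending request and return its id,
 * or 0 if the table is full. Sending the IPC message would happen here. */
unsigned long req_submit(void) {
    for (int i = 0; i < MAX_REQUESTS; i++) {
        if (g_requests[i].id == 0) {
            g_requests[i].id = g_next_id++;
            g_requests[i].done = false;
            g_requests[i].result = 0;
            return g_requests[i].id;
        }
    }
    return 0;
}

/* Called when a response arrives. Returns false if nobody is waiting
 * for it any more (e.g. the request was cancelled), so the late
 * response is simply dropped. */
bool req_complete(unsigned long id, int result) {
    struct request* r = find_request(id);
    if (r == NULL || r->done)
        return false;
    r->done = true;
    r->result = result;
    return true;
}

/* "Wait for response", one step: returns true and releases the request
 * if the response is in; returns false if it is still pending, so the
 * caller can check for signals and then retry or cancel. */
bool req_try_wait(unsigned long id, int* out_result) {
    assert(out_result != NULL);
    struct request* r = find_request(id);
    if (r == NULL || !r->done)
        return false;
    *out_result = r->result;
    r->id = 0;
    return true;
}

/* "Cancel my request": forget the request. Returns false if the id is
 * unknown (already consumed or cancelled). */
bool req_cancel(unsigned long id) {
    struct request* r = find_request(id);
    if (r == NULL)
        return false;
    r->id = 0;
    return true;
}
```

Because ids are never reused, a response that arrives after a cancel cannot be mistaken for the answer to a newer request. A real "cancel" would also have to tell the process leader to drop the queued lock request, which this sketch does not model.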
Boilerplate: The implementation is simpler than the "sync engine", but still required a lot of boilerplate code. We can probably do better.