google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

Running with fully `mlock`ed memory #10530

Open · EtiennePerot opened 5 months ago

EtiennePerot commented 5 months ago

Description

Add a mode whereby all gVisor memory pages, including those of the sandboxed application, are mlocked (i.e. they cannot be paged out to swap).

This is useful for situations where it is desirable to leave no trace of the sandboxed workload on the host system.

Is this feature related to a specific bug?

See discussion here. Dangerzone is a project from the Freedom of the Press Foundation which handles potentially-dangerous, potentially-sensitive/confidential documents. As part of its document processing, it needs to run several applications and libraries (LibreOffice, PyMuPDF, etc.) which don't all support such a "traceless" mode. However, these applications run within gVisor, with all data only written to tmpfs mounts (which are backed by gVisor memory). Therefore, if we can mlock all gVisor memory pages, we can systematically guarantee that no traces of the document will be left on the host system.

Do you have a specific solution in mind?

Per @nixprime (thanks!), there are two possible solutions: an `mlockall()`-based approach, and `mlock`ing memory ranges as they are allocated by `MemoryFile.Allocate` (referred to below as the first and second solution, respectively).
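For concreteness, here is a minimal standalone sketch of the first approach using golang.org/x/sys/unix. The `lockAllMemory` helper and the `enableMlockAll` flag are illustrative stand-ins, not gVisor's actual implementation or an existing runsc option:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// enableMlockAll stands in for a hypothetical runsc configuration flag; it is
// not an existing gVisor option.
var enableMlockAll = true

// lockAllMemory asks the kernel to lock all current and future mappings of
// the calling process into RAM so that none of them can be paged out to swap.
func lockAllMemory() error {
	if !enableMlockAll {
		return nil
	}
	// MCL_CURRENT locks pages that are already mapped; MCL_FUTURE also locks
	// mappings created later (e.g. the Systrap memory file as it grows).
	// This can fail with ENOMEM/EPERM if RLIMIT_MEMLOCK is too low and the
	// process lacks CAP_IPC_LOCK.
	if err := unix.Mlockall(unix.MCL_CURRENT | unix.MCL_FUTURE); err != nil {
		return fmt.Errorf("mlockall: %w", err)
	}
	return nil
}

func main() {
	if err := lockAllMemory(); err != nil {
		panic(err)
	}
	fmt.Println("all current and future mappings are locked")
}
```

MCL_FUTURE is what makes this cover memory that the sandboxed application allocates later, since those pages are locked as soon as they are mapped.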

EtiennePerot commented 5 months ago

I've locally implemented the first solution (as a platform-specific feature) and confirmed that it works by checking that the `Locked` field of `/proc/<pid>/smaps_rollup` shows the gVisor threads having (most of) their memory locked. In my local testing there are still some non-mlocked parts of memory, mostly corresponding to code pages of shared libraries used by other applications on my system and to the vdso/vsyscall pages. The Systrap memory file is entirely mlocked, so all memory allocated by the sandboxed application at runtime is mlocked.
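For reference, a quick way to reproduce that check outside of gVisor is to read the `Locked` and `Rss` totals from `/proc/<pid>/smaps_rollup`. The sketch below is illustrative only and not part of gVisor:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// lockedAndRSSKiB returns the "Locked" and "Rss" totals (in KiB) reported by
// /proc/<pid>/smaps_rollup for the given PID.
func lockedAndRSSKiB(pid int) (locked, rss uint64, err error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/smaps_rollup", pid))
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Value lines look like "Locked:          123456 kB".
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 {
			continue
		}
		val, convErr := strconv.ParseUint(fields[1], 10, 64)
		if convErr != nil {
			continue // skip the header line
		}
		switch fields[0] {
		case "Locked:":
			locked = val
		case "Rss:":
			rss = val
		}
	}
	return locked, rss, scanner.Err()
}

func main() {
	pid := os.Getpid() // replace with the sandbox PID to inspect
	locked, rss, err := lockedAndRSSKiB(pid)
	if err != nil {
		panic(err)
	}
	fmt.Printf("pid %d: Locked=%d kB, Rss=%d kB\n", pid, locked, rss)
}
```

With the first solution enabled, `Locked` should be close to `Rss` for the sandbox process.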

I also briefly looked at the second solution but ran into the problem of how to actually mlock the range in `MemoryFile.Allocate`. IIUC, for that to work the range being mlocked has to be mapped in the address space, whereas (at least in the Systrap case) all of that memory lives only behind the memfd, which by default is not entirely mapped in the Sentry's address space. This means that the Sentry would need to keep track of extra mappings of this memory in its own address space alongside the `FileRange`s. That seems like it would add quite a bit of complexity, but perhaps I am missing something.

The remaining work for the first solution is to clean it up and turn it into a proper configurable option. For example, the `mlockall` system call should not be allowed through the seccomp-bpf filters in the normal mode; it should only be allowed in this fully-mlocked mode.
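As a sketch of how that gating might look, the snippet below conditionally adds `SYS_MLOCKALL` to an allowlist. The `syscallAllowlist` type and the `fullyMlocked` flag are illustrative stand-ins, not gVisor's actual `pkg/seccomp` or `runsc/boot/filter` API:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// syscallAllowlist is an illustrative stand-in for gVisor's seccomp-bpf
// filter rules; the real rules live in runsc/boot/filter and pkg/seccomp.
type syscallAllowlist map[uintptr]bool

// buildFilter returns the set of syscalls the Sentry may use. mlockall(2) is
// only added when the hypothetical fully-mlocked mode is enabled, so the
// normal configuration keeps the filter as tight as it is today.
func buildFilter(fullyMlocked bool) syscallAllowlist {
	allowed := syscallAllowlist{
		unix.SYS_MMAP:   true,
		unix.SYS_MUNMAP: true,
		// ... the rest of the Sentry's usual allowlist ...
	}
	if fullyMlocked {
		allowed[unix.SYS_MLOCKALL] = true
	}
	return allowed
}

func main() {
	for _, mode := range []bool{false, true} {
		f := buildFilter(mode)
		fmt.Printf("fullyMlocked=%v: mlockall allowed=%v\n", mode, f[unix.SYS_MLOCKALL])
	}
}
```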

nixprime commented 5 months ago

> I also briefly looked at the second solution but ran into the problem of how to actually mlock the range in `MemoryFile.Allocate`. IIUC, for that to work the range being mlocked has to be mapped in the address space, whereas (at least in the Systrap case) all of that memory lives only behind the memfd, which by default is not entirely mapped in the Sentry's address space.

After a5573312e02c ("Add explicit huge page and memory recycling support to pgalloc.MemoryFile"), the memfd is eagerly mapped in the Sentry's address space (as it expands); before that change it was mapped lazily. In any case, mappings are most easily accessed using `MemoryFile.MapInternal()`. For example, in the path through https://github.com/google/gvisor/blob/d59375d82e6301c08634e5d38c424fcf728ccda5/pkg/sentry/pgalloc/pgalloc.go#L709, `MemoryFile.Allocate()` gets mappings and invokes either `MADV_POPULATE_WRITE` (`tryPopulate()` => `tryPopulateMadv()`) or mlock+munlock (`tryPopulate()` => `tryPopulateMlock()`) on them, depending on availability. `MemoryFile.forEachMappingSlice()` is slightly faster than `MemoryFile.MapInternal()`; in particular, if the `FileRange` being iterated spans more than one chunk (an aligned 1GB region), `forEachMappingSlice()` avoids allocating a slice to back the `safemem.BlockSeq`.
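To make the populate path concrete, here is a simplified standalone sketch of the same idea as the `tryPopulate()` split described above: map a memfd-backed range and commit its pages, preferring `MADV_POPULATE_WRITE` and falling back to mlock+munlock when that advice is unavailable. Names and structure are illustrative, not gVisor's pgalloc code:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// populate commits pages for a mapping, preferring MADV_POPULATE_WRITE
// (Linux 5.14+) and falling back to mlock+munlock on kernels where that
// advice is unavailable. This mirrors the tryPopulateMadv/tryPopulateMlock
// split described above, but is a simplified illustration rather than
// gVisor's pgalloc implementation.
func populate(mapping []byte) error {
	if err := unix.Madvise(mapping, unix.MADV_POPULATE_WRITE); err == nil {
		return nil
	}
	// Fallback: mlock faults the pages in; munlock right afterwards so the
	// range ends up populated but not left locked.
	if err := unix.Mlock(mapping); err != nil {
		return fmt.Errorf("mlock: %w", err)
	}
	return unix.Munlock(mapping)
}

func main() {
	const size = 1 << 20 // 1 MiB
	// Back the mapping with a memfd, as pgalloc.MemoryFile does.
	fd, err := unix.MemfdCreate("example-memory-file", 0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)
	if err := unix.Ftruncate(fd, size); err != nil {
		panic(err)
	}
	mapping, err := unix.Mmap(fd, 0, size, unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(mapping)
	if err := populate(mapping); err != nil {
		panic(err)
	}
	fmt.Println("mapping populated")
}
```

In the fully-mlocked mode discussed in this issue, the second solution would keep such ranges locked instead of munlocking them after population.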