Open EtiennePerot opened 5 months ago
I've locally implemented the first solution (as a platform-specific feature) and confirmed that it works by checking that the `Locked` field of `/proc/<pid>/smaps_rollup` shows the gVisor threads having (most of) their memory locked. In my local testing there are still some non-locked parts of memory, mostly corresponding to code pages of shared libraries used by other applications on my system, and to vdso/vsyscall pages. The Systrap memory file is entirely `mlock`ed, so all memory allocated by the sandboxed application at runtime is `mlock`ed.
I also briefly looked at the second solution but ran into the problem of how to actually `mlock` the range in `MemoryFile.Allocate`. For that to work, IIUC, the memory range to `mlock` needs to be part of the address space, whereas (at least in the Systrap case) all that memory only lives behind the memfd, which by default is not entirely mapped in the Sentry's address space. This means that the Sentry would need to keep track of extra mappings in its own address space for this memory along with the `FileRange`s. That seems like it would add quite a bit of complexity, but perhaps I am missing something.
The rest of the work left for the first solution is to clean it up and make it a proper configurable option. This means, for example, that the `mlockall` system call should not be allowed through the seccomp-bpf filters in the normal mode; it should only be allowed in this fully-`mlock`ed mode.
> I also briefly looked at the second solution but ran into the problem of how to actually `mlock` the range in `MemoryFile.Allocate`. For that to work, IIUC, the memory range to `mlock` needs to be part of the address space, whereas (at least in the Systrap case) all that memory only lives behind the memfd, which by default is not entirely mapped in the Sentry's address space.
After a5573312e02c ("Add explicit huge page and memory recycling support to pgalloc.MemoryFile"), the memfd is eagerly mapped in the Sentry's address space (as it expands); before that change, it was mapped lazily. In any case, mappings are most easily accessed using `MemoryFile.MapInternal()`; e.g. in the path through https://github.com/google/gvisor/blob/d59375d82e6301c08634e5d38c424fcf728ccda5/pkg/sentry/pgalloc/pgalloc.go#L709, `MemoryFile.Allocate()` gets mappings and invokes either `MADV_POPULATE_WRITE` (`tryPopulate() => tryPopulateMadv()`) or `mlock` + `munlock` (`tryPopulate() => tryPopulateMlock()`) on them, depending on availability. `MemoryFile.forEachMappingSlice()` is slightly faster than `MemoryFile.MapInternal()`; in particular, if the `FileRange` being iterated spans more than one chunk (aligned 1GB), then `forEachMappingSlice()` avoids allocating a slice to back the `safemem.BlockSeq`.
Description

Add a mode whereby all gVisor memory pages, including those of the sandboxed application, are `mlock`ed (i.e. they cannot be paged out to swap). This is useful for situations where it is desirable to leave no trace of the sandboxed workload on the host system.
Is this feature related to a specific bug?

See discussion here. Dangerzone is a project from the Freedom of the Press Foundation which handles potentially-dangerous, potentially-sensitive/confidential documents. As part of its document processing, it needs to run several applications and libraries (LibreOffice, PyMuPDF, etc.) which don't all support such a "traceless" mode. However, these applications run within gVisor, with all data written only to `tmpfs` mounts (which are backed by gVisor memory). Therefore, if we can `mlock` all gVisor memory pages, we can systematically guarantee that no traces of the document will be left on the host system.

Do you have a specific solution in mind?
Per @nixprime (thanks!), there are two possible solutions:

1. Call `mlockall(MCL_CURRENT|MCL_FUTURE|MCL_ONFAULT)` during boot. This works naturally for the KVM platform, but for Systrap we also need to call this for subprocesses, as it is not inherited across `fork()`. Because of the platform-specificity of this solution, it also requires adding some platform method to indicate whether the platform supports this feature. This solution also requires some hook in `pgalloc` for `mlockDisabled` to always be `false` in such a mode, such that `tryPopulateMlock` (which calls `munlock`) is skipped.
2. Add `pgalloc.MemoryFile.MlockAllocated`, which causes `MemoryFile.Allocate` to ignore `AllocOpts.Mode` and always `mlock` the allocated range, and causes `MemoryFile.runReclaim` to `munlock` the range. We need to confirm that `fallocate(..., FALLOC_FL_PUNCH_HOLE, ...)` works on `mlock`'d memory (so that `MemoryFile.decommitFile` still works), and we still need to call `mlockall(MCL_CURRENT|MCL_FUTURE|MCL_ONFAULT)` during boot if we want to `mlock` the Go heap/stack data (which could contain sensitive things like filenames, environment variables, etc.).