The current Containerfile cannot be built on NixOS because of NixOS's sandboxing. Building a NixOS-based OCI container would be possible, but on NixOS the only advantage a container would provide is sandboxing.
systemd offers a plethora of sandboxing options, including the same facilities (namespaces) that containers use. I believe the systemd service introduced in this commit should actually be more secure than a regular container, since it has all the same namespacing plus additional hardening, for example a very restrictive seccomp filter.
Below is the output of systemd-analyze security for the service. There are still a few things that it deducts points for, but overall we are very close to a perfect score:
RestrictAddressFamilies=~AF_(INET|INET6), PrivateNetwork=, IPAddressDeny=: We could use Unix sockets for proxying instead, but setting that up is non-trivial, so it is left as future work.
ProtectSystem=, ProtectHome=: ProtectSystem= is incompatible with the NixOS confinement option. Since the service runs in a separate mount namespace with its own root filesystem, these two should not matter.
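DeviceAllow=: ProtectClock= implicitly adds char-rtc:r here. systemd.exec(5) documents that ProtectClock= implies DeviceAllow=char-rtc r, so this read-only RTC entry is expected.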
NAME DESCRIPTION EXPOSURE
✓ SystemCallFilter=~@swap System call allow list defined for service, and @swap is not included
✓ SystemCallFilter=~@resources System call allow list defined for service, and @resources is not included
✓ SystemCallFilter=~@reboot System call allow list defined for service, and @reboot is not included
✓ SystemCallFilter=~@raw-io System call allow list defined for service, and @raw-io is not included
✓ SystemCallFilter=~@privileged System call allow list defined for service, and @privileged is not included
✓ SystemCallFilter=~@obsolete System call allow list defined for service, and @obsolete is not included
✓ SystemCallFilter=~@mount System call allow list defined for service, and @mount is not included
✓ SystemCallFilter=~@module System call allow list defined for service, and @module is not included
✓ SystemCallFilter=~@debug System call allow list defined for service, and @debug is not included
✓ SystemCallFilter=~@cpu-emulation System call allow list defined for service, and @cpu-emulation is not included
✓ SystemCallFilter=~@clock System call allow list defined for service, and @clock is not included
✓ RemoveIPC= Service user cannot leave SysV IPC objects around
✓ User=/DynamicUser= Service runs under a static non-root user identity
✓ RestrictRealtime= Service realtime scheduling access is restricted
✓ CapabilityBoundingSet=~CAP_SYS_TIME Service processes cannot change the system clock
✓ NoNewPrivileges= Service processes cannot acquire new privileges
✓ AmbientCapabilities= Service process does not receive ambient capabilities
✓ CapabilityBoundingSet=~CAP_BPF Service may load BPF programs
✓ SystemCallArchitectures= Service may execute system calls only with native ABI
✗ RestrictAddressFamilies=~AF_(INET|INET6) Service may allocate Internet sockets 0.3
✓ ProtectProc= Service has restricted access to process tree (/proc hidepid=)
✓ SupplementaryGroups= Service has no supplementary groups
✓ CapabilityBoundingSet=~CAP_SYS_RAWIO Service has no raw I/O access
✓ CapabilityBoundingSet=~CAP_SYS_PTRACE Service has no ptrace() debugging abilities
✓ CapabilityBoundingSet=~CAP_SYS_(NICE|RESOURCE) Service has no privileges to change resource use parameters
✓ CapabilityBoundingSet=~CAP_NET_ADMIN Service has no network configuration privileges
✓ CapabilityBoundingSet=~CAP_NET_(BIND_SERVICE|BROADCAST|RAW) Service has no elevated networking privileges
✓ CapabilityBoundingSet=~CAP_AUDIT_* Service has no audit subsystem access
✓ CapabilityBoundingSet=~CAP_SYS_ADMIN Service has no administrator privileges
✓ PrivateTmp= Service has no access to other software's temporary files
✓ ProcSubset= Service has no access to non-process /proc files (/proc subset=)
✓ CapabilityBoundingSet=~CAP_SYSLOG Service has no access to kernel logging
✓ PrivateDevices= Service has no access to hardware devices
✓ RootDirectory=/RootImage= Service has its own root directory/image
✗ ProtectSystem= Service has full access to the OS file hierarchy 0.2
✗ PrivateNetwork= Service has access to the host's network 0.5
✗ ProtectHome= Service has access to fake empty home directories 0.1
✗ DeviceAllow= Service has a device ACL with some special devices: char-rtc:r 0.1
✓ KeyringMode= Service doesn't share key material with other services
✓ Delegate= Service does not maintain its own delegated control group subtree
✓ PrivateUsers= Service does not have access to other users
✗ IPAddressDeny= Service defines IP address allow list with only localhost entries 0.1
✓ NotifyAccess= Service child processes cannot alter service state
✓ ProtectClock= Service cannot write to the hardware clock or system clock
✓ CapabilityBoundingSet=~CAP_SYS_PACCT Service cannot use acct()
✓ CapabilityBoundingSet=~CAP_KILL Service cannot send UNIX signals to arbitrary processes
✓ ProtectKernelLogs= Service cannot read from or write to the kernel log ring buffer
✓ CapabilityBoundingSet=~CAP_WAKE_ALARM Service cannot program timers that wake up the system
✓ CapabilityBoundingSet=~CAP_(DAC_*|FOWNER|IPC_OWNER) Service cannot override UNIX file/IPC permission checks
✓ ProtectControlGroups= Service cannot modify the control group file system
✓ CapabilityBoundingSet=~CAP_LINUX_IMMUTABLE Service cannot mark files immutable
✓ CapabilityBoundingSet=~CAP_IPC_LOCK Service cannot lock memory into RAM
✓ ProtectKernelModules= Service cannot load or read kernel modules
✓ CapabilityBoundingSet=~CAP_SYS_MODULE Service cannot load kernel modules
✓ CapabilityBoundingSet=~CAP_SYS_TTY_CONFIG Service cannot issue vhangup()
✓ CapabilityBoundingSet=~CAP_SYS_BOOT Service cannot issue reboot()
✓ CapabilityBoundingSet=~CAP_SYS_CHROOT Service cannot issue chroot()
✓ PrivateMounts= Service cannot install system mounts
✓ CapabilityBoundingSet=~CAP_BLOCK_SUSPEND Service cannot establish wake locks
✓ MemoryDenyWriteExecute= Service cannot create writable executable memory mappings
✓ RestrictNamespaces=~user Service cannot create user namespaces
✓ RestrictNamespaces=~pid Service cannot create process namespaces
✓ RestrictNamespaces=~net Service cannot create network namespaces
✓ RestrictNamespaces=~uts Service cannot create hostname namespaces
✓ RestrictNamespaces=~mnt Service cannot create file system namespaces
✓ CapabilityBoundingSet=~CAP_LEASE Service cannot create file leases
✓ CapabilityBoundingSet=~CAP_MKNOD Service cannot create device nodes
✓ RestrictNamespaces=~cgroup Service cannot create cgroup namespaces
✓ RestrictNamespaces=~ipc Service cannot create IPC namespaces
✓ ProtectHostname= Service cannot change system host/domainname
✓ CapabilityBoundingSet=~CAP_(CHOWN|FSETID|SETFCAP) Service cannot change file ownership/access mode/capabilities
✓ CapabilityBoundingSet=~CAP_SET(UID|GID|PCAP) Service cannot change UID/GID identities/capabilities
✓ LockPersonality= Service cannot change ABI personality
✓ ProtectKernelTunables= Service cannot alter kernel tunables (/proc/sys, …)
✓ RestrictAddressFamilies=~AF_PACKET Service cannot allocate packet sockets
✓ RestrictAddressFamilies=~AF_NETLINK Service cannot allocate netlink sockets
✓ RestrictAddressFamilies=~AF_UNIX Service cannot allocate local sockets
✓ RestrictAddressFamilies=~… Service cannot allocate exotic sockets
✓ CapabilityBoundingSet=~CAP_MAC_* Service cannot adjust SMACK MAC
✓ RestrictSUIDSGID= SUID/SGID file creation by service is restricted
✓ UMask= Files created by service are accessible only by service's own user by default
→ Overall exposure level for latex_templater.service: 1.1 OK 🙂
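To check the list of deductions against a report like the one above, the failed checks can be pulled out mechanically. The following is a minimal sketch, not part of this commit: the function name failed_checks and the SAMPLE excerpt are illustrative, and it assumes the report was captured as plain text (no terminal colour codes) with the exposure value as the last field of each ✗ line.

```python
import re

# A failed check looks like: "✗ <Setting> <description> <exposure>".
# Assumption: plain-text output, exposure is the final decimal on the line.
FAILED = re.compile(r"^✗\s+(\S+)\s+(.*\S)\s+(\d+\.\d+)$")


def failed_checks(report: str) -> list[tuple[str, float]]:
    """Return (setting, exposure) for every line marked with ✗."""
    results = []
    for line in report.splitlines():
        match = FAILED.match(line.strip())
        if match:
            results.append((match.group(1), float(match.group(3))))
    return results


# Excerpt of the report above: one passing line plus the six failed lines.
SAMPLE = """\
✓ PrivateDevices= Service has no access to hardware devices
✗ RestrictAddressFamilies=~AF_(INET|INET6) Service may allocate Internet sockets 0.3
✗ ProtectSystem= Service has full access to the OS file hierarchy 0.2
✗ PrivateNetwork= Service has access to the host's network 0.5
✗ ProtectHome= Service has access to fake empty home directories 0.1
✗ DeviceAllow= Service has a device ACL with some special devices: char-rtc:r 0.1
✗ IPAddressDeny= Service defines IP address allow list with only localhost entries 0.1
"""

if __name__ == "__main__":
    for setting, exposure in failed_checks(SAMPLE):
        print(f"{setting:45} {exposure}")
```

The six per-check values add up to 1.3, while the report states an overall exposure of 1.1, so the overall figure is evidently not a plain sum. The script is only useful for listing which settings still cost points, not for recomputing the score.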