systemd / systemd

The systemd System and Service Manager
https://systemd.io
GNU General Public License v2.0

PrivateBin / PrivateEtc / PrivateLib feature #6460

Closed topimiettinen closed 6 years ago

topimiettinen commented 7 years ago

Firejail, a security sandbox for user programs, can prepare temporary file systems on the fly that include only the files required by the contained program. With its private-bin option, the listed program(s) are copied to a tmpfs file system mounted at /bin (and other places). In the same fashion, with the private-etc option, files and directories from the system /etc can be copied to a temporary /etc directory. I have also submitted a patch to implement a new private-lib feature, which copies the dynamic libraries needed by the programs to a temporary /lib. Firejail also has private-tmp, which matches systemd's PrivateTmp directive, and private home directory features, but these are not of interest here.

Systemd could also benefit from a simple way to prepare minimal execution environments for services on demand. The corresponding directives could be PrivateBin, PrivateEtc and PrivateLib, and they would implement copying features similar to Firejail's. Perhaps the files could be bind mounted instead of copied.

The advantage would be that in case of a (limited) break-in of the service, it might still be possible to prevent escalation to an unlimited breach of the whole system, if the escalation happens to need access to further files in the system.

There's already the InaccessiblePaths directive, which (in theory) could be used to the same effect, but the manual effort of blacklisting all non-essential files is huge. A different approach to implementing this feature could be to add a new generic directive that whitelists only the paths which are to be retained, while everything else is made inaccessible. System images supplied for RootImage can contain a minimal set of files, but there's no tool to regenerate them on demand, for example at service restart when a new version of a service is installed.
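To make the blacklist effort concrete, here is a minimal Python sketch of what a blacklist built from a whitelist looks like. The helper names and the short /etc listing are hypothetical and only illustrate the idea; they are not part of systemd or Firejail.

```python
def paths_to_blacklist(all_paths, keep):
    """Return every path in all_paths that is not in keep, sorted."""
    keep_set = set(keep)
    return sorted(p for p in all_paths if p not in keep_set)


def render_inaccessible_paths(paths):
    """Render a single InaccessiblePaths= line (space-separated list)."""
    return "InaccessiblePaths=" + " ".join(paths)


# Hypothetical /etc listing; a real system has hundreds of entries.
etc_listing = [
    "/etc/passwd",
    "/etc/shadow",
    "/etc/ssh",
    "/etc/fstab",
    "/etc/resolv.conf",
]
# The only files the (hypothetical) service actually needs.
needed = ["/etc/passwd", "/etc/resolv.conf"]

blacklist = paths_to_blacklist(etc_listing, needed)
print(render_inaccessible_paths(blacklist))
```

The blacklist grows with everything installed on the system, whereas the whitelist only grows with what the service needs, which is the asymmetry the proposal tries to exploit.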

ghost commented 7 years ago

We already have superior technology for this. It is called "Mandatory Access Control".

The problem is that the technology is misunderstood (even by some people we might consider to be competent). It seems to be in some people's interest to call MAC "second level security". This is in part facilitated by the decision not to favor one MAC framework over another in Linux, which essentially gives the impression that MAC is second level security, or practically optional.

It is not. Every Linux distribution should, at least to some extent, support at least one MAC implementation. MAC is an enhancement to Unix access control; it complements it. Think of MAC as a security patch for DAC: it fixes a fundamental security issue in DAC, namely that its access control is decentralized. Just because you can choose between various MAC implementations does not mean that MAC as a whole is "second level security".

topimiettinen commented 7 years ago

I agree that a properly set up and enabled, comprehensive MAC would be superior to this feature (also superior to existing features like InaccessiblePaths and ReadOnlyPaths) and I don't think that MAC systems are by any means second level.

But I also think that new, complementary approaches to security like seccomp and restricted namespaces can go beyond old DAC and even MAC in some cases. For example, blocking privileged system calls like mount(2) and umount(2) at the seccomp level means that, to a degree, the MAC in the kernel does not even come into play, and a restricted file system could be very effective.

There are a lot of systems without a MAC (or the MAC is not enabled, or configured to any effective level of security for a service) and this feature would help them. MAC and systemd protections also complement each other rather than compete.

tixxdz commented 7 years ago

@topimiettinen your suggestion is good, but I am not sure how this will play with RootDirectory= or, as you note, eventually RootImage=. I think the problem does not lie here; rather, it is how most users want to generate and run their apps.

@doverride MAC frameworks are really good, but as you note there are multiple implementations with multiple policies. They are not secondary, but there are not enough people to maintain such code or policies given today's Linux use cases. It is easier for me to group resources in standard directories and use namespaces to hide what is not needed than to maintain policies. Actually, I think a vertical layer with seccomp and a horizontal one with namespaces is also good and a lot easier to maintain. It may not go as deep as MAC, but if you add some bits like NO_NEW_PRIVS, which also works in a vertical way, the design is clear and easy to maintain.

topimiettinen commented 7 years ago

@tixxdz: It should work similarly to PrivateTmp: I don't think there's much support for that for either RootDirectory or RootImage. Of course you can bind mount /tmp into the chroot and this should be equally possible with PrivateBin etc.

ghost commented 7 years ago

@topimiettinen I agree with the "defense in depth" argument. I just have a problem with developers who use the "MAC is second level security" argument to give it less priority, while at the same time giving the illusion that (mount) namespaces provide "first level security".

Also, let's not forget that this functionality requires that we associate even more privileges with the pid1 user space process, increasing its surface even more.

@tixxdz I don't think you should be comparing ChromeOS NNP with MAC. I don't think NNP is an access control framework. Also, I find it kind of offensive that systemd treats NNP as "first level security" and at the same time treats MAC as "second level security".

NNP and SELinux do not play nice together. Systemd shoves NNP down our throats and then refuses to take the SELinux aspect into consideration, forcing SELinux developers to patch systemd if they want to stay relevant, because systemd decided to abstract Linux with its "units". Why? Because NNP is "first level security" and MAC is "second level security"?

I don't have a problem with the fact that you're increasing pid1's surface to a point where it just needs full access to everything. I just want to be able to control this creature. I don't trust the code and I don't trust the people writing the code. MAC allows me to control what pid1 can and cannot do. I do not want to have to trust pid1 to the extent that I have no control over what it can and cannot do.

tixxdz commented 7 years ago

@topimiettinen oh ok, that makes sense, and we already support such cases. I can already think of the difficult case of network namespaces: exporting the name to let the ip tool work as intended, while avoiding any network namespace related operation in systemd.

The only challenge seems to be testing and maintaining such code. Firejail has a clear focus; for other, more complex tools it will be hard.

@doverride

> @tixxdz I don't think you should be comparing ChromeOS NNP with MAC. I don't think NNP is an access control framework. Also, I find it kind of offensive that systemd treats NNP as "first level security" and at the same time treats MAC as "second level security".

Actually I did not compare ChromeOS NNP and MAC, and I agree that MAC with appropriate policies is superior to the currently known features. The problem is that there are multiple MAC systems without a unified interface. If we go with SELinux, it is powerful, but as I said I am only speaking about code maintenance: we do not have the resources, and some kernel features are supported by all distros and are easier to use than others. systemd supports SELinux, Smack and AppArmor, and I guess that with more developers the related code in this area will be improved.

> I don't have a problem with the fact that you're increasing pid1's surface to a point where it just needs full access to everything. I just want to be able to control this creature.

We do not want to grow PID1; in the embedded world we want a maintainable system manager, not a beast. In any case this is unrelated: the issue is about PrivateBin / PrivateEtc / PrivateLib, which can be done with code that is already in systemd.

ghost commented 7 years ago

@tixxdz You wanted to create a system manager, so deal with it. You also deal with Unix permissions, do you not? I understand that Linus's decision not to favor one MAC over another might be a convenient excuse for you to just ignore MAC altogether, but the fact remains that MAC is part of the system that you want to manage.

I don't think this is unrelated, because with this patch we will be required to allow systemd to set up the namespace for /etc, /bin and /lib if we want to support this. One thing is for sure: one of the drawbacks of making this so easy is that people are going to try to use it whether it makes sense or not, because it's so easy...

Also, please give me just one practical example of mount namespacing /lib. Pretty much everything needs broad access to libraries...

topimiettinen commented 7 years ago

@doverride: I don't speak for the priorities of other developers, but at least I'm already using a MAC, TOMOYO. After having rapidly created very simple MAC rules with it for all services, I can now focus on other things.

This should not require further privileges compared to other existing systemd functionality and PID1 is already omnipotent.

ghost commented 7 years ago

@topimiettinen: It will require that we allow pid1 to "mounton" /etc, /bin and /lib.

topimiettinen commented 7 years ago

@doverride: a service may as you say need tens of libraries, but that's nothing compared to the full set of libraries in a system which may easily be in thousands. With PrivateLib the executable will be analyzed and the libraries copied automatically. If that's not sufficient (NSS or PAM), further libraries may be listed to be included.

I don't know about "mounton"; you are probably thinking of some specific SELinux policy, and of course such policies may need to adapt if they make this kind of distinction. From the kernel's perspective, bind mounts are equivalent wherever in the file system they are made.

ghost commented 7 years ago

@topimiettinen o so now we also need to allow pid1 to maintain libraries?

topimiettinen commented 7 years ago

@tixxdz: about the ip tool: if a service needs further executables to work, for example for ExecStartPre, either they should be included in the list of PrivateBin or the '+' way to run them with full privileges should be used.

@doverride: what's the problem with systemd making changes to file system? There's already InaccessiblePaths etc. which could be used for the same effect.

tixxdz commented 7 years ago

@doverride

> o so now we also need to allow pid1 to maintain libraries?

No, pid1 should not do that.

@topimiettinen

> @tixxdz: about the ip tool: if a service needs further executables to work, for example for ExecStartPre, either they should be included in the list of PrivateBin or the '+' way to run them with full privileges should be used.

'+' would be better, along with a way to provide or mount the named network namespace.

Actually, maybe these features can complement RootImage=, especially for an alternative /etc. I do not like the idea of copying stuff: image preparation, copying files and so on should be done by another tool, not by the same tool that runs them. However, overriding or updating configuration by pointing to another /etc2, in a way that is easy for users to understand and set, and then reloading the service with the previous RootImage=, can be a win. It can improve productivity by using configurations from another source. We can achieve this with today's systemd feature BindPaths=, but maybe a better abstraction is possible.

So in general, setting up the namespace or alternative mount points can be fine, but manipulating files individually seems too much. Let's see what others think.

Thanks!

topimiettinen commented 7 years ago

@tixxdz: I don't see this as an image preparation tool; simple operations like BindPaths are just combined in a user-friendly manner. The problem is that listing each individual file with simple operations soon becomes unmaintainable, so it's nice to have something that simplifies configuration.

The tool approach could mean that ExecStartPre calls the tool to prepare the directory for RootDirectory or image for RootImage while ExecStopPost tears down the directory. That could work too but the integration would not be so good. On the positive side, the external tool could be arbitrarily complex (for example download files) whereas the integrated version would be just able to extract library information from ELF files for the /lib, create tmpfs directories and make bind mounts but not much else.

poettering commented 7 years ago

So, did I get this right, for n running services you'd have at least n tmpfs instances around, each basically containing a copy of much (most?) of the OS tree? I am not sure this is generic enough really to fit into systemd's sandbox management, simply because it comes at an excessive memory cost, no? What am I missing?

Wouldn't it be better to support something like nspawn's --ephemeral concept for service units too? nspawn's --ephemeral concept uses btrfs snapshots where possible, and reflinks as fallback for other cases, and if those aren't available either, will use plain file copies. We'd use /var/tmp as destination however, rather than newly instantiated tmpfs, as that sounds more scalable to me. Runtime behaviour should be much nicer at least on systems that do reflinks or snapshots?

topimiettinen commented 7 years ago

@poettering: The use case is not to take the entire OS tree but just the minimum needed for the service. For PrivateBin, this would typically be just one executable. For PrivateLib and PrivateEtc, more files and directories are needed, but not in excessive amounts. If a service needs to start lots of additional programs or needs access to large amounts of data (database, web server), it might not be a good candidate for this concept. I think copying can also be avoided with clever use of bind mounts, which would keep memory use small. There could be a settable limit on the amount of memory that the tmpfs can use, for example 16MB, and perhaps multiple instances of one service could share the file system to reduce the memory cost further.

But I'm not opposed to using a disk-backed file system rather than a tmpfs. That could, for example, allow SELinux to relabel the items in the file system before the service starts. I suppose a tool/generator approach would suit this model better.

poettering commented 7 years ago

> @poettering: The use case is not to take entire OS tree but just the minimum needed for the service. For PrivateBin, this would typically be just one executable. For PrivateLib and PrivateEtc, more files and directories are needed but it should not be in excessive amounts.

How is it determined what to copy?

topimiettinen commented 7 years ago

PrivateBin: it can be determined from ExecStart etc. lines and the user may list additional binaries. PrivateEtc: all files/directories must be identified by user. PrivateLib: determined by reading ELF library information from the binaries (not very difficult) and the user may specify additional libraries (e.g. NSS or PAM).
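As a rough illustration of the PrivateLib idea, the following Python sketch extracts library paths from `ldd`-style text output. Note the substitution: it parses the output of the `ldd` tool rather than reading the ELF dynamic section directly, the function name and the sample output are assumptions made for this example, and it only sees link-time dependencies, so libraries loaded at run time (NSS, PAM) would still have to be listed by hand.

```python
def parse_ldd_output(text):
    """Extract absolute library paths from ldd-style output text."""
    libs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=>" in line:
            # Form: "libc.so.6 => /lib64/libc.so.6 (0x...)"
            target = line.split("=>", 1)[1].strip()
        else:
            # Form: "/lib64/ld-linux-x86-64.so.2 (0x...)"
            # or    "linux-vdso.so.1 (0x...)"
            target = line
        # Drop the trailing load address, if present.
        path = target.split(" (", 1)[0].strip()
        # Keep only resolved absolute paths; this skips the vDSO
        # and "not found" entries. Also avoid duplicates.
        if path.startswith("/") and path not in libs:
            libs.append(path)
    return libs


# Hypothetical ldd output for some service binary.
sample = (
    "\tlinux-vdso.so.1 (0x00007ffc1a3f2000)\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f3b5c000000)\n"
    "\tlibmissing.so.1 => not found\n"
    "\t/lib64/ld-linux-x86-64.so.2 (0x00007f3b5c400000)\n"
)

for lib in parse_ldd_output(sample):
    print(lib)
```

The function is deliberately pure (text in, list out) so that it never executes the binary itself; the `ldd` manual page warns against running `ldd` on untrusted executables.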

poettering commented 7 years ago

> PrivateBin: it can be determined from ExecStart etc. lines and the user may list additional binaries.

This falls apart if those programs invoke shell scripts as many deps might not be visible to the user...

> determined by reading ELF library information from the binaries (not very difficult)

This can never work. Programs use dlopen(), even if they don't intend to, due to NSS and stuff. I am sorry, but this is not going to fly... initrd implementations try to chase down things this way, but I am very sure this has no place in service management, sorry...

I think the much better approach would be to build separate, minimal images as people need. i.e. use portable services, restrict the package set you need and only include exactly what you need in the image. I mean, listing files and processing dependencies, that's a package manager thing, and I think it's OK if they do that, there's no need to reimplement all that as part of service management.

I think we should work on making running services directly off OS images nicer and workable, and provide a toolset to easily build images like that (e.g. mkosi).

topimiettinen commented 7 years ago

> This falls apart if those programs invoke shell scripts as many deps might not be visible to the user...

Yes, but services which launch other programs aren't common. In my system there's systemd & udevd, acpid and of course the display manager. I could restrict other services like I do with the MAC.

> This can never work. Programs use dlopen(), even if they don't intend to, due to NSS and stuff.

Still, the set of libraries that are potentially needed by NSS or PAM is orders of magnitude smaller than the entire set of libraries in the system. If the user chooses to use this option and it breaks NSS, they get to fix it but I don't think that will be so hard, they just add a couple more libraries from the usual suspects list.

But if you don't think the fully integrated version (PrivateXyz) makes sense, do you think a generic whitelisting directive with no automation whatsoever that would complement InaccessiblePaths would be OK?

sourcejedi commented 6 years ago

> But if you don't think the fully integrated version (PrivateXyz) makes sense, do you think a generic whitelisting directive with no automation whatsoever that would complement InaccessiblePaths would be OK?

I might be missing something. But it seems OK enough that we implemented it now :-P.

# /etc/systemd/system/test.service
[Service]
TemporaryFileSystem=/etc
BindPaths=/etc/passwd

#ExecStart=/bin/cat /proc/self/mountinfo
ExecStart=/bin/ls -l /etc

Type=oneshot
RemainAfterExit=yes

Result:

Sep 06 11:58:51 fedora-28 systemd[1]: Starting test.service...
Sep 06 11:58:51 fedora-28 ls[354]: total 4
Sep 06 11:58:51 fedora-28 ls[354]: -rw-r--r--. 1 root 0 1337 Aug 27 19:41 passwd
Sep 06 11:58:51 fedora-28 systemd[1]: Started test.service.

topimiettinen commented 6 years ago

Yes, the issue can be closed now. Functionality equal to PrivateBin and PrivateEtc exists and detection of dynamic libraries can be easily implemented outside of systemd.
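A minimal sketch of what such an external helper could look like, assuming it simply emits a `[Service]` drop-in built from the `TemporaryFileSystem=` directive shown in the test unit above plus one `BindReadOnlyPaths=` line per retained path. The function name `render_dropin` and the input layout are hypothetical, and the optional `size=` tmpfs mount option corresponds to the memory limit idea raised earlier in the thread.

```python
def render_dropin(visible, tmpfs_options=None):
    """Render a [Service] drop-in that hides directories behind a tmpfs.

    visible maps each directory to hide (e.g. "/etc") to the list of
    paths below it that should stay visible, read-only.
    tmpfs_options is an optional tmpfs mount option string.
    """
    lines = ["[Service]"]
    for directory in sorted(visible):
        mount = directory
        if tmpfs_options:
            # Mount options are appended after a colon.
            mount = f"{directory}:{tmpfs_options}"
        lines.append(f"TemporaryFileSystem={mount}")
        prefix = directory.rstrip("/") + "/"
        for path in visible[directory]:
            if not path.startswith(prefix):
                raise ValueError(f"{path} is not below {directory}")
            lines.append(f"BindReadOnlyPaths={path}")
    return "\n".join(lines) + "\n"


# Hypothetical whitelist for a service that only reads two files in /etc.
dropin = render_dropin(
    {"/etc": ["/etc/passwd", "/etc/resolv.conf"]},
    tmpfs_options="size=16m",
)
print(dropin, end="")
```

The output could be written to a drop-in file for the unit and regenerated whenever the service's package is updated, which keeps file selection outside of PID 1.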