lkrg-org / lkrg

Linux Kernel Runtime Guard
https://lkrg.org

Add knob(s) to limit Netfilter, Netlink, or all CAP_NET_ADMIN access from containers #331

Open solardiz opened 2 months ago

solardiz commented 2 months ago

There's a constant stream of kernel vulnerabilities in Netfilter, including e.g. CVE-2024-1086 recently, that are exposed to users via containers - that is, user and network namespaces created by a host user specifically to perform the attack (exploit programs invoke unshare on their own). The only mitigations with upstream and Red Hat kernels are user.max_user_namespaces=0, user.max_net_namespaces=0, or blacklisting Netfilter kernel module(s). Unfortunately, these break commonly needed functionality. Ubuntu/AppArmor is able to disable just unprivileged users' creation of namespaces, which breaks only a little bit less.

We may want to invent a knob of our own that would limit access only to Netfilter and only in containers (user/network namespaces). Further, it could support an intermediate setting where it'd disallow Netfilter in nested containers, but leave it allowed (and exposed for attack, unfortunately) in top-level containers. A use case mentioned to me is:

The most obvious use case I'm thinking of is Kubernetes in Docker, for example: a KinD container will run Kubernetes inside it, and Kubernetes is using Netfilter for kube-proxy

solardiz commented 2 months ago

In terms of implementation, we'd probably need to hook nfnetlink_rcv (static and not exported, but accessed via a function pointer, so it should be intact). A problem is that with our current kretprobe hooks we "can't" prevent the original function from being called, and I don't see a non-invasive way to make it a no-op for one call.

nfnetlink_rcv uses netlink_net_capable(skb, CAP_NET_ADMIN), which makes me wonder whether we'd possibly want a knob to restrict access to all of Netlink instead. We could perhaps do that by hooking __netlink_ns_capable (exported).

And this makes me further wonder whether we could have a knob to restrict all uses of CAP_NET_ADMIN in non-init namespaces, which we could do from the security_capable LSM hook as used by ns_capable_common (the latter is static and not exported). We already hook security_capable for task integrity checking and pCFI (we hook it via kretprobe for consistency with our other hooks, not the way it was meant to be hooked). So, if we're fine with not limiting this to Netfilter or even Netlink, what we could do is add a check of security_capable arguments 2 and 3 (namespace and capability) in our p_capable_ret (or switch to proper LSM hooking).
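
To make the proposed check on arguments 2 and 3 concrete, here is a minimal userspace model of the decision only. It is a sketch under assumptions, not LKRG code: struct model_user_ns, model_restrict_net_admin, and model_net_admin_check are hypothetical names, the namespace is reduced to a parent pointer, and the real check would live in p_capable_ret or in a proper LSM hook. The only value taken from the kernel is the number of CAP_NET_ADMIN (12 in linux/capability.h).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Numeric value of CAP_NET_ADMIN in <linux/capability.h>. */
#define MODEL_CAP_NET_ADMIN 12

/* Hypothetical stand-in for the kernel's struct user_namespace. */
struct model_user_ns {
    const struct model_user_ns *parent; /* NULL for the initial namespace */
};

/* Hypothetical knob: 0 = off, 1 = deny CAP_NET_ADMIN outside the init namespace. */
int model_restrict_net_admin = 1;

/*
 * Models the extra decision made on the namespace and capability arguments.
 * Returns 0 to leave the kernel's own verdict alone, -EPERM to deny.
 */
int model_net_admin_check(const struct model_user_ns *ns, int cap)
{
    assert(ns != NULL);

    if (!model_restrict_net_admin)
        return 0;               /* knob disabled */
    if (cap != MODEL_CAP_NET_ADMIN)
        return 0;               /* other capabilities are unaffected */
    if (ns->parent == NULL)
        return 0;               /* initial namespace keeps the capability */

    return -EPERM;              /* CAP_NET_ADMIN relative to a non-init namespace */
}
```

In this model, a CAP_NET_ADMIN check against the initial namespace is left alone, the same check against any child namespace is denied, and every other capability passes through untouched.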

A question is then why a sysadmin would want to allow user+network namespaces at all. A possible reason is that network namespaces are apparently sometimes used (by some systemd services) to give up network access, which I guess would continue to work without a usable CAP_NET_ADMIN in there. Another reason is that our knob could make CAP_NET_ADMIN ineffective only starting with a certain namespace nesting depth (the sysctl value).
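
The nesting-depth variant could be modeled in userspace roughly as follows. This is a sketch under assumptions: struct sketch_ns and both function names are hypothetical, and the knob semantics (0 disables the restriction, N > 0 makes CAP_NET_ADMIN ineffective at depth N and deeper) are one possible reading of "the sysctl value", not a decided design. The depth is computed by walking parent pointers to keep the model self-contained; the kernel's struct user_namespace also records a level, which a real implementation could presumably use instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a user namespace: only the parent link matters here. */
struct sketch_ns {
    const struct sketch_ns *parent; /* NULL for the initial namespace */
};

/* Nesting depth: 0 for the initial namespace, 1 for a top-level container, ... */
int sketch_ns_depth(const struct sketch_ns *ns)
{
    int depth = 0;

    assert(ns != NULL);
    while (ns->parent != NULL) {
        ns = ns->parent;
        depth++;
    }
    return depth;
}

/*
 * min_depth models the sysctl value (assumed semantics):
 *   0      restriction disabled
 *   N > 0  CAP_NET_ADMIN is ineffective at nesting depth >= N
 * Returns 1 if CAP_NET_ADMIN should be ineffective in ns, 0 otherwise.
 */
int sketch_net_admin_ineffective(const struct sketch_ns *ns, int min_depth)
{
    if (min_depth <= 0)
        return 0;
    return sketch_ns_depth(ns) >= min_depth;
}
```

With these assumed semantics, a value of 1 covers all containers, while a value of 2 corresponds to the intermediate setting: a top-level container (such as the KinD one) keeps a usable CAP_NET_ADMIN, and anything nested inside it does not.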