linux-application-whitelisting / fapolicyd

File Access Policy Daemon
GNU General Public License v3.0

fapolicy won't stop randomly blocking #205

Closed ccravens closed 2 years ago

ccravens commented 2 years ago

My apologies: I'm writing this issue with a bit of frustration, as this has been going on for months and I haven't been able to understand why fapolicy seems to randomly block a specific file from running even though I've added it both as a rule and to the trust database.

I'm trying to run RKE2 on 9 nodes. I apply the same fapolicy configurations to all nodes in the cluster. I'm able to run hundreds of pods, which all invoke runc; however, on one or two nodes runc will just be blocked by fapolicy at what seem to be random times. The launching of the cluster and its configuration are 100% automated by Ansible, but sometimes when I deploy the cluster it works fine and other times there are blocks. Here are my configurations:

[root@ip-192-168-96-10 ~]# cat /etc/fapolicyd/rules.d/01-app.rules 
# uids
%uuids=0,1000

# Run RKE2
allow perm=any all : dir=/opt/cni/
allow perm=any all : dir=/run/k3s/
allow perm=any all : dir=/var/lib/kubelet/
allow perm=any all : dir=/var/lib/rancher/
allow perm=any all : dir=/var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/
allow perm=any all : dir=/var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/runc
[root@ip-192-168-96-10 ~]# cat /etc/fapolicyd/trust.d/app 
# AUTOGENERATED FILE VERSION 2
# This file contains a list of trusted files
#
#  FULL PATH        SIZE                             SHA256
# /home/user/my-ls 157984 61a9960bf7d255a85811f4afcac51067b8f2e4c75e21cf4f2af95319d4ed1b87
/usr/bin/unzip 206704 299d6bae8ec58c76e087f8516cb6be438db2481bbab9b2b61a6c6a5c206a27f3
/var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/runc 11068888 b3276789a9b735b758e6292ce469192c9ef77514bf7fa3b3fef77d631a4e4ee3
[root@ip-192-168-96-10 ~]# sha256sum /var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/runc
b3276789a9b735b758e6292ce469192c9ef77514bf7fa3b3fef77d631a4e4ee3  /var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/runc
[root@ip-192-168-96-10 ~]# ausearch -m fanotify --raw | aureport --file --summary
File Summary Report
===========================
total  file
===========================
609  /var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/runc

I've even tried to up resources provided to the fapolicy engine:

    # https://www.mankier.com/5/fapolicyd.conf
    - name: Update FAPolicy Configurations
      lineinfile:
        path: /etc/fapolicyd/fapolicyd.conf
        regexp: '{{ item.regexp }}'
        line: '{{ item.line }}'
      with_items: [
        { regexp: 'permissive.*', line: 'permissive = 0'},
        { regexp: 'q_size.*', line: 'q_size = 4096'},
        { regexp: 'subj_cache_size.*', line: 'subj_cache_size = 2048'},
        { regexp: 'obj_cache_size.*', line: 'subj_cache_size = 10240'}
      ]
      become: yes

Any help / guidance that can be provided would be very helpful. I have pored over Red Hat docs and tried to test this hundreds of times over months now, and I have no idea why I can't seem to get fapolicy to run consistently or apply policies consistently.

kevinlmadison commented 2 years ago

:100:

radosroka commented 2 years ago

Hello, if the following is not a directory, it should be path= instead. And if you have it inside of a trust file, you don't need it.

allow perm=any all : dir=/var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/runc

Do you use integrity? Perhaps you could run fapolicyd with a --debug-deny option or switch policy to syslog... so you can see the detailed denials.

ccravens commented 2 years ago

Hello @radosroka, thank you so much for your response! I went ahead and changed dir to path and I'm still getting blocks.

[root@ip-192-168-64-10 ~]# ausearch -m fanotify --raw | aureport --file --summary
Email option is specified but /usr/lib/sendmail doesn't seem executable.Email option is specified but /usr/lib/sendmail doesn't seem executable.

File Summary Report
===========================
total  file
===========================
660  /var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/runc

Here are the rules. I even specified the rule file as a 01- in hopes the allow would be triggered first. Is there another rule downstream that is blocking this one file?

[root@ip-192-168-64-10 ~]# cat /etc/fapolicyd/rules.d/01-app.rules
# uids
%uuids=0,1000

# Run RKE2
allow perm=any all : dir=/opt/cni/
allow perm=any all : dir=/run/k3s/
allow perm=any all : dir=/var/lib/kubelet/
allow perm=any all : dir=/var/lib/rancher/
allow perm=any all : dir=/var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/
allow perm=any all : path=/var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/runc

Also I changed integrity from none to sha256:

[root@ip-192-168-64-10 ~]# cat /etc/fapolicyd/fapolicyd.conf
#
# This file controls the configuration of the file access policy daemon.
# See the fapolicyd.conf man page for explanation.
#

permissive = 0
nice_val = 14
q_size = 4096
uid = fapolicyd
gid = fapolicyd
do_stat_report = 1
detailed_report = 1
db_max_size = 50
subj_cache_size = 2048
subj_cache_size = 10240
watch_fs = ext2,ext3,ext4,tmpfs,xfs,vfat,iso9660,btrfs
trust = rpmdb,file
integrity = sha256
syslog_format = rule,dec,perm,auid,pid,exe,:,path,ftype,trust
rpm_sha256_only = 0

And the file is in the trust database:

[root@ip-192-168-64-10 ~]# cat /etc/fapolicyd/trust.d/app 
# AUTOGENERATED FILE VERSION 2
# This file contains a list of trusted files
#
#  FULL PATH        SIZE                             SHA256
# /home/user/my-ls 157984 61a9960bf7d255a85811f4afcac51067b8f2e4c75e21cf4f2af95319d4ed1b87
/usr/bin/unzip 206704 299d6bae8ec58c76e087f8516cb6be438db2481bbab9b2b61a6c6a5c206a27f3
/var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/runc 11068888 b3276789a9b735b758e6292ce469192c9ef77514bf7fa3b3fef77d631a4e4ee3

And the SHA256 matches:

[root@ip-192-168-64-10 ~]# sha256sum /var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/runc
b3276789a9b735b758e6292ce469192c9ef77514bf7fa3b3fef77d631a4e4ee3  /var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/runc
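For anyone repeating this check on several nodes, the size and hash comparison can be scripted. This is only a minimal sketch, assuming GNU coreutils (`stat -c`, `sha256sum`), paths without spaces, and trust entries in the `path size sha256` layout shown above. `check_trust_file` is a hypothetical helper name, not part of fapolicyd.

```shell
# Check each "path size sha256" entry in a trust file against the file on disk.
# Hypothetical helper; assumes GNU coreutils and paths without spaces.
check_trust_file() {
    trust_file="$1"
    status=0
    while read -r path size hash || [ -n "$path" ]; do
        # Skip comments and blank lines.
        case "$path" in ''|'#'*) continue ;; esac
        if [ ! -f "$path" ]; then
            echo "MISSING  $path"; status=1; continue
        fi
        actual_size=$(stat -c %s "$path")
        actual_hash=$(sha256sum "$path" | cut -d' ' -f1)
        if [ "$actual_size" = "$size" ] && [ "$actual_hash" = "$hash" ]; then
            echo "OK       $path"
        else
            echo "MISMATCH $path"; status=1
        fi
    done < "$trust_file"
    return "$status"
}
```

It would be called as `check_trust_file /etc/fapolicyd/trust.d/app`. A clean result only says the listed files match their entries; it says nothing about binaries that are not listed at all.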

What else can I do to see which rule is triggering a block? The problem is that it is blocking before I can ssh in and see which rule. What other configs can I provide to help me see what is going on?

Thank you!

stevegrubb commented 2 years ago

It would also be helpful to know which version of fapolicyd you are using. We have been adding diagnostic options to fapolicyd-cli. Some of the --check options might find something.

To figure out what is happening, I'd recommend what Radovan said. Use syslog for the denies and make sure trust is in the syslog format. If you can determine the rule that is blocking, the syslog output should allow a determination of why and then what to do about it. We are working on getting a kernel patch accepted so that this is in the audit logs, but until then, syslog should have the answer. Find the denial after logging in.

stevegrubb commented 2 years ago

Also, use fapolicyd-cli --list once you have the denial from syslog to locate which rule is the one.
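As a rough illustration of that lookup, the sketch below pulls the `rule=N` field out of a syslog deny line and finds the matching entry in saved `fapolicyd-cli --list` output. It assumes `rule` is part of `syslog_format` and that the list output numbers rules as `N. <rule text>`. Both function names are hypothetical.

```shell
# Extract the rule number from a fapolicyd syslog line.
# Assumes "rule" is included in syslog_format.
deny_rule_number() {
    printf '%s\n' "$1" | sed -n 's/.*[[:space:]]rule=\([0-9][0-9]*\).*/\1/p'
}

# Print the matching entry from saved `fapolicyd-cli --list` output,
# assuming rules are numbered as "N. <rule text>".
lookup_rule() {
    rule_no="$1"; list_file="$2"
    grep -E "^${rule_no}\. " "$list_file"
}

# Possible usage on a live system (not run here):
#   line=$(journalctl -b -u fapolicyd.service | grep 'dec=deny' | tail -n 1)
#   fapolicyd-cli --list > /tmp/rules.txt
#   lookup_rule "$(deny_rule_number "$line")" /tmp/rules.txt
```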

ccravens commented 2 years ago

Hello @stevegrubb

Version is:

rpm -qa | grep fap
rpm-plugin-fapolicyd-4.14.3-23.el8.x86_64
fapolicyd-1.1.3-8.el8.x86_64

ccravens commented 2 years ago

@stevegrubb @radosroka I appreciate your follow up to try and get me through this.

I'm having some trouble figuring out how to correctly apply syslog as was suggested. Are you saying that I should update all deny_audit rules to deny_syslog within the rules.d/ folder?

Also how do I find the rule id with the deny? Do I use ausearch and aureport to do that?

Thank you!

stevegrubb commented 2 years ago

Yes, change deny_audit to deny_syslog. The allows don't need touching. And do it to all rules files because we don't know where it's coming from. Then restart the daemon or server to reproduce.

You should be able to use "journalctl -b -u fapolicyd.service" to list out any events since boot by the fapolicyd service. Look for a deny message with your file. It will include the rule number and a bunch of other useful info. Paste that here. Also, do fapolicyd-cli --list and paste the matching rule here. Then we should be able to diagnose the issue.
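The rules-file change can be scripted along these lines. This is a sketch rather than an official procedure: it assumes GNU sed (`-i`), that every rule line starts with its decision keyword, and that the rules live in `*.rules` files under the directory passed in. `denies_to_syslog` is a hypothetical name.

```shell
# Rewrite deny_audit decisions to deny_syslog in every *.rules file of a directory.
# Hypothetical helper; assumes GNU sed and rules that start with the decision keyword.
denies_to_syslog() {
    rules_dir="${1:-/etc/fapolicyd/rules.d}"
    for f in "$rules_dir"/*.rules; do
        [ -f "$f" ] || continue
        # Allow rules are left untouched; only the leading deny_audit is replaced.
        sed -i 's/^deny_audit\([[:space:]]\)/deny_syslog\1/' "$f"
    done
}

# Afterwards, on the affected node (not run here):
#   systemctl restart fapolicyd
#   journalctl -b -u fapolicyd.service | grep 'dec=deny'
```

Copying the rules directory somewhere safe first makes it easy to switch back to `deny_audit` once the blocking rule is found.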

ccravens commented 2 years ago

Great! I'll try that now!

Also in regards to rule order, I named the file 01-app.rules in order to guarantee allow and not have those particular files and directories to be denied. So 2 questions:

1) Is this recommended practice? Where should I put these rules in terms of file ordering?
2) Is the 01 considered apply first or apply last? I think that sometimes a 0 may come after a 9 in terms of search precedence.

I'll reply tomorrow with the results of the deny_syslog thank you for the guidance!

ccravens commented 2 years ago

Ok seeing this over and over:

Oct 01 00:29:07 ip-192-168-96-10.us-gov-east-1.compute.internal fapolicyd[42077]: rule=39 dec=deny_syslog perm=execute auid=-1 pid=65487 exe=/var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/containerd-shim-runc-v2 : path=/ ftyp>
Oct 01 00:29:09 ip-192-168-96-10.us-gov-east-1.compute.internal fapolicyd[42077]: rule=39 dec=deny_syslog perm=execute auid=-1 pid=65577 exe=/var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/containerd-shim-runc-v2 : path=/ ftyp>

Blocked by this rule I think

39. deny_syslog perm=execute all : all

stevegrubb commented 2 years ago

This is rather odd. It is saying it wants to execute '/'. The problem is containers do some odd things that fanotify can't always see. I suspect the '/' means it can't see what is being executed. Both bind mounts and overlayfs block fanotify from seeing what is being done. This might be fixed in the 1.1.5 release, but it relies on FAN_MARK_FILESYSTEM which is such a big change that it might not be possible to backport to RHEL 8.

Let's figure out how to make a loophole for you. Add trust to the subject side:

syslog_format = rule,dec,perm,auid,pid,exe,trust,:,path,ftype,trust

Then rerun your test and paste the output here in two parts: the subject side, and below it the object side. It seems that GitHub has some kind of limit on how wide the text can be, so it is getting cut off.
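Splitting a log line into those two parts can be done at the ` : ` separator that the `:` in `syslog_format` produces. A minimal sketch, assuming the line contains that separator exactly as in the format above; `split_deny_line` is a hypothetical helper.

```shell
# Split a fapolicyd syslog line at the " : " separator from syslog_format.
# Fields before it describe the subject (the running process);
# fields after it describe the object (the file being accessed).
split_deny_line() {
    line="$1"
    printf 'subject: %s\n' "${line%% : *}"
    printf 'object:  %s\n' "${line#* : }"
}

# Possible usage (not run here):
#   journalctl -b -u fapolicyd.service | grep 'dec=deny' | tail -n 1 |
#       while read -r l; do split_deny_line "$l"; done
```

Each half is short enough to paste without being truncated.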

ccravens commented 2 years ago

Thank you @stevegrubb! Ok I'll do that now.

Yes, my suspicion is that there is more going on here than just a runc issue on the surface.

Although I think I resolved this particular scenario, because I think runc invokes the other executables within the /bin folder such as containerd-shim-runc-v2. So I am now explicitly adding all executables within the /bin directory to the trust database, not just runc, and I'm no longer getting any denials.

So even though the ausearch and aureport identify it as being a runc issue, my thought also is that runc is invoking another executable that was being blocked. I'll rebuild the environment with the syslog_format update, remove the fixes so I can re-introduce the error and provide that info.

ccravens commented 2 years ago

Oh and I also changed the name of the rule file from 01-app.rules to 11-app.rules just in case rule order precedence was the issue.

stevegrubb commented 2 years ago

Rule ordering does matter. But since rule 39 is blocking (and it's the catch all), anywhere ahead of it is fine. There is a README-rules document that says loopholes should be in the 20's. What I was going to suggest is a rule something like:

allow perm=any uid=0 trust=1 dir=/var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/ : all

What this says is as long as root runs a trusted program from the bin directory, it can access anything.
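On the ordering question, a quick way to see where a loophole file would land is to sort the file names. The sketch assumes rule files are read in sorted filename order. `01-app.rules` and `11-app.rules` come from this thread, while `25-rke2.rules` and `90-deny-execute.rules` are illustrative names only.

```shell
# Show the order in which rule files would be read, assuming sorted filename order.
# With zero-padded two-digit prefixes, "01" sorts first and "90" sorts last.
rule_file_order() {
    printf '%s\n' "$@" | LC_ALL=C sort
}

rule_file_order 90-deny-execute.rules 25-rke2.rules 01-app.rules 11-app.rules
```

This prints `01-app.rules`, `11-app.rules`, `25-rke2.rules`, then `90-deny-execute.rules`, so a file numbered in the 20's still sits ahead of a catch-all deny that is numbered higher.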

Also, since we changed the deny rules to syslog, ausearch should not be finding any fanotify events. If you want it to go to both syslog and audit, you can use deny_log which tells it to put them in both places. It is normal troubleshooting procedure to either run the fapolicy daemon in debug mode or send events to syslog for troubleshooting. You have to know the rule number that is blocking to fix it. A patch has been in the works for the kernel so that audit events have the rule number, but it's still not accepted.

ccravens commented 2 years ago

Logs generated:

Oct 01 15:02:43 ip-192-168-96-10.us-gov-east-1.compute.internal fapolicyd[44351]: rule=39 dec=deny_syslog perm=execute auid=-1 pid=70943 exe=/var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/containerd-shim-runc-v2 trust=0 :
path=/ ftype=application/x-executable trust=0

At this point I believe the disconnect is that ausearch and aureport were reporting the runc executable as being blocked, but in reality it's the containerd-shim-runc-v2 which is most likely executed by runc. Does that seem plausible?

stevegrubb commented 2 years ago

Ausearch will report what is being accessed in the PATH record. The SYSCALL record is the subject information. It looks like the shim is not trusted and is trying to execute something that we do not have information for because it's in an overlayfs or a bind-mounted directory (in other words, inside the container). I guess you could allow the shim to access anything. And if you want the shim trusted, you would use fapolicyd-cli --file add /var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/ and then you can use the rule I mentioned above.
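To see which binaries in that directory are still missing from a trust file before adding them, something like the following could help. It is only a sketch: it looks at file-based trust entries in the `path size sha256` layout and knows nothing about the rpmdb backend. `untrusted_in_dir` is a hypothetical helper name.

```shell
# List regular files in a directory whose full path has no entry in a trust file.
# Hypothetical helper; checks file-based trust entries only, not rpmdb.
untrusted_in_dir() {
    dir="${1%/}"; trust_file="$2"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Match the path as the whole first field of a trust entry.
        awk -v p="$f" '$1 == p { found = 1 } END { exit !found }' "$trust_file" \
            || echo "$f"
    done
}

# Possible usage (not run here):
#   untrusted_in_dir /var/lib/rancher/rke2/data/v1.22.6-rke2r1-e6c1502b55cd/bin/ \
#       /etc/fapolicyd/trust.d/app
```

In the setup described in this thread, this kind of check would be expected to list containerd-shim-runc-v2, since only runc had a trust entry.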

ccravens commented 2 years ago

Hello @stevegrubb I wanted to update that this has been resolved I believe by adding the containerd-shim-runc-v2 to the trust database.

Before I close this issue it seems that there may be some issues that you've potentially identified:

The problem is containers do some odd things that fanotify can't always see. I suspect the '/' means it can't see what is being executed. Both bind mounts and overlayfs block fanotify from seeing what is being done. This might be fixed in the 1.1.5 release, but it relies on FAN_MARK_FILESYSTEM which is such a big change that it might not be possible to backport to RHEL 8.

Before closing the issue, is there any additional debugging information I can provide to help determine if there is unexpected behavior or a potential bug? While my original problem was resolved by learning more about how to debug via syslog (thank you!), if there's a chance to improve the project with this, I'd be more than happy to provide more info to do so.

Thank you!

stevegrubb commented 2 years ago

The container issue was just recently identified. That is why we did the 1.1.5 release: to get something out there that might be more useful. I added some more text to the README.md file to describe how to get syslog info for troubleshooting. There is a kernel patch in the works that, when accepted, will get the right information in the audit logs for troubleshooting. That is really the problem right now. And there is a tool, fapolicy-analyzer, that should help people with their deployments. It should be in EPEL in a month or so. If I have some time, I'll look over the troubleshooting section again and see if I can reorganize it to make it easier to help people with their issues.

ccravens commented 2 years ago

Gotcha ok! I'll go ahead and close the issue and I look forward to these additional fixes / utility tools that will make managing fapolicy easier!

Thank you again for your help!