Incompatibility with newer kernels

deathtrip commented 4 years ago

Newer kernel versions have broken opensnitch for me.

I used Arch Linux with the hardened kernel, and everything was fine until version 5.6.16 iirc. Since that version, every attempt by any program to access the network causes the entire system to freeze, even simple things like ping or starting chromium. The vanilla 5.6 kernel still worked fine.

When I updated the vanilla kernel to 5.7.2, all network requests were blocked (ping, DNS, etc.). Disabling the opensnitchd service both stopped the crashes on the hardened kernel and restored network access on vanilla. Since 5.7.4, even stopping the opensnitchd service won't restore network access, and I have to reboot to get it back.

I wonder if we have some people on new kernels, who can check it out.

gustavo-iniguez-goya commented 4 years ago

Can you tell us more about your system? I need at least the following logs and information:

First of all, set debug level to DEBUG.

/var/log/opensnitchd.log*
opensnitch settings (I'm interested in the process monitor method you're using, /proc, ftrace or audit)
hardened kernel information (grsecurity, pax, LSM, ..) https://wiki.archlinux.org/index.php/Security#Kernel_hardening ?
restrictions applied via sysctl (/etc/sysctl*)
libnetfilter_queue and libnfnetlink versions.
journalctl -ar > journalctl.txt
/var/log/syslog
/var/log/messages
kernel panic oops if you can gather it (in dmesg maybe)
did you notice any pattern that leads to the crash?

Please provide all this information, or as much as you can gather. You can email it to me if you prefer, rather than posting it here. Let's see if we can find the problem, or a way to reproduce it.
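For anyone collecting the files requested above, here is a minimal Python sketch that bundles them into one archive. The glob patterns are copied from the list; the function names and the archive name are made up for illustration, and which files exist depends on the distribution. Reading the logs usually requires root, and the archive should be reviewed before sharing since logs can contain private data.

```python
import glob
import tarfile

# Patterns taken from the list above; not every system has all of them.
PATTERNS = [
    "/var/log/opensnitchd.log*",
    "/etc/sysctl*",
    "/var/log/syslog",
    "/var/log/messages",
]


def expand(patterns):
    """Return the sorted, de-duplicated paths that match any of the patterns."""
    found = set()
    for pattern in patterns:
        found.update(glob.glob(pattern))
    return sorted(found)


def bundle(paths, archive):
    """Write the paths into a gzip-compressed tar archive.

    Directories are added recursively. Returns the number of top-level
    paths that were added.
    """
    with tarfile.open(archive, "w:gz") as tar:
        for path in paths:
            tar.add(path)
    return len(paths)
```

Calling bundle(expand(PATTERNS), "opensnitch-report.tar.gz") would produce the archive; the journalctl output and library versions from the list still have to be gathered separately.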

DragoonAethis commented 4 years ago

I'm running OpenSnitch on Arch w/ 5.7.4-1-ck-skylake kernel and it's working fine (nothing suspicious in logs, process monitor is /proc).

deathtrip commented 4 years ago

By hardened kernel I mean the linux-hardened package from the repo.
libnfnetlink 1.0.1
libnetfilter_queue 1.0.5
Process monitor method is /proc
I tried disabling sysctl settings but they had no effect, and I also couldn't find anything suspicious using journalctl.

After further research, it looks like DNS requests are causing the crashes/hangups. I run unbound as my local resolver, and I noticed that there were no more requests from the user "unbound" in the UI. Ping reported "Temporary failure in name resolution". Disabling unbound didn't solve the problem. When I booted with opensnitchd disabled and enabled it afterwards, the already established connections persisted.

I got it working again by removing the following kernel command line options: slab_nomerge slub_debug=FZP vsyscall=none module.sig_enforce=1 page_alloc.shuffle=1

Any DNS request with these options and opensnitchd running will cause the mentioned symptoms. Here's what opensnitchd.log shows when it couldn't resolve anything. First it was this:

[2020-06-21 00:04:46] IMP Starting opensnitch-daemon v1.0.0rc10
[2020-06-21 00:04:46] INF Loading rules from /etc/opensnitchd/rules ...
[2020-06-21 00:04:46] WAR Is opnensitchd already running?
[2020-06-21 00:04:46] !!! Error while creating queue #0: Error opening Queue handle: protocol not supported

and then:

[2020-06-21 07:20:59] IMP Starting opensnitch-daemon v1.0.0rc10
[2020-06-21 07:20:59] INF Loading rules from /etc/opensnitchd/rules ...
[2020-06-21 07:20:59] WAR Is opnensitchd already running?
[2020-06-21 07:20:59] !!! Error while creating queue #0: Error binding to queue: operation not permitted

So it's one (or more) of the kernel command line options. I haven't had time yet to find out which one it is. The linux-hardened package also enforces PTI, so that may have something to do with the system freezing under it, as I don't use it under the regular kernel. Let's see if anyone can reproduce it now.
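To check quickly whether a machine is booted with any of the options listed above, something like the following Python sketch could help. It is only an illustration: the function names are invented here, and the parser splits on whitespace, so quoted values containing spaces are not handled.

```python
# Option names taken from the kernel command line quoted above.
SUSPECT_OPTIONS = [
    "slab_nomerge",
    "slub_debug",
    "vsyscall",
    "module.sig_enforce",
    "page_alloc.shuffle",
]


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    options = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        options[name] = value if sep else None
    return options


def active_suspects(cmdline_text, suspects=SUSPECT_OPTIONS):
    """Return the suspect options that are present, with their values."""
    options = parse_cmdline(cmdline_text)
    return {name: options[name] for name in suspects if name in options}
```

On a running system the input would come from /proc/cmdline, for example active_suspects(open("/proc/cmdline").read()).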

gustavo-iniguez-goya commented 4 years ago

thank you very much for the information @deathtrip !

You're using libnetfilter_queue 1.0.5, and there have been a lot of changes from 1.0.3 to 1.0.5, I'm wondering if the problem also reproduces using libnetfilter_queue 1.0.3.

A few days ago someone on the original repo also reported a kernel panic using kernel 5.6.16, with this backtrace:

? nfqnl_reinject+0x4a/0x70 [nfnetlink_queue]
 nfqnl_reinject+0x4a/0x70 [nfnetlink_queue]
 nfqnl_recv_verdict+0x30d/0x500 [nfnetlink_queue]
 nfnetlink_rcv_msg+0x166/0x2e0 [nfnetlink]
 ? nfnetlink_net_exit_batch+0x60/0x60 [nfnetlink]
 netlink_rcv_skb+0x75/0x140
 netlink_unicast+0x242/0x340
 netlink_sendmsg+0x243/0x480
 sock_sendmsg+0x5e/0x60
 ____sys_sendmsg+0x253/0x290
 ___sys_sendmsg+0x97/0xe0
 ? __lru_cache_add+0x75/0xa0
 __sys_sendmsg+0x81/0xd0
 do_syscall_64+0x49/0x90

If you could find the backtrace of your kernel panic we could compare both.
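Comparing two traces by eye is error-prone, so here is a rough Python sketch that extracts the (symbol, module) pairs from a trace and lists the frames two traces share. It assumes lines shaped like the ones quoted above (symbol+0xOFFSET/0xSIZE, optionally followed by [module]); real dmesg output usually carries timestamps or other prefixes that would need stripping first. Frames marked with "?" are skipped because the kernel flags them as unreliable. All names are made up for this example.

```python
import re

# Matches e.g. " nfqnl_reinject+0x4a/0x70 [nfnetlink_queue]" or "? foo+0x1/0x2".
FRAME = re.compile(
    r"^\s*(?P<guess>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def parse_frames(trace):
    """Return [(symbol, module_or_None), ...] for the reliable frames in a trace."""
    frames = []
    for line in trace.splitlines():
        match = FRAME.match(line)
        if match and not match.group("guess"):
            frames.append((match.group("symbol"), match.group("module")))
    return frames


def common_frames(trace_a, trace_b):
    """Return the frames of trace_a, in order, that also appear in trace_b."""
    seen = set(parse_frames(trace_b))
    return [frame for frame in parse_frames(trace_a) if frame in seen]
```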

Also, if you have the time to identify the problematic kernel command-line option, don't hesitate to update the issue. It's very valuable information.

Either way, I'm afraid I'm not much help here. In my opinion this problem could be a bug in libnetfilter_queue or in newer kernels (>= 5.6.16). We're probably triggering the bug somehow.

deathtrip commented 4 years ago

On Arch Linux libnetfilter_queue was updated to 1.0.5 just a few days ago, while kernel 5.6.16 and newer have been around for a few weeks. So it started happening under libnetfilter_queue 1.0.3, but only after upgrading the kernels. Seems to be a problem with the kernel then.

deathtrip commented 4 years ago

The problematic kernel command-line option seems to be slub_debug=FZP. When I booted with all my options except slab_nomerge, slub_debug=FZP and page_alloc.shuffle=1, everything worked fine. When I added page_alloc.shuffle=1, I got problems after 5-6 hours. Then I also added slab_nomerge, and got the DNS problems after approx. 1 hour. Booting with all three results in no DNS access for me.

This time the UI also started freezing, so I restarted it from the terminal, but got no errors when it froze again. When I get these DNS problems I can't restart the daemon, as it fails with: systemd[1]: opensnitchd.service: Main process exited, code=exited, status=1/FAILURE

gustavo-iniguez-goya commented 4 years ago

Thank you @deathtrip ! I'll try to configure them and debug those DNS and daemon problems.

gustavo-iniguez-goya commented 4 years ago

On Debian, kernel 5.7.0, the cmdline parameters "kaslr pti=on slab_nomerge page_poison=1 slub_debug=FPZ nosmt" cause opensnitch to fail with the following error: Error while creating queue #0: Error binding to queue: operation not permitted

Removing slub_debug=FPZ from the options solves the problem. I'm still trying to figure out how to make it work again with that parameter.

deathtrip commented 4 years ago

Update to the current point release to see if it fixes anything. Also you should check if only one or two of the FZP options is responsible. Wondering if you could reproduce the system freeze i had on Arch's hardened kernel, because it seems there's something else that could be the problem here.
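To test the flags systematically, the boot parameters to try can be enumerated with a small Python sketch like this one (helper names are hypothetical). As far as I know from the kernel's SLUB documentation, F enables sanity checks, Z red zoning and P poisoning; the sketch itself only treats them as letters.

```python
from itertools import combinations


def flag_subsets(flags):
    """Yield every non-empty subset of the flag letters, smallest first."""
    for size in range(1, len(flags) + 1):
        for combo in combinations(flags, size):
            yield "".join(combo)


def boot_parameters(flags="FZP"):
    """Return one slub_debug=... parameter per subset, in testing order."""
    return [f"slub_debug={subset}" for subset in flag_subsets(flags)]


if __name__ == "__main__":
    for parameter in boot_parameters():
        print(parameter)
```

Booting with the single-letter variants first narrows things down fastest: if one letter alone reproduces the failure, the larger combinations don't need to be tested.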

gustavo-iniguez-goya commented 4 years ago

Not the freeze, but a BUG, like the one reported with kernel 5.6.16. There are some bug reports related to this parameter: ath5k kmod, ext4, ibm mmfs16

I started to analyze it with valgrind, and there seem to be some memory leaks. In any case, this parameter is also preventing opensnitch from running, with the "operation not permitted" error, so I'll investigate that first.

gustavo-iniguez-goya commented 4 years ago

The error Error binding to queue: operation not permitted seems to be caused by a queue not closed, or leaked. If you launch the daemon with -queue-num 2 then it'll run as expected.

This only occurs when stopping the daemon with service opensnitch stop. If you stop it by sending it a HUP signal, or by hitting CTRL+c it doesn't occur. Investigating..

gustavo-iniguez-goya commented 4 years ago

ok, so we're not closing the queue on exit.. and for some reason this problem has arisen with kernels >= 5.7.x. I'll fix it soon.

gustavo-iniguez-goya commented 4 years ago

Still investigating this problem. Fortunately for me, the bug is not freezing the PC. The daemon stops processing packets and a trace is dumped to dmesg.

gustavo-iniguez-goya commented 4 years ago

I've filed a bug on the netfilter bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1440

I think this is a problem in their library when allowing packets (ICMP in particular I think).

gustavo-iniguez-goya commented 4 years ago

Pablo Neira posted a patch for this problem, and as far as I can tell it fixes the bug: https://bugzilla.netfilter.org/show_bug.cgi?id=1440#c1

https://github.com/evilsocket/opensnitch/issues/297 https://github.com/safing/portmaster/issues/82

gustavo-iniguez-goya commented 4 years ago

Some users have already confirmed that it's fixed by updating the Arch Linux kernel. Thank you for reporting it!

gustavo-iniguez-goya / opensnitch

Incompatibility with newer kernels #41