deathtrip closed this issue 4 years ago
Can you tell us more about your system? I need at least the following logs and information:
First of all, set debug level to DEBUG.
Please provide all this information, or as much of it as you can gather. You can email it to me if you prefer rather than posting it here. Let's see if we can find the problem, or a way to reproduce it.
I'm running OpenSnitch on Arch w/ 5.7.4-1-ck-skylake kernel and it's working fine (nothing suspicious in logs, process monitor is /proc).
By hardened kernel I mean the linux-hardened package from the repo. libnfnetlink 1.0.1, libnetfilter_queue 1.0.5. Process monitor method is /proc. I tried disabling the sysctl settings but they had no effect, and I couldn't find anything suspicious using journalctl either.
And after further research it looks like DNS requests are causing the crashes/hangups.
I run unbound as my local resolver and I noticed that there were no more requests from the user "unbound" in the UI.
Ping reported "Temporary failure in name resolution".
Disabling unbound didn't solve the problem.
When I booted with opensnitchd disabled and enabled it afterwards, the already-established connections persisted.
I got it working again by removing the following kernel command line options:
slab_nomerge slub_debug=FZP vsyscall=none module.sig_enforce=1 page_alloc.shuffle=1
Any DNS request with these options and running opensnitchd will cause the mentioned symptoms.
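For anyone trying to reproduce this, a quick sketch of a check for which of the suspect hardening options the running kernel was actually booted with (reads the live kernel command line; the option list mirrors the one above):

```shell
# Read the command line the running kernel was booted with.
cmdline=$(cat /proc/cmdline 2>/dev/null || echo unknown)

# Report which of the suspect hardening options are present.
for opt in slab_nomerge slub_debug vsyscall module.sig_enforce page_alloc.shuffle; do
  case " $cmdline" in
    *" $opt"*) echo "$opt: present" ;;
    *)         echo "$opt: absent"  ;;
  esac
done
```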
Here's what opensnitchd.log shows when it couldn't resolve anything:
First it was this:
[2020-06-21 00:04:46] IMP Starting opensnitch-daemon v1.0.0rc10
[2020-06-21 00:04:46] INF Loading rules from /etc/opensnitchd/rules ...
[2020-06-21 00:04:46] WAR Is opnensitchd already running?
[2020-06-21 00:04:46] !!! Error while creating queue #0: Error opening Queue handle: protocol not supported
and then:
[2020-06-21 07:20:59] IMP Starting opensnitch-daemon v1.0.0rc10
[2020-06-21 07:20:59] INF Loading rules from /etc/opensnitchd/rules ...
[2020-06-21 07:20:59] WAR Is opnensitchd already running?
[2020-06-21 07:20:59] !!! Error while creating queue #0: Error binding to queue: operation not permitted
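The first error ("protocol not supported") is typically what you get when the kernel can't provide NFQUEUE support at all, so it may be worth confirming the module is loaded. A minimal check, assuming the mainline module name nfnetlink_queue:

```shell
# Look for the NFQUEUE kernel module among loaded modules.
# Falls back to a message if lsmod is unavailable or the module is absent.
mods=$(lsmod 2>/dev/null | grep nfnetlink_queue || echo "nfnetlink_queue not listed")
echo "$mods"
```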
So it's one (or more) of the kernel command line options. I haven't had time yet to find out which one it is. The linux-hardened package also enforces PTI, so that may have something to do with the system freezing under it, as I don't use PTI under the regular kernel. Let's see if anyone can reproduce it now.
Thank you very much for the information @deathtrip!
You're using libnetfilter_queue 1.0.5, and there have been a lot of changes from 1.0.3 to 1.0.5, I'm wondering if the problem also reproduces using libnetfilter_queue 1.0.3.
A few days ago someone on the original repo also reported a kernel panic using kernel 5.6.16, with this backtrace:
? nfqnl_reinject+0x4a/0x70 [nfnetlink_queue]
nfqnl_reinject+0x4a/0x70 [nfnetlink_queue]
nfqnl_recv_verdict+0x30d/0x500 [nfnetlink_queue]
nfnetlink_rcv_msg+0x166/0x2e0 [nfnetlink]
? nfnetlink_net_exit_batch+0x60/0x60 [nfnetlink]
netlink_rcv_skb+0x75/0x140
netlink_unicast+0x242/0x340
netlink_sendmsg+0x243/0x480
sock_sendmsg+0x5e/0x60
____sys_sendmsg+0x253/0x290
___sys_sendmsg+0x97/0xe0
? __lru_cache_add+0x75/0xa0
__sys_sendmsg+0x81/0xd0
do_syscall_64+0x49/0x90
If you could find the backtrace of your kernel panic we could compare both.
Also, if you have the time to identify the problematic kernel command line option, don't hesitate to update the issue. That would be very valuable information.
Either way, I'm afraid I'm not much help here. In my opinion this could be a bug in libnetfilter_queue or in newer kernels (>= 5.6.16). Probably we're triggering the bug somehow.
On Arch Linux libnetfilter_queue was updated to 1.0.5 just a few days ago, while kernel 5.6.16 and newer have been around for a few weeks. So it started happening under libnetfilter_queue 1.0.3, but only after upgrading the kernels. Seems to be a problem with the kernel then.
The problematic kernel command-line option seems to be slub_debug=FZP. When I booted with all my options except slab_nomerge, slub_debug=FZP and page_alloc.shuffle=1, everything worked fine. When I added page_alloc.shuffle=1 I got problems after 5-6 hours. Then I also added slab_nomerge, and got the DNS problems after approx. 1 hour. Booting with all three results in no DNS access for me.
This time the UI also started freezing, so I restarted it from the terminal, but got no errors when it started freezing again.
When I get these DNS problems I can't restart the daemon, as it fails with: systemd[1]: opensnitchd.service: Main process exited, code=exited, status=1/FAILURE
Thank you @deathtrip ! I'll try to configure them and debug those DNS and daemon problems.
On Debian, kernel 5.7.0, these cmdline parameters "kaslr pti=on slab_nomerge page_poison=1 slub_debug=FPZ nosmt" cause opensnitch to fail with the following error:
Error while creating queue #0: Error binding to queue: operation not permitted
Removing slub_debug=FPZ from the options solves the problem. I'm still trying to figure out how to make it work again with that parameter.
Update to the current point release to see if it fixes anything. Also, you should check whether only one or two of the F, Z, P options are responsible. I wonder if you can reproduce the system freeze I had on Arch's hardened kernel, because it seems there's something else that could be the problem here.
Not the freeze, but a BUG, like the one reported with kernel 5.6.16. There are some bug reports related to this parameter (ath5k, kmod, ext4, IBM mmfs).
I started to analyze it with valgrind, and it seems there are some memory leaks. In any case, this parameter is also preventing opensnitch from running with the "Operation not permitted" error, so I'll investigate that first.
The error "Error binding to queue: operation not permitted" seems to be caused by a queue that wasn't closed, i.e. leaked. If you launch the daemon with -queue-num 2, it'll run as expected.
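If you want that workaround to survive restarts, a systemd drop-in can add the flag persistently. This is a sketch: the unit name opensnitchd.service matches the failure message above, but the binary path /usr/bin/opensnitchd is an assumption, so adjust it to your system. Place it at /etc/systemd/system/opensnitchd.service.d/queue-num.conf:

```ini
[Service]
# Clear the packaged ExecStart, then relaunch with the alternate queue number.
ExecStart=
ExecStart=/usr/bin/opensnitchd -queue-num 2
```

Then run systemctl daemon-reload and restart the service.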
This only occurs when stopping the daemon with service opensnitch stop. If you stop it by sending it a HUP signal, or by hitting Ctrl+C, it doesn't occur. Investigating...
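A toy illustration of why the stop method can matter (this is a generic sketch, not OpenSnitch's actual signal handling): a process that traps HUP runs its cleanup, while a plain SIGTERM, which "service ... stop" sends by default, skips cleanup when no TERM handler is installed:

```shell
# Subshell traps HUP and runs its "cleanup" before exiting.
hup_out=$(sh -c 'trap "echo queue closed" HUP; kill -HUP $$')

# Subshell has no TERM trap, so it dies without any cleanup output.
term_out=$(sh -c 'kill -TERM $$') || true

echo "on HUP:  ${hup_out:-no cleanup ran}"
echo "on TERM: ${term_out:-no cleanup ran}"
```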
OK, so we're not closing the queue on exit, and for some reason this problem has arisen with kernels >= 5.7.x. I'll fix it soon.
Still investigating this problem. Fortunately for me, the bug isn't freezing the PC. The daemon stops processing packets and a trace is dumped to dmesg.
I've filed a bug on the netfilter bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1440
I think this is a problem in their library when allowing packets (ICMP in particular I think).
Pablo Neira posted a patch for this problem, and as far as I can tell it fixes the bug: https://bugzilla.netfilter.org/show_bug.cgi?id=1440#c1
https://github.com/evilsocket/opensnitch/issues/297 https://github.com/safing/portmaster/issues/82
Some users have already confirmed that it's fixed by updating the Arch Linux kernel. Thank you for reporting it!
Newer kernel versions have broken opensnitch for me.
I used Arch Linux with the hardened kernel, and everything was fine until version 5.6.16 iirc. Since that version, every attempt by any program to access the network causes the entire system to freeze, even simple things like ping or starting Chromium. But the vanilla 5.6 kernel still worked fine.
When I updated the vanilla kernel to 5.7.2, all network requests were blocked (ping, DNS, etc.). Disabling the opensnitchd service solved both the crashing on the hardened kernel and restored network access on vanilla. Since 5.7.4, even stopping the opensnitchd service won't restore network access, and I have to reboot to get it back.
I wonder if we have some people on new kernels, who can check it out.