[pbr] issue: Switch from uci <command> to uci_<command> causes segfault

yitzhaq commented 11 months ago

Describe the bug Since commit ed4d51a671f0a91b392eb27a9c006227bd6ec893, pbr is now hitting a segfault when processing a particular policy, always failing at the same policy. Rolling back to b9267137a15201f38c1063a8c5626012ca91ba0e makes things work again.

I'd rather not paste every last bit of config here (my pbr setup is rather extensive), but I can do something targeted if you say what would be helpful.

FWIW the entry it segfaults on has these options:

name
interface
src_addr
dest_addr

Where interface is a WireGuard tunnel interface (named wg2), src_addr consists of several individual IPv4 addresses (without CIDR notation), and dest_addr is a fairly long list of CIDR-notated subnets. The entry before this shares all these basic characteristics, but all the values of course differ, so I can't obviously see what it is about the entry in question that triggers the segfault, though it is consistent.

stangri commented 11 months ago

@yitzhaq nothing about policies handling was changed in that commit, so I don't have a clue how that resulted in segfaults. Maybe faulty RAM/ROM and the update was rolled/started using the bad blocks?

Which OpenWrt version, device, pbr mode (iptables, nft, nft file) are you using and what exact message are you getting?

yitzhaq commented 11 months ago

OpenWRT 22.03.3, Protectli Vault FW6C (x86_64), pbr-iptables (for dnsmasq.ipset with domain-based policies), exact message (with verbosity 2) is:

Routing '<policy name>' via wg2 zsh: segmentation fault /etc/init.d/pbr start

I've tried using ash instead of zsh just in case, and tried several reboots. It always fails on this particular policy, starting from the commit mentioned. Doubt it's faulty ROM (it's an SSD), as I gradually rolled back commits until I found the last one where it still worked.

# ubus call system board
{
    "kernel": "5.10.161",
    "hostname": "router",
    "system": "Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz",
    "model": "Protectli FW6",
    "board_name": "protectli-fw6",
    "rootfs_type": "ext4",
    "release": {
        "distribution": "OpenWrt",
        "version": "22.03.3",
        "revision": "r20028-43d71ad93e",
        "target": "x86/64",
        "description": "OpenWrt 22.03.3 r20028-43d71ad93e"
    }
}

stangri commented 11 months ago

If you can add debug output in various places of the init script it might give a better idea what's causing this. My bet would be on the actual iptables call, I doubt the shell script itself causes the segfault.

yitzhaq commented 11 months ago

I suspect the same, although the script does terminate, which I suppose is not too surprising. Would probably need to figure out which command is causing it, I guess. Happy to add debug output, although I might need a bit of hand-holding to determine where and how.

I am planning to upgrade to 23.05 in a few weeks (for the newer dnsmasq), and the older commit is still working fine. So if this is something obscure which is only hitting me, we can hold off on digging any further until after I've switched over to that, in case fw4 and nft sorts things out. I'll try to remember to update this issue accordingly once that's done.

stangri commented 10 months ago

If you're using nft I can try to see if I can reproduce this if you post the policy it stumbles on.

yitzhaq commented 10 months ago

After upgrading to 23.05, and switching to nft, it no longer segfaults, but the same rule is now extremely slow to process. Several other rules are too, and what they all have in common is a rather long list of dest_addr. The one that triggered the segfault just happened to be the first in the list with this characteristic.

I don't particularly feel like posting the full list of subnets in dest_addr publicly, but I'd be happy to share it privately in whatever way you might prefer. Though my guess is that just adding a long list of CIDRs should be enough to reproduce - netmask -c can probably spit out something useful to work with.

stangri commented 10 months ago

@yitzhaq I'd definitely like to have a look at the config file. I've published a public key (https://dev.melmac.net/repo/dev.melmac.net.pub) which you can use to encrypt the config file with openssl and then I'd be able to decrypt it.

Sadly, it's a known issue that nft is very slow adding long lists of elements to the nft sets. The newer version of pbr (ones in my repo) use the nft atomic file by default, it may work faster in your case.

yitzhaq commented 10 months ago

Tested b926713 as well, without any noticeable difference - with nft it's just as slow as the latest version. So these are likely two different issues, though both affecting the entries with long lists.

stangri commented 10 months ago

Even the new version can fallback to individual nft calls instead of the using the nft_file_mode, but I believe you if you're saying that the nft_file_mode does not provide any significant performance improvements on adding large numbers of elements to the set.

yitzhaq commented 10 months ago

Actually, I just didn't notice your response as it arrived while I was typing out my last one. My original statement was about 1.1.3-25. I can't say I've done any exact timing comparisons between that and b926713, but they certainly both feel slow. But at least they don't break like recent commits started doing with iptables, which for me is the main thing.

Lemme figure out the right command line to throw at openssl, and I'll get something your way.

yitzhaq commented 10 months ago

It's getting late here, and the sources I can easily dig up seem slightly out of date (they all assume RSA rather than Ed25519). Could you give me a nudge in the right direction?

stangri commented 10 months ago

It's getting late here, and the sources I can easily dig up seem slightly out of date (they all assume RSA rather than Ed25519). Could you give me a nudge in the right direction?

I'll regenerate the key and let you know.

stangri commented 10 months ago

@yitzhaq added a paragraph/commands for encryption at https://docs.openwrt.melmac.net/pbr/#GettingHelp

yitzhaq commented 10 months ago

Thanks for that! Attached is a file containing two nearly identical rules in their config order - from memory I'm pretty sure it segfaults on the first. I had to zip the encrypted file for it to be accepted as an attachment here, even though that grew the filesize. Hopefully just the rule(s) should be enough to reproduce, but if not, let me know, and I can supply more config context. rules.zip

On a possibly (un)related note, I'm now also having big problems with dnsmasq since upgrading and switching to fw4/nft. The same set of rules which worked perfectly fine in 22.03 with fw3 and iptables, are causing dnsmasq to essentially stop responding to DNS queries. With pbr stopped it responds instantly (as one would expect), during loading of pbr rules it grows increasingly slow, and after pbr is done loading rules (which as noted takes a while, allowing plenty of time to test along the way) it nearly consistently times out queries. Switching to TCP can sometimes get the odd query through, but over UDP it becomes effectively useless. Big issue, as I'm sure you can imagine.

Any idea what might be causing this, and whether there are any dnsmasq options which can be tweaked to get things back in working order?

pbr config is identical as it was with fw3/iptables where everything worked hunky dory, only change is the value of resolver_set. Given what I'm observing, I suspect the slowness and the dnsmasq problems might be related.

yitzhaq commented 10 months ago

Oh, and the keyfile. keyfile.zip

yitzhaq commented 10 months ago

So after more testing and trying to narrow this down, this must somehow be related. The rules that cause dnsmasq to stop responding are the same ones that would cause iptables to segfault, and the ones which all have dest_addr set as in those attached above. Disable those rules, and dnsmasq works as expected again with 1.1.3-25. Roll back to b926713, and dnsmasq behaves as normal again, even with all rules active.

So clearly something happened in ed4d51a which is having various detrimental effects on rules with the dest_addr line seen in the attachment above. The symptoms are different with iptables and nftables, but stuff still breaks.

Possibly related errors logged:

Jan 17 13:12:55 router dnsmasq[1]: nftset inet fw4 pbr_wg0_6_dst_ip_cfg076ff5 Error: No such file or directory
Jan 17 13:12:55 router dnsmasq[1]: nftset inet fw4 pbr_wg0_4_dst_ip_cfg076ff5 Error: No such file or directory

yitzhaq commented 10 months ago

Ugh. Unfortunately I spoke too soon. Sorry for walls of text with back and forth - I really thought I had nailed it now..

So turns out dnsmasq still fails with b926713 too. It does however return cached queries, but fails to respond to any lookups that can't be answered from cache. It will respond to queries immediately after starting, but fail shortly after.

So, this will succeed:

/etc/init.d/dnsmasq restart ; dig @127.0.0.1 google.com

As well as subsequent queries for google.com (at least until TTL expires). But this fails:

# dig +short @127.0.0.1 google.ca
;; communications error to 127.0.0.1#53: timed out
;; communications error to 127.0.0.1#53: timed out
;; communications error to 127.0.0.1#53: timed out
;; no servers could be reached

The same nftset errors as above are logged when stopping pbr.

I'm starting to run out of ideas beyond downgrading until this can be figured out, so if you have any ideas at all, I'm all ears.

stangri commented 10 months ago

Attached is a file containing two nearly identical rules in their config order - from memory I'm pretty sure it segfaults on the first.

Can you post the full pbr config please, as well as the output from service pbr status?

yitzhaq commented 9 months ago

Sorry for the delay, I ran out of time to spend on this in the previous round. For now I've stayed on 23.05, but downgraded dnsmasq-full to 2.86 to get back iptables support, which brought things back to working order again. This really shouldn't be necessary, but I understand the sudden and unnecessary removal of iptables support is a controversy all on its own.

FWIW I can no longer reproduce the original segfault under these conditions, so perhaps that's a 22.03 thing. Running latest pbr version now.

Here's what you asked for. Please don't judge some of these rules too harshly, particularly the ones that invert the IPv4 address space. I know it's dumb, but after countless hours of trial and error, and several rounds of rewrites to try to do things differently, the inherent limitations of negating rules keep leading me back to this approach to accomplish what I need, in a way that is manageable. It may very well be the source of some of my issues, but for the most part it has worked better than one might expect, and keeps things less complex than they otherwise always seem to end up. Somewhat of a time-saver, as it were.

That said, clever suggestions that I may have missed are of course welcome.

stangri commented 9 months ago

The zip-file seems to be missing the encrypted key-file.

yitzhaq commented 9 months ago

The zip-file seems to be missing the encrypted key-file.

It should be this one.

stangri commented 9 months ago

Got it. And is the problem the dnsmasq crashing?

yitzhaq commented 9 months ago

The problem is not that dnsmasq crashes, but that it stops responding to (new) queries almost immediately after starting. It will respond with anything that's already in its cache, but ask for anything that's not and the query will time out.

This only happens with nft. With dnsmasq 2.86 and resolver_set set to dnsmasq.ipset, and otherwise the exact same config (pbr, dnsmasq, everything), it all works just fine.

It's triggered by the rules which invert the IPv4 address space (any rule which starts with 0.0.0.0/5) - comment those out, and it seems to behave normally.

stangri commented 9 months ago

Do you use https-dns-proxy? I was able to reproduce the issue in nft mode (with nft_file_support off), but after I had stopped/started https-dns-proxy I was able to resolve anything.

I don't understand why it's happening, but at least on my test OpenWrt, one of the rules you're inserting seems to affect the routing for the existing instances of https-dns-proxy.

yitzhaq commented 9 months ago

I'm not using https-dns-proxy, but I am using dnscrypt-proxy and stubby (on the router), among a number of other resolvers spread throughout the network. The router is also running Unbound and BIND. Long story.

I didn't test them all individually in detail with the issue present, but queries not explicitly forwarded through dnsmasq (which would be most of them not covered by any of the domain-based dest rules in pbr) continued working fine - though it could be they ended up taking a path not involving DoH/DoT. Given the number of possible paths queries may take, I'd have to dig a bit deeper to figure out where they actually go when the dnsmasq issue is present.

Strange though that this should happen with nft but not iptables, with the exact same rules. Depending on your upstream resolver(s) in https-dns-proxy I wouldn't be surprised if one of those rules changes its routing, but I also wouldn't expect that to make it stop working.

stangri commented 9 months ago

It works after restart of the proxy tho.

yitzhaq commented 9 months ago

I guess I'll need to do more testing to see if it's one of the local upstream resolvers dnsmasq forwards its queries to that is in fact failing, then. Though IIRC dnsmasq's algo to pick forwarders when it has a bunch to choose from is to assign queries semi-randomly, unlike certain others (Unbound and dnscrypt-proxy (with the right config) come to mind) which I believe will see which responds fastest, and then send most if not all queries to that until the next interval. Either way, dnsmasq has a bunch of resolvers to choose from, each with a slightly different set of upstreams, so I would expect it to find at least one path to get a response, rather than timing out queries so consistently.

Sigh. Anyway, not exactly expecting you to try to reproduce any part of this resolver setup, so I'll just do more testing when time allows, and see if I can get any answers. Might be a while again, though. Let me know if you suddenly think of something, especially theories as to why this might be happening (with nft only).

stangri / source.openwrt.melmac.net

[pbr] issue: Switch from uci <command> to uci_<command> causes segfault #192