tusc / wireguard-kmod

WireGuard for UDM series routers
https://www.wireguard.com/
352 stars 19 forks source link

Kernel crash WireGuard + WiFiman [Internal error: Oops - BUG: 0 [#1] SMP] #43

Closed bluewalk closed 2 years ago

bluewalk commented 2 years ago

Today I've noticed that the UDM Pro reboots when executing a WiFiman speedtest, on the upload test.

Some background: I have a tunnel setup to a VPN provider on interface vpn0 that forwards all traffic from a specific VLAN over that tunnen (eg a "vpn proxied wifi network". When running a WiFiman speedtest (and only a WiFiman speedtest) the kernel panics and cashes, below the results from a watch dmesg.

When running a speedtest with speedtest.net all is well. Running a WiFiman speedtest on the "non tunneled" VLAN all is well too.

This is 100% reproducable on my UDM Pro. Running v1.11.0, 4.19.152-al-linux-v10.2.0-v1.11.0.3921-f2e3fac (builder@link-xenial-builder) (gcc version 6.4.0 (Buildroot 0.9.0.0)) #1 SMP Tue Dec 21 08:09:26 UTC 2021 so it should use the builtin wireguard module, so I'm not sure if there's anything you can do about that?

[   61.429430] systemd-journald[28]: Received request to flush runtime journal from PID 1
[   63.858291] wireguard: WireGuard 1.0.20210124 loaded. See www.wireguard.com for information.
[   63.858295] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[  291.132162] wireguard: vpn0: possible loop detected, dropping skb of size 65216
[  291.132227] Virtual device vpn0 asks to queue packet!
[  291.132235] ------------[ cut here ]------------
[  291.132238] kernel BUG at mm/slub.c:3950!
[  291.132242] Internal error: Oops - BUG: 0 [#1] SMP
[  291.137067] Modules linked in: wireguard ip6_udp_tunnel udp_tunnel xt_conntrack nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp nf_nat_h323 nf_conntrack_h323 nf_nat_proto_gre nf_conntrack_proto_gre nf_nat_ftp nf_conntrack_ftp nf_app(PO) t_miner(PO) tdts(PO) xt_dpi(O) ifb sch_htb ppp_generic slhc ebt_set ebt_arp ebt_ip ebt_pkttype ebtable_nat ebtables lzo lzo_compress zram gpiodev(PO) ubnthal(PO) ubnt_common(PO)
[  291.174653] Process ksoftirqd/1 (pid: 16, stack limit = 0x0000000022a2a958)
[  291.181641] CPU: 1 PID: 16 Comm: ksoftirqd/1 Tainted: P           O      4.19.152-al-linux-v10.2.0-v1.11.0.3921-f2e3fac #1
[  291.192735] Hardware name: Annapurna Labs Alpine V2 UBNT (DT)
[  291.198534] pstate: 40000005 (nZcv daif -PAN -UAO)
[  291.203351] pc : kfree+0xd8/0x120
[  291.206718] lr : skb_free_head+0x18/0x30
[  291.210663] sp : ffffff8008d133c0
[  291.214002] x29: ffffff8008d133c0 x28: ffffffc09a704628 
[  291.219363] x27: 0000000000000014 x26: 0000000000000034 
[  291.224697] x25: ffffffc066b7fec0 x24: ffffffc2379cb200 
[  291.230058] x23: 0000000000000001 x22: ffffffc235b86000 
[  291.235390] x21: ffffffc09a704600 x20: ffffffc066b7fec0 
[  291.240750] x19: 0000000000000000 x18: ffffff8008bc9d88 
[  291.246084] x17: 0000000000000001 x16: 0000000000000000 
[  291.251443] x15: 0000000000020000 x14: 00000000fffffff0 
[  291.256778] x13: ffffff8008be01c8 x12: 0000000000000000 
[  291.262113] x11: 0000000000000289 x10: 0000000000000002 
[  291.267478] x9 : 0000000000000000 x8 : ffffff8008c741a8 
[  291.272813] x7 : ffffffc066b70000 x6 : ffffffbf019adc00 
[  291.278184] x5 : ffffff80086659c8 x4 : ffffffc23ff8c868 
[  291.283519] x3 : ffffffc23ff8c868 x2 : 0000000000000000 
[  291.288879] x1 : 0000000000000000 x0 : ffffffbf01916008 
[  291.294213] Call trace:
[  291.296686]  kfree+0xd8/0x120
[  291.299706]  skb_free_head+0x18/0x30
[  291.303305]  skb_release_data+0x128/0x150
[  291.307336]  skb_release_all+0x24/0x30
[  291.311136]  kfree_skb+0x2c/0x60
[  291.314389]  kfree_skb_list+0x18/0x30
[  291.318076]  __dev_queue_xmit+0x3b0/0x890
[  291.322141]  dev_queue_xmit+0x10/0x20
[  291.325827]  neigh_direct_output+0x10/0x20
[  291.329947]  ip_finish_output2+0x1f0/0x3d0
[  291.334095]  ip_finish_output+0x170/0x220
client_loop: send disconnect: Broken pipe

If you need any more info, please let me know :-)

peacey commented 2 years ago

Hi @bluewalk,

Thanks for the report. I haven't been able to reproduce this on my end on 1.11.0 stable (with neither IPv6 nor IPv4 VPN servers). Can you give more details about your setup?

  1. How is the WiFiman device connected?
  2. Can you show me the wireguard config without the keys?
  3. You said you're forcing a whole VLAN, how are you doing that? Can you show me the configuration for that? You can upload the configs as .txt files to your post.
  4. Have you tried different VPN servers or providers and see if the problem persists?

Yes you're right the module is built in, so there's nothing we can do about that (unless we compile the newer version but then we need the new kernel sources). However, this line in your dmesg makes me think there is a routing loop that we could maybe fix?

[  291.132162] wireguard: vpn0: possible loop detected, dropping skb of size 65216

Never seen this myself, so let's see what your configuration looks like.

Thanks!

bluewalk commented 2 years ago

Hi @peacey,

No worries, must be something specific to my setup, WiFiman and 1.11.0 as I didn't have these issues with 1.10.x.

  1. How is the WiFiman device connected? The WiFiman device is connected via WiFi, 802.11ax to an U6-LR.
  2. Can you show me the wireguard config without the keys?
    
    [Interface]
    Address = 10.5.0.2/32
    PrivateKey = xxx
    Table = off
    SaveConfig = false

PostUp = /etc/wireguard/scripts/vpn.sh up PostDown = /etc/wireguard/scripts/vpn.sh down

[Peer]

NordVPN

PublicKey = xxxx AllowedIPs = 0.0.0.0/0 Endpoint = nl956.nordvpn.com:51820


3. _You said you're forcing a whole VLAN, how are you doing that? Can you show me the configuration for that? You can upload the configs as .txt files to your post._
I am using a script I wrote: https://gist.github.com/bluewalk/cd3ea3dc23c83a0cd0811652a75ba4cf
It basically marks traffic from that VLAN, exclude private addresses and netflix ASN and then NAT's it over the tunnel interface via an ip rule

4. _Have you tried different VPN servers or providers and see if the problem persists?_
I have not yet tried different providers but did try different servers from NordVPN.
Once the WiFiman upload test starts, the UDMP reboots. So it must be WiFiman app+unifi network application related too (running the app, not the website), as with the speedtest.net app on that same device (iPhone and iPad) do not cause any issues. Neither does wifiman.com.

Running WiFiman app version 0.15.2 and Unifi Network application 7.0.15
peacey commented 2 years ago

Thanks @bluewalk. When I previously tested, I used my split-vpn script for VPN routing. I now tested with your provided vpn.sh from your gist, but still was not able to replicate the issue. My UDMP is not crashing with the upload test from WiFiman even with your vpn.sh. The only difference is I am using mullvad instead of Nord.

One thing I noticed in your script is you are forcing both br100 and wg0 through the tunnel (vpn0). Is wg0 your wireguard server on the UDMP? Is your WiFiman device connected to both br100 and wg0 at the same time when you run the test? Do you still have the issue if you don't force wg0 through (make sure rules are deleted too), or if you bring down wg0 completely?

If nothing works, could you set up your NordVPN with the split-vpn script, and see if you still have the same issue? My script adds explicit routes for the VPN endpoint so maybe that would help with the routing loop? Just thinking of things that could help you debug this.

I would also do a tcpdump on the WAN interface and the wireguard vpn0 interface when you're doing the upload test to observe if there are any routing loops.

bluewalk commented 2 years ago

One thing I noticed in your script is you are forcing both br100 and wg0 through the tunnel (vpn0). Is wg0 your wireguard server on the UDMP?

Yes it’s a WireGuard server for connecting to another site (parents) and my phone when not home. It’s internet is also routed through so its “secured”.

Is your WiFiman device connected to both br100 and wg0 at the same time when you run the test?

No it’s not. When using 4G + VPN to home (wg0) internet traffic goes through the same vpn0 tunnel. Running WiFiman then causes no issues (as it can’t discover the UDMP I guess).

Running the WiFiman test to the UDMP directly and not internet (that’s an option too), the test succeeds.

Do you still have the issue if you don't force wg0 through (make sure rules are deleted too), or if you bring down wg0 completely?

Will test tomorrow to be sure but last time it made no difference.

If nothing works, could you set up your NordVPN with the split-vpn script, and see if you still have the same issue? My script adds explicit routes for the VPN endpoint so maybe that would help with the routing loop? Just thinking of things that could help you debug this.

Will definitely give that a shot tomorrow as well

I would also do a tcpdump on the WAN interface and the wireguard vpn0 interface when you're doing the upload test to observe if there are any routing loops.

Will do :-)

Quite odd actually that only WiFiman app traffic is causing this. All the other traffic through the tunnel does not cause any issues. I am basically using it for all my traffic. I know there’s some functionality in WiFiman to talk to the UDMP, maybe that’s causing issues? Like I said, all other upload traffic is dandy.

instant update I have just rerun the test and it somehow succeeded the last couple of times.

bluewalk commented 2 years ago

I have tried again today and now it's not crashing anymore, neither are there anymore "possible loops" detected. Maybe something was up at NordVPN with some broken packages causing WireGuard to choke itself?

bluewalk commented 2 years ago

Today I tried it again, the UDMP crashed upon the upload test.

Therefore I was suspecting NordVPN so I dusted off my Mullvad account, added a month and tried that. Unfortunately the crashing also occurs when using Mullvad.

I do see some more errors upon boot regarding eth9, the SFP module for my fiber connection:

[   61.412191] al_eth 0000:00:00.0 eth9: ETH(0)-MAC 1G TX Data FIFO SRAM Parity Error: 1
[   61.412697] al_eth 0000:00:00.0 eth9: ETH(0)-MAC 1G TX Data FIFO SRAM Parity Error: 2
[   61.413450] al_eth 0000:00:00.0 eth9: ETH(0)-MAC 1G TX Data FIFO SRAM Parity Error: 3
[   61.413958] al_eth 0000:00:00.0 eth9: ETH(0)-MAC 1G TX Data FIFO SRAM Parity Error: 4
[   61.414194] al_eth 0000:00:00.0 eth9: ETH(0)-MAC 1G TX Data FIFO SRAM Parity Error: 5
[   61.414447] al_eth 0000:00:00.0 eth9: ETH(0)-MAC 1G TX Data FIFO SRAM Parity Error: 6
[   61.416004] al_eth 0000:00:00.0 eth9: ETH(0)-MAC 1G TX Data FIFO SRAM Parity Error: 7
[   61.417514] al_eth 0000:00:00.0 eth9: ETH(0)-MAC 1G TX Data FIFO SRAM Parity Error: 8
[   61.417764] al_eth 0000:00:00.0 eth9: ETH(0)-MAC 1G TX Data FIFO SRAM Parity Error: 9
[   61.418281] al_eth 0000:00:00.0 eth9: ETH(0)-MAC 1G TX Data FIFO SRAM Parity Error: 10
[   61.419292] al_eth 0000:00:00.0 eth9: ETH(0)-MAC 1G TX Data FIFO SRAM Parity Error: 11

There only appear upon a boot after the crash. Maybe this is related? Still weird it only occurs when using the WiFiman app on my phone and not when using speedtest or other sites. It must be WiFiman related.

peacey commented 2 years ago

Hi @bluewalk,

Unfortunately I'm still not able to reproduce this on my end. Can you try a few things (not in any particular order...)?

  1. Can you test with split-vpn as it adds explicit routes for the VPN gateway and it might help with routing loops?
  2. Can you do a tcpdump on your eth9 and wireguard interface to see what's happening on the interface when it crashes? (We can see if there actually loops that are causing this or it's something else)
  3. Can you try to use the eth8 Rj45 port temporarily for your Internet, and test if the problem still happens with it?
  4. If the problem doesn't happen with RJ45 port, do you have any other SFP adapters you could try instead?
bluewalk commented 2 years ago

1. Can you test with split-vpn as it adds explicit routes for the VPN gateway and it might help with routing loops?

I have just tried this, still popping up with possible loops, but it didn't crash this time. So I tried again with my own script, also didn't crash this time, It appears the "possible loop detected" warning only pops up when using WiFiman.

2. Can you do a tcpdump on your eth9 and wireguard interface to see what's happening on the interface when it crashes? (We can see if there actually loops that are causing this or it's something else)

Can I send those to you somehow?

3. Can you try to use the eth8 Rj45 port temporarily for your Internet, and test if the problem still happens with it? I do not have a media converter lying around to convert my fiber to ethernet

4. If the problem doesn't happen with RJ45 port, do you have any other SFP adapters you could try instead? See above

peacey commented 2 years ago

You can upload the tcpdumps as .txt files to your post or put them on pastebin.com. But just remove your public IP if it is in the dump.

Do you have any SFP+ switches that have SFP+ and Ethernet ports you could use? If it's a hardware issue, this might be the best way to confirm it.

bluewalk commented 2 years ago

You can upload the tcpdumps as .txt files to your post or put them on pastebin.com. But just remove your public IP if it is in the dump. I had created wireshark captures, 1Gb each. So I went ahead and created a new capture with just text: https://file.io/4IcoqUlDUMmg

Do you have any SFP+ switches that have SFP+ and Ethernet ports you could use? If it's a hardware issue, this might be the best way to confirm it. Not at the moment, but quite sure it isn't a hardware issue as everything works fine besides WiFiman.

bluewalk commented 2 years ago

Today, while nobody was home, the UDM crashed randomly a couple of times. According to the kernel crash log it's wireguard so it's safe to say it doesn't really have to

<4>[   60.422637] ------------[ cut here ]------------
<2>[   60.422640] kernel BUG at mm/slub.c:3950!
<0>[   60.422643] Internal error: Oops - BUG: 0 [#1] SMP
<4>[   60.427464] Modules linked in: wireguard ip6_udp_tunnel udp_tunnel xt_conntrack nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp nf_nat_h323 nf_conntrack_h323 nf_nat_proto_gre nf_conntrack_proto_gre nf_nat_ftp nf_conntrack_ftp nf_app(PO) t_miner(PO) tdts(PO) xt_dpi(O) ifb sch_htb ppp_generic slhc ebt_set ebt_arp ebt_ip ebt_pkttype ebtable_nat ebtables lzo lzo_compress zram gpiodev(PO) ubnthal(PO) ubnt_common(PO)
<0>[   60.465017] Process kworker/0:2 (pid: 102, stack limit = 0x000000009dac2da1)
<4>[   60.472089] CPU: 0 PID: 102 Comm: kworker/0:2 Tainted: P           O      4.19.152-al-linux-v10.2.0-v1.11.0.3921-f2e3fac #1
<4>[   60.483260] Hardware name: Annapurna Labs Alpine V2 UBNT (DT)
<4>[   60.489061] Workqueue: wg-crypt-vpn0 wg_packet_decrypt_worker [wireguard]
<4>[   60.495897] pstate: 40000005 (nZcv daif -PAN -UAO)
<4>[   60.500711] pc : kfree+0xd8/0x120
<4>[   60.504052] lr : skb_free_head+0x18/0x30
<4>[   60.508022] sp : ffffff80099eb5f0
<4>[   60.511359] x29: ffffff80099eb5f0 x28: ffffffc224d64628 
<4>[   60.516694] x27: 0000000000000014 x26: 0000000000000028 
<4>[   60.522056] x25: ffffffc0babbfec0 x24: ffffffc232545000 
<4>[   60.527388] x23: 0000000000000000 x22: ffffffc216701000 
<4>[   60.532748] x21: ffffffc224d64600 x20: ffffffc0babbfec0 
<4>[   60.538082] x19: 0000000000000000 x18: ffffff8008bc9d88 
<4>[   60.543441] x17: 0000000000000001 x16: 0000000000000000 
<4>[   60.548774] x15: 0000000000020000 x14: 00000000fffffff0 
<4>[   60.554133] x13: ffffff8008be01c8 x12: 0000000000000000 
<4>[   60.559466] x11: 0000000000000287 x10: 0000000000000002 
<4>[   60.564825] x9 : 0000000000000000 x8 : ffffff8008c741a8 
<4>[   60.570158] x7 : ffffffc0babb0000 x6 : ffffffbf02eaec00 
<4>[   60.575491] x5 : ffffff80086659c8 x4 : ffffffc23ff77868 
<4>[   60.580852] x3 : ffffffc23ff77868 x2 : 0000000000000000 
<4>[   60.586184] x1 : 0000000000000000 x0 : ffffff8008c52120 
<4>[   60.591544] Call trace:
<4>[   60.594015]  kfree+0xd8/0x120
<4>[   60.597006]  skb_free_head+0x18/0x30
<4>[   60.600631]  skb_release_data+0x128/0x150
<4>[   60.604664]  skb_release_all+0x24/0x30
<4>[   60.608436]  kfree_skb+0x2c/0x60
<4>[   60.611714]  kfree_skb_list+0x18/0x30
<4>[   60.615400]  __dev_queue_xmit+0x3b0/0x890
<4>[   60.619433]  dev_queue_xmit+0x10/0x20
<4>[   60.623146]  neigh_direct_output+0x10/0x20
<4>[   60.627267]  ip_finish_output2+0x1f0/0x3d0
<4>[   60.631389]  ip_finish_output+0x170/0x220
<4>[   60.635449]  ip_output+0x110/0x120
<4>[   60.638874]  ip_forward_finish+0x64/0x90
<4>[   60.642819]  ip_forward+0x3f0/0x490
<4>[   60.646359]  ip_rcv_finish+0x3c/0x50
<4>[   60.649955]  ip_rcv+0xb8/0xd0
<4>[   60.652948]  __netif_receive_skb_one_core+0x54/0x80
<4>[   60.657875]  __netif_receive_skb+0x2c/0x70
<4>[   60.661995]  netif_receive_skb_internal+0x58/0xe0
<4>[   60.666720]  napi_gro_complete+0xd0/0x100
<4>[   60.670778]  napi_gro_flush+0x98/0x100
<4>[   60.674551]  napi_complete_done+0x40/0x110
<4>[   60.678673]  wg_packet_rx_poll+0x494/0x880 [wireguard]
<4>[   60.683860]  net_rx_action+0x224/0x310
<4>[   60.687632]  __do_softirq+0xf4/0x1f8
<4>[   60.691259]  do_softirq.part.2+0x4c/0x50
<4>[   60.695205]  __local_bh_enable_ip+0x94/0xa0
<4>[   60.699413]  wg_packet_decrypt_worker+0x168/0x230 [wireguard]
<4>[   60.705208]  process_one_work+0x200/0x360
<4>[   60.709240]  worker_thread+0x48/0x480
<4>[   60.712926]  kthread+0xf8/0x130
<4>[   60.716119]  ret_from_fork+0x10/0x1c
<0>[   60.719717] Code: f94000c0 377800a0 f94004c0 37000060 (d4210000) 
<4>[   60.725859] ---[ end trace 0dc356291955837d ]---
<0>[   60.734256] Kernel panic - not syncing: Fatal exception in interrupt
<2>[   60.740659] SMP: stopping secondary CPUs
<0>[   60.744605] Kernel Offset: disabled
<0>[   60.748117] CPU features: 0x0,20006008
<0>[   60.751

To be fair, this is with the ubiquiti built wireguard kernel module, not yours. Are you goign to compile one of yours some day? As I have more faith in yours than ubiquiti's since these issues never occurred with yours :)

peacey commented 2 years ago

Hey @bluewalk, sorry for taking my time. Nothing seems odd in your tcpdump. And it's weird it's still crashing for you randomly, honestly have never had this happen and I run it constantly.

However, we are now able to compile the latest module for 1.11.0 thanks to @pwallner for getting the kernel sources. So we'll have the new module for you to test shortly.

bluewalk commented 2 years ago

No worries @peacey, already very happy you take the time to reply. Maybe it has to do with multiple configs running multiple vpns?

Great news about the sources, looking forward to test it.

On a side note, does WireGuard/wireguard-vyatta-ubnt use your kernels as well in their packages? As the scripts seem the same.

[edit] I should’ve read their credits and see them mentioning you 😅

peacey commented 2 years ago

How many VPNs are you running, and are they all wireguard?

I just published a new release with WG built for 1.11.0. Try to upgrade and see if you still have the issue. The module was also updated to 1.0.20211208 (while UDM was on 1.0.20210124), so let's see what happens!

When you upgrade, open up setup_wireguard.sh and set LOAD_BUILTIN=0 at the top so it doesn't load the built-in module. If you previously loaded wireguard, run rmmod wireguard to unload the module first. Then run the script and check your dmesg to see if the module loaded with the correct version.

On a side note, does WireGuard/wireguard-vyatta-ubnt use your kernels as well in their packages? As the scripts seem the same.

Yes, as you guessed it's the same script. The two repos are updated in parallel.

bluewalk commented 2 years ago

How many VPNs are you running, and are they all wireguard?

I am running 4 VPN's, all wireguard (including the NordVPN/Mullvad vpn) mostly to be able to apply firewall rules per VPN interface.

I just published a new release with WG built for 1.11.0.

Great news! I have immediately upgraded and it has loaded your new module:

[   37.936985] wireguard: WireGuard 1.0.20211208 loaded. See www.wireguard.com for information.
[   37.936988] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.

I'll keep an eye on it to see if it randomly reboots. But I have good hopes, as it all started witht the UBNT module.

bluewalk commented 2 years ago

I have been running your kernel module for almost a week now and haven't experienced any weird behaviour nor crashes 👍

peacey commented 2 years ago

That's good to hear, @bluewalk. And you've tried the WiFiman upload tests like before?

I guess the built-in module was an older version so it might have had some bugs that were fixed in the later versions. I don't think it's necessarily Ubiquiti's fault.

Anyhow, maybe watch it for a couple more weeks and if it's still good, we can mark this as resolved!

bluewalk commented 2 years ago

Yes, I have ran the WiFiman upload tests a couple of times every day and so far no glitches.

Could be that their module had bugs indeed. Will keep an eye out for a couple of weeks. Thanks!

bluewalk commented 2 years ago

Up for 17 days now without any issues. Closing 👍