tusc / wireguard-kmod

WireGuard for UDM series routers
https://www.wireguard.com/
350 stars 18 forks source link

latest v10 release causes kernel crashe #5

Closed tusc closed 2 years ago

tusc commented 3 years ago

The latest release of the v10 kernel builds and loads but will cause a crash when wg-quick is executed. You can check for UDM kernel crashes in /sys/fs/pstore/console-ramoops

This didn't happen with the older kernel. Currently investigating.

peacey commented 3 years ago

Hi @tusc, any updates on this? I tried building the module for the latest 1.10.0-13 release that just came out a few hours ago. Unfortunately the same issue happened as -12 where the UDMP crashed when the wireguard link is added (ip link add wg0 type wireguard triggers the crash). This didn't happen with the modules built for -11 and before.

I was really hoping it was a temporary issue that would be gone in -13, but it seems more like something permanent now :( Hopefully you can find a solution for it.

Crash says the kernel panic happens in register_netdevice. Since that function is part of the kernel, it might be an error in the dev structure that is passed to that function in wireguard's wg_newlink. Maybe some memory alignment issues with how they compiled their kernel vs. us? Or maybe they modified something in their new kernel that is causing a bug, though we can't know without the new kernel source.

peacey commented 3 years ago

Hi @tusc! I figured it out! I have a working module now for 1.10.0-13, link comes up successfully and traffic flows through with no issues. Here's the .ko module if you want to try it on 1.10.0-13: wireguard-kmod-1.10.0-13.zip

I was able to debug this by extracting the kernel from the uBoot Image (found by mounting /dev/sda1) using dumpimage, converting the kernel image to an ELF file with symbols using the vmlinux-to-elf tool, and then disassembling the kernel with objdump.

I disassembled the register_netdevice function and saw that it was looking up the netdev_ops structure using an offset of #520 in the net_device structure for the new kernel (1.10.0-13), but for the old kernel (1.10.0-11) it was using an offset of #504 (a difference of 16 bytes). So I figured in the new kernel, they must have changed something in their net_device structure to offset the fields, such as adding a variable or two to it.

I then disassembled the wireguard module that we compiled, and saw it was using an offset of #504 for the netdev_ops structure. So the kernel headers we are compiling against didn't have this new offset, which was causing memory alignment issues in the net_device structure.

After many hours of looking up all the offsets for the net_device structure members in the disassembled kernels by cross-referencing the C code of different functions where those variables existed with the assembly code, I believe I figured out where they added the variable and to what. They added one new variable to the net_device structure, and one to net_device->stats structure (probably to fix the LLDP drop counter bug mentioned in the 1.10.0-12 changelog. Also, when I was debugging, I stumbled upon dev_get_stats() and saw that they also added the new variable at the end of the rtnl_link_stats64 and rtnl_link_stats structures to coincide with the net_device->stats structure.

So I modified my kernel headers to add an extra variable to these structures in the correct place, and when I disassembled the fixed wireguard module, I saw it was using the correct offset for netdev_ops within the net_device structure. I loaded this module on 1.10.0-13 and added the wireguard link, and it worked! Hoozah!!!

Now, my guesses for the locations of the variables were empirical but not perfect, so I hope there won't be any consequences when looking at device statistics like memory leaks or crashes. The statistics for the wireguard interface still seem to work and look right when I check them (using ip -s link show wg0), though I haven't tested extensively (I'm not sure how to generate dropped or error packets for example). Also, I don't believe they changed anything else in the kernel, though it's practically very difficult to find out unless it crashes eventually and tells us where. So far no crashes have happened in over an hour with the tunnel up, so that's a good sign.

Sorry for the long comment, I will do a pull request tonight with the changes to the repo, but I've also attached the patch file if you're curious.

tusc commented 3 years ago

@peacey I appreciate the contribution and hard work at this. I will go ahead and accept the pull request. I've been without a UDM pro for a few weeks now since I had to submit mine for an RMA and recently received a replacement.

My Git LFS data quota was just exceeded this month. I will have to figure out another way of hosting the kernel source. Speaking of which, UBNT recently sent me the -13 kernel source. Do you have it? I will post the google drive link they sent.

Finally, are you on the discord Ubiquiti discord channel?

peacey commented 3 years ago

Thank you @tusc. I use this script all the time, so least I could is contribute. Thank you for creating this script!

For the kernel storage, how is the bandwidth quota with Google drive? We could host it on multiple mirrors.

Also, that's great that Ubiquiti sent you the -13 source. That means we don't have to have any guesswork in the kernel headers like in my temporary patch, and we can confirm if they did indeed add the extra variable like I guessed.

We should probably recompile -12 and -13 with the new headers for production release just in case. If you send them over, I can add the new sources to the script, which is really easy now that we have multi base support.

And no I'm not on the Ubiquiti Discord. I didn't know there was one. What's the channel?

tusc commented 3 years ago

This is the email I received from github re: LFS quota:

You’ve used 100% of your data plan for Git LFS on your personal account tusc. Please purchase additional data packs to cover your bandwidth and storage usage:

https://github.com/account/billing/data/upgrade

Current usage as of 22 Jun 2021 09:52PM UTC:

Bandwidth: 1.04 GB / 1 GB (104%)

I just upgraded to a paid plan in the interim. We could just try replacing the LFS files with the google drive links that UBNT sends out upon GPL requests.

I just emailed you the -12 source google drive link.

Discord invite link is available from the top of this thread: https://www.reddit.com/r/Ubiquiti/comments/6dbqfg/ubiquiti_discord_channel/