slackhq / nebula

A scalable overlay networking tool with a focus on performance, simplicity and security
MIT License

On OSX startup TUN route not created #407

Closed jwestbrook closed 1 year ago

jwestbrook commented 3 years ago

As I deployed nebula to my fleet of around 30 workstations, I noticed on 2-3 of them that the TUN device was created and received its IP address, and the device registered with the lighthouse, but the route was not created. As soon as I ran route add with the correct subnet and interface, traffic started flowing to/from the device.

It's probably some fluke or conflict with my current VPN setup (which this setup is replacing), but I want to make sure I run the correct commands to fix the route in the future.

Should I bounce the nebula service? Restart the device? Reinstall? Is there a specific log entry that shows the route command failed?

edit: I also have 2 other items. Apple Remote Desktop only works when the firewall is set to allow all ports/protocols (but then it works well), and I would like statsd stats output as well.

MikePadge commented 3 years ago

I umm... did not know ARD was still in use :P (We gotta laugh about it... it's sad how ignored it's been)

re: Apple Remote Desktop, I'm not sure what automation you might have set up, but if you're looking at it strictly as remote access software, you can just expose the RDP ports locally on each workstation and forward them over an ssh session. That way you're only leaving ssh open on each endpoint.

Here's an ~/.ssh/config example I use for VNC:

Host Hulk
    Hostname 10.10.10.10
    User banner
    Identityfile "~/.ssh/id_whateva"
    LocalForward 5901 127.0.0.1:5901
    ServerAliveInterval 240
    TCPKeepAlive no

Now if I pull up screensharing and go to localhost at "127.0.0.1:5901", I'm pulling up vnc served from the remote machine over my ssh tunnel.

RDP port is TCP 5988 I believe

caguiclajmg commented 3 years ago

> Is there a specific log entry that shows the route command failed ?

Looking at the code, you should see a line like failed to run 'route add': XXX in the logs if the command failed.

https://github.com/slackhq/nebula/blob/7a9f9dbded135947abb5f55acdf1befbe484bc91/tun_darwin.go#L56-L58

I suggest dumping the command that the lines above generate and running it manually to see if it works. There might be some parameter naming inconsistency in /sbin/route between macOS versions, or it could be a permission issue.
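One way to try this by hand is to print an equivalent route invocation and run it under sudo. The following is a sketch, not nebula's code: the flag order (-n add -net CIDR -interface DEVICE) is an assumption about what the linked tun_darwin.go lines pass to /sbin/route and should be checked against the source, and build_route_command, format_command, and the sample values are hypothetical names chosen for illustration.

```python
import ipaddress
import shlex


def build_route_command(cidr, device):
    """Return the argv for a macOS 'route add' for one subnet and interface.

    Assumption: these flags mirror what nebula runs on darwin; verify them
    against the linked tun_darwin.go lines before relying on this.
    """
    # Normalise to the network address, e.g. 10.100.40.7/24 -> 10.100.40.0/24.
    # Raises ValueError if the CIDR cannot be parsed.
    network = ipaddress.ip_network(cidr, strict=False)
    if not device:
        raise ValueError("device name is empty")
    return ["/sbin/route", "-n", "add", "-net", str(network), "-interface", device]


def format_command(argv):
    """Render argv as a single shell-safe line for copy and paste."""
    return shlex.join(argv)
```

For example, format_command(build_route_command("10.100.40.0/24", "utun1")) gives /sbin/route -n add -net 10.100.40.0/24 -interface utun1. Running that line manually and noting the exit status and any error text should show whether the command itself is the problem.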

jwestbrook commented 3 years ago

So here are more specifics:

For ARD, I opened all the ports that Apple lists for ARD (5900, 3283, 22; UDP and TCP), and then even ANY protocol. I could connect to the device and ping it, but the device would not be listed as online in the ARD window unless I opened all the ports and all the protocols.

For the route problem, here is how I figured it out: the lighthouse's debug logs showed the device as connected and online, and ifconfig showed the IP address configured and attached to the TUN device, but there was no route sending traffic for that subnet over the TUN. As soon as I added the correct route for the subnet, the device showed up in ARD and was pingable. As I mentioned above, there might be a conflict with the current VPN, but I would think the two should be able to coexist. After I run the route command, both connections coexist until logout/reboot, so some other process might be removing the route or flushing the route table? More to the question: restarting the nebula process doesn't refresh the route, I'm guessing because the device already exists so it thinks the route is fine? I just want to know whether there is a built-in way to clean it up, or whether I should keep testing if I can ping the lighthouse and, if not, fix the route.
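The "check, and fix if missing" workaround described here could look roughly like the sketch below. It tests the routing table directly rather than pinging the lighthouse. Everything in it is an assumption for illustration: the function names are hypothetical, the parser assumes the abbreviated destinations that macOS netstat -rn -f inet prints (for example 10.100.40/24, or 192.168.1 with the mask implied by the number of octets), and ensure_route must run as root on macOS.

```python
import ipaddress
import subprocess


def expand_destination(dest):
    """Expand netstat's short form, e.g. '10.100.40/24' -> 10.100.40.0/24.

    Assumption: with no explicit mask, N octets imply an 8*N bit prefix and
    four octets mean a host route. Returns None for rows that do not parse.
    """
    addr, _, bits = dest.partition("/")
    octets = addr.split(".")
    if not bits:
        bits = "32" if len(octets) == 4 else str(8 * len(octets))
    octets += ["0"] * (4 - len(octets))
    try:
        return ipaddress.ip_network(".".join(octets) + "/" + bits, strict=False)
    except ValueError:
        return None


def route_present(netstat_output, cidr, iface):
    """True if the table has a row for cidr that mentions iface."""
    wanted = ipaddress.ip_network(cidr, strict=False)
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        if expand_destination(fields[0]) == wanted and iface in fields[1:]:
            return True
    return False


def ensure_route(cidr, iface):
    """Add the route when it is missing; returns True if it was added."""
    table = subprocess.run(
        ["netstat", "-rn", "-f", "inet"],
        capture_output=True, text=True, check=True,
    ).stdout
    if route_present(table, cidr, iface):
        return False
    network = str(ipaddress.ip_network(cidr, strict=False))
    subprocess.run(
        ["/sbin/route", "-n", "add", "-net", network, "-interface", iface],
        check=True,
    )
    return True
```

Calling ensure_route("10.100.40.0/24", "utun1") from a periodic job would re-add the route after something removes it, which treats the symptom while the cause is still being tracked down.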

caguiclajmg commented 3 years ago

You already made the point in your initial post that manually adding the route corrected the issue. The reason I suggested dumping the command generated by the snippet above is to check whether the command nebula executes to add the route is valid in the first place.

You also still haven't mentioned the exact command you used to add the route, although I expect it to be the plain ip route add 192.168.0.0/24 dev nebula1. The key question is why nebula can't execute this command or, if it can, why the route is disappearing (as you said, probably an external application messing with the routes).

> restarting the nebula process doesnt refresh the route because I'm guessing the device already exists so it thinks the route is fine?

nebula doesn't check whether the route is already in the routing table; it tries to insert it every time it starts. Most likely it just fails again, as it did the first time you started it.

> Just wanting to know if there is a built in way to clean it up

No, routes attached to a device automatically get torn down once the corresponding interface is taken down (at least on Linux). This would mean that when you stop nebula, the corresponding route entry should also be deleted (try ip route before and after stopping the tunnel).

jwestbrook commented 3 years ago

Here's the route command I run:

route add -net 10.100.40/24 -interface $TUNDEVICE

Honestly, I've searched through the logs on the individual devices (/var/log/system.log) and I don't see many, if any, log entries from the nebula service, even with the log level at debug. I will point out that the generated plist installed by nebula -service install is missing the logging keys StandardOutPath and StandardErrorPath. I'm not sure whether, when those are missing, the output should be redirected to the OSX Apple System Logs.
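If the missing keys turn out to be why nothing is logged, setting them explicitly is one way to capture whatever nebula prints. Below is a minimal sketch using Python's plistlib. StandardOutPath and StandardErrorPath are the launchd.plist key names; add_log_paths and the log file locations in the usage note are hypothetical, and the path of the installed plist is not given in this thread, so it is left to the caller.

```python
import plistlib


def add_log_paths(plist_bytes, stdout_path, stderr_path):
    """Return the job definition as XML with both launchd log keys set.

    plist_bytes is the content of the existing plist file; all other keys
    in the job definition are preserved unchanged.
    """
    job = plistlib.loads(plist_bytes)
    job["StandardOutPath"] = stdout_path
    job["StandardErrorPath"] = stderr_path
    return plistlib.dumps(job, fmt=plistlib.FMT_XML)
```

A possible use is to read the installed plist as bytes, pass it through add_log_paths with two writable paths (for instance files under /var/log), write the result back, and reload the job with launchctl so the new keys take effect.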

Just for fun, I killed nebula on my device so launchd would restart it, to see what was output. I didn't get any logs in Console.app, and only a single line in system.log:

com.apple.xpc.launchd[1] (com.nebula[41395]): Service exited due to SIGKILL | sent by kill[41526]

caguiclajmg commented 3 years ago

Unfortunately, I do not currently have access to a macOS machine to help you further, nor am I familiar with how stderr/stdout redirection works for macOS packages. It would be great if you could run this under a debugger to see what exactly fails, but it seems you already have a workaround in the meantime, so I guess it's not a total showstopper.

Just some final thoughts: assuming there isn't another (conflicting) VPN messing with the routes, I would put my money on a permission issue when executing the route addition, or on c.Cidr.String() or c.Device having invalid values for some reason, causing the command to fail.

johnmaguire commented 1 year ago

Closed for inactivity. This doesn't seem to be something other users are experiencing. If this is an issue you're still dealing with, please ping me to reopen the ticket and we can assist with further debugging. Thanks!