slackhq / nebula

A scalable overlay networking tool with a focus on performance, simplicity and security
MIT License

🐛 BUG: Event Log spam when handshake timeout fails #1149

Closed · lostmsu closed this 3 weeks ago

lostmsu commented 1 month ago

What version of nebula are you using? (nebula -version)

1.9.0

What operating system are you using?

Windows

Describe the Bug

Nebula generates an entry in Event Log/Application/Nebula every ~10 seconds when it can't handshake with a remote host.

Logs from affected hosts

I get an enormous number of messages like this one:

time="2024-05-23T15:33:28-07:00" level=info msg="Handshake timed out" durationNs=6728059500 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=322120572 localIndex=322120572 remoteIndex=0 udpAddrs="[]" vpnIp=192.168.44.255

Config files from affected hosts

N/A

SamSirry commented 1 month ago

I don't think this is a bug. This log message is of an "info" level. Perhaps you will want to consider changing the logging level to warning or error.
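For reference, the log level is controlled by the `logging` section of the Nebula config file. A minimal sketch, following the keys in Nebula's example config:

```yaml
# Sketch of the relevant section of a Nebula config (config.yml).
# Raising the level to "warning" suppresses info-level messages such as
# "Handshake timed out".
logging:
  # One of: panic, fatal, error, warning, info, or debug
  level: warning
```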

lostmsu commented 1 month ago

@SamSirry it is not technically broken, but the service should be user-friendly in its default configuration. Either it should not send messages to the Event Log, or this particular message should have its level lowered, or the logging frequency should be reduced for this message.

johnmaguire commented 1 month ago

Hi @lostmsu - can you help me understand the motivation for preventing these log messages from appearing in the log? We believe that when things are working correctly, you shouldn't need to read the log file. There are many more debug level messages, but the handshake messages are intentionally info level as they are highly pertinent when dealing with support tickets.

The log messages occur frequently because a process on your system is attempting to reach out to the node in question and failing. If your host were not trying to access the remote host, or if you were able to handshake successfully, these log messages would not occur. As it stands, there seems to be a potential problem in your network, and that's what these logs are intended to help with.

lostmsu commented 1 month ago

Because when Nebula is running as a Windows service, these entries do not go to a Nebula-specific log file; they go to the system-wide Event Log under Application.

Here's that log on a healthy host: (screenshot) Under normal operation you get maybe 10 messages an hour.

Here's the log of a host running the Nebula service when one of the peers is a sleeping laptop: (screenshot) ~1000+ messages from Nebula per hour.

Basically, it makes it nearly impossible to investigate issues with other applications, because the whole log is flooded with Nebula's messages, whose verbosity is clearly too high.

BTW, "Refusing to handshake with myself" also definitely should not be an error. That is an event that happens during normal operation and is of no concern to the user unless they want to debug connectivity in the extremely rare case where Nebula thinks the other host is itself while it is not. This should also be a debug-level message.

johnmaguire commented 4 weeks ago

Thanks for explaining the impact to your workflow.

BTW, "Refusing to handshake with myself" also definitely should not be an error. That is an event that happens during normal operation and is of no concern to the user unless they want to debug connectivity in the extremely rare case where Nebula thinks the other host is itself while it is not. This should also be a debug-level message.

This log message occurs when a host attempts to handshake with a peer that presents a certificate with the same VPN IP (Nebula IP) as itself. It is a misconfiguration to have multiple hosts on a network with the same IP address. Therefore this generally means something has gone very wrong, either in the network configuration or internally to Nebula.

Can you explain the normal operation that you're seeing this in?

lostmsu commented 4 weeks ago

@johnmaguire using a public lighthouse is the only potential reason that comes to mind for "Refusing to handshake with myself". Seeing the message, I just assumed it is similar to the behavior of the BitTorrent protocol: when you get peer endpoints from a tracker and try to connect to all of them, you might hit your own endpoint, one you might not otherwise know about. For instance: you listen behind NAT on [local IP:your port], and you get the [public router IP:your port] endpoint, which the router actually forwards to [local IP:your port]. So when you try to connect to [public router IP:your port], the handshake notices that the remote peer ID is the same as yours, and you drop the connection.

johnmaguire commented 4 weeks ago

Thank you again for the context.

For instance: you listen behind NAT on [local IP:your port], and you get the [public router IP:your port] endpoint, which the router actually forwards to [local IP:your port].

The Lighthouse learns underlay IP/port combos from hosts when nodes report their IP addresses. Nodes look at all IP addresses assigned to them (e.g. ipconfig/ifconfig/ip addr) and combine this list with what's configured in advertise_addrs before sending them to the Lighthouse. In addition to the self-reported addresses, the Lighthouse looks at the "source" field of packets to learn of a possible NAT IP:port combo. The Lighthouse stores these in a map: vpnIP: [IP:port, IP:port, IP:port]

When behind a NAT, each device behind the NAT will receive a different public port associated with the Nebula instance. So, [public router IP:your port] is only assigned to one device behind the NAT. If another host reports the same IP address, it indicates a misconfiguration (e.g. in advertise_addrs.)

It is possible that two hosts on separate LANs will report the same private IP address / port combo. Indeed, this is more likely with public Lighthouses, since they are tailored towards home users, who likely all use the 192.168.0.0/16 range. FYI, the Nebula project does not support public Lighthouses. Lighthouses, as designed, were not intended to be shared between organizations, and networks should not reuse IP addresses. You can read through #642 and #306 to understand some of the challenges here. (For example, a Lighthouse can only hold on to so many underlay IP addresses [20?] for a given vpn IP - so if you have multiple hosts with the same vpn IP sharing a Lighthouse, you may never learn the underlay IP addresses for the node you are trying to access.)

While it is possible that two hosts on separate LANs will get the same IP address and choose the same port on the LAN (especially if configured with a static port), the message would still only appear when these two hosts attempt to handshake with each other. I suspect this is not very common - and this log message is intended to uncover issues such as misconfigured static_host_map entries, DNS entries, routing rules, and misbehaving NATs. I don't expect we'll change "Refusing to handshake with myself" to a debug-level message.


Regarding your initial issue with excessive Nebula logs appearing in the Event Viewer, this occurs when you install Nebula as a service using the "nebula-service" utility (packaged as nebula.exe on Windows). We expect most users want Nebula logs to appear in the Event Viewer.

One option you have is to simply restrict the log verbosity such that only warning and error messages are logged.

Alternatively, instead of installing Nebula with nebula.exe -service install -config <C:\path\to\config.yml>, you can simply run Nebula using nebula.exe -config <C:\path\to\config.yml> which will write to stdout/stderr and avoid writing to the Event Viewer. You may need to put in a little effort to create a service to run this at startup. Something like winsw could be helpful in creating a service that writes to a separate log file, as you prefer.
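To illustrate the winsw approach, here is a hypothetical winsw service definition following winsw's XML config format. The install paths and service id are placeholders, not values from this thread:

```xml
<!-- Hypothetical winsw config: runs Nebula directly so its stdout/stderr
     goes to a rolling log file instead of the Windows Event Log. -->
<service>
  <id>nebula</id>
  <name>Nebula</name>
  <description>Nebula overlay network (run via winsw)</description>
  <executable>C:\Program Files\Nebula\nebula.exe</executable>
  <arguments>-config "C:\Program Files\Nebula\config.yml"</arguments>
  <!-- Roll log files rather than writing to the Event Log -->
  <log mode="roll"></log>
</service>
```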

lostmsu commented 4 weeks ago

The "Refusing to handshake with myself" message is outside the scope of my initial bug report; if you wish, we can move to discussions for it. TBH I do not understand the scenario where this message appears. E.g. what is the node doing when it eventually ends up printing this message? If it has to do with other nodes having the same vpnIP, then why would a node even try to obtain that vpnIP: [IP1:port, IP2:port, IP3:port] list from the lighthouse for its own vpnIP? And then why would it try to handshake with these IPx:port guys, knowing they have the same vpnIP, just to immediately refuse to "handshake with myself"?

Re: public lighthouses. I would expect this to work similar to SNI. A single node with a single public IP:port pair could potentially be a lighthouse for any number of Nebula overlay networks by responding based on the requester's certificate under the assumption that every overlay network would have a different CA.

We expect most users want Nebula logs to appear in the Event Viewer.

restrict the log verbosity

I thought we already discussed that and the issue is that I also want logs to be present, because, for example, I might want to know when a peer is connected/disconnected (definitely info level). But I don't need to know of every single failed attempt to connect to a remote peer, especially if it happens 3 times a minute. Who would unless they are debugging?

johnmaguire commented 4 weeks ago

Re: public lighthouses. I would expect this to work similar to SNI. A single node with a single public IP:port pair could potentially be a lighthouse for any number of Nebula overlay networks by responding based on the requester's certificate under the assumption that every overlay network would have a different CA.

This suggestion is addressed in the linked thread: https://github.com/slackhq/nebula/issues/306#issuecomment-1707015209

That being said, we have no plans to support public Lighthouses.

I thought we already discussed that and the issue is that I also want logs to be present, because, for example, I might want to know when a peer is connected/disconnected (definitely info level). But I don't need to know of every single failed attempt to connect to a remote peer, especially if it happens 3 times a minute. Who would unless they are debugging?

In my experience, most people only look at Nebula logs when they are debugging. You seem to have a different preference for log levels than others do. These log levels have been set based on years of debugging Nebula issues. I'm not exactly sure what you mean by "I might want to know when a peer is connected/disconnected" but the debug SSH server may be a better fit for that goal. I understood the specific problem you are having is that they are causing a lot of noise in Event Viewer, which is why I suggested logging to a file instead:

it is not technically broken, but the service should be user-friendly in its default configuration. Either it should not send messages to the Event Log, or this particular message should have its level lowered, or the logging frequency should be reduced for this message.
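For reference, the debug SSH server mentioned above is enabled through the `sshd` section of the Nebula config. A sketch modeled on Nebula's example config; the listen address, key path, and user are placeholders:

```yaml
# Sketch: enable Nebula's debug SSH server for querying runtime state
# (tunnels, lighthouse cache, etc.) without raising log verbosity.
sshd:
  enabled: true
  # Listen on localhost only; this is a debug console, not a shell.
  listen: 127.0.0.1:2222
  # A dedicated SSH host key for this server (do not reuse the node's cert key).
  host_key: ./ssh_host_ed25519_key
  authorized_users:
    - user: debuguser
      keys:
        - "ssh-ed25519 AAAA... user@example"
```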

johnmaguire commented 3 weeks ago

After discussing with the wider team, we feel that these log messages are too important to move behind the debug level. They are invaluable in understanding and debugging users' connectivity issues in Nebula support tickets and on the Nebula OSS Slack channel.

We advise decreasing the log verbosity as needed or disconnecting Nebula from Event Viewer if it interferes with your workflow.