Expose health status - Githubissues

erikschul commented 7 months ago

I would like to use wireproxy with kubernetes, but for the http proxy, I'm missing a health endpoint:

/livez:

200 if tunnel is established (e.g. no config error, unresolvable dns, etc.)
503 otherwise

/readyz:

200 if latest keepalive received
503 otherwise

This will use wireguard config PersistentKeepalive for period. Kubernetes can use failureThreshold to ignore intermittent failures.

leros1337 commented 5 months ago

nice to see some metrics in prometheus format too

pufferffish commented 2 months ago

nice to see some metrics in prometheus format too

The kubernetes health endpoint seems easy to implement but I'm not familiar with Prometheus. How would metrics in Prometheus look like?

erikschul commented 2 months ago

I haven't implemented a metrics endpoint before, but it looks like...

https://prometheus.io/docs/instrumenting/exposition_formats/

that maybe it's as trivial as

<metric name> <float value> <epoch milisecond int64>

example

live 1 1395066363000
ready 1 1395066363000

pufferffish commented 2 months ago

okay I've given this some thoughts:

I'm actually not sure how to implement the /livez endpoint, since wireproxy would just crash currently if there is a config error for example. I feel like a /livez endpoint would just always return 200 since any error would just cause wireproxy to not start? I'm not sure what is the best approach here.

As for the /readyz endpoint, I guess I can just implement it as "200 if I have received any packet from a peer within the PersistentKeepalive period". Although I'm not sure how to handle this when PersistentKeepalive isn't set, or more obscurely, some peers have PersistentKeepalive set while some don't.

Would appreciate some feedbacks so I can implement this right.

erikschul commented 2 months ago

I argue that:

there's no need for a livez endpoint
when readyz is enabled via a CLI argument, it should require PersistentKeepalive to be configured for all peers
Anyone needing readyz with a peer without PersistentKeepalive would have to document their use case and suggest a meaningful solution.

fmierlo commented 2 months ago

It would be useful to expose some "wg show dump" information such as endpoint, latest-handshakes and transfer + timestamp of the last keepalive + error messages by peer to collect metrics and detect nat timeout/ip changes of devices behind NAT.

erikschul commented 2 months ago

@fmierlo I agree, although technically a separate issue, and with much larger scope, than health status.

Such information should be exposed in a separate endpoint, e.g. /metrics.

pufferffish commented 2 months ago

Hmm, I've implemented a /metrics which would give you as much information as you would get with wg show. Would be nice if I can make wireguard-tools works with wireproxy, but that'd require root permission to create a unix socket in /var/run which kinds of defeat the permission-less purpose of wireproxy. If enough people find it useful I guess I can just add it as a feature optionally turned on by a flag.

As for the /readyz endpoint: it seems like we're all kinda confused how PersistentKeepalive works. I was under the assumption that peers would respond with a keep alive packet upon receiving one, but this seems to be not the case. The keep alive packet is just sent so it would keep the NAT mapping alive so there's no obligation for the peer to respond with one. So unless we assume the peer has also set a PersistentKeepalive there's no way for us to detect whether the tunnel is alive? We also need to keep track of keep alive packets by the peer's PersistentKeepalive and not our PersistentKeepalive, so that might introduce some complications.

Another thing is that apparently wireguard doesn't support a nice way to check the timestamp of the last received packet, and only gives us the timestamp of the last handshake packet. Keep alive packets aren't handshake packets, so there's no nice way to check the last received keep alive packet progammatically (it is possible, we can just listen to the log, but that's a really bad way of doing this imo).

I think another way we can implement the /readyz endpoint is this: Each peer has new options CheckAlive and CheckAliveInterval. CheckAlive takes in an IP, and sends ICMP ping periodically according to CheckAliveInterval, which is defaulted to 5 seconds. /readyz would return 200 if all peers with CheckAlive configured have received an ICMP pong within the last CheckAliveInterval. Any opinions?

fmierlo commented 2 months ago

I agree @erikschul.

Nice work @pufferffish https://github.com/pufferffish/wireproxy/compare/master...k8s-status

It is very difficult to detect NAT failures because several things can be the cause: IP change, port exhaustion, modem reset, etc. But having some pragmatic information helps.

erikschul commented 2 months ago

As for the /readyz endpoint: it seems like we're all kinda confused how PersistentKeepalive works. I was under the assumption that peers would respond with a keep alive packet upon receiving one, but this seems to be not the case. The keep alive packet is just sent so it would keep the NAT mapping alive so there's no obligation for the peer to respond with one. So unless we assume the peer has also set a PersistentKeepalive there's no way for us to detect whether the tunnel is alive? We also need to keep track of keep alive packets by the peer's PersistentKeepalive and not our PersistentKeepalive, so that might introduce some complications.

That certainly complicates things.

I think that if someone wants to use the readyz feature, then they have to be willing to guarantees certain aspects of the configuration, i.e. if it fails with the wrong configuration (lack of reciprocal PersistentKeepalive) then that's their problem, especially if there's no other good solution.

I think another way we can implement the /readyz endpoint is this: Each peer has new options CheckAlive and CheckAliveInterval. CheckAlive takes in an IP, and sends ICMP ping periodically according to CheckAliveInterval, which is defaulted to 5 seconds. /readyz would return 200 if all peers with CheckAlive configured have received an ICMP pong within the last CheckAliveInterval. Any opinions?

Some Wireguard endpoints are not reachable via ICMP, or the traffic is forwarded and the ICMP response is for the wrong host. What about sending ICMP from wireproxy, over the wireguard connection, to the server's local ip? I still think that the built-in keepalive packages are more suitable.

I don't use more than 1 peer, so I can't vouch for that use case. I guess it makes sense that readyz should fail if not all peers are working, although that makes a rather fragile setup, where all connections will fail if any connection fails, so I doubt that anyone would use that.

erikschul commented 2 months ago

@pufferffish This is a bit dangerous

if pair[0] == "private_key" || pair[0] == "preshared_key" {
                pair[1] = "REDACTED"
            }

If the name "private-key" name changes, the key could be exposed. Or theoretically other information that is later added, but not caught. I recommend an inclusive approach instead, to explicitly name what you think is safe to include.

pufferffish commented 2 months ago

Some Wireguard endpoints are not reachable via ICMP, or the traffic is forwarded and the ICMP response is for the wrong host. What about sending ICMP from wireproxy, over the wireguard connection, to the server's local ip?

Yes I should've phrased it better. My idea is that you can specify an address with CheckAlive, and wireproxy would send ICMP pings over wireguard to the CheckAlive address at the interval specified by CheckAliveInterval.

I still think that the built-in keepalive packages are more suitable.

This is true, but I don't think wireguard's API exposes when we last received a keep-alive packet. The only way to get this information afaik is to just parse the log that wireguard is printing, but that's a really bad approach to this imo.

I guess another justification for ICMP over wireguard is that just because you are receiving keep-alives doesn't necessarily mean the tunnel is working, for example your peer might've misconfigured some firewall rules which makes it unable to forward IP packets from their wireguard interface to their local network interfaces. ICMP pings would catch that.

If the name "private-key" name changes, the key could be exposed. Or theoretically other information that is later added, but not caught. I recommend an inclusive approach instead, to explicitly name what you think is safe to include.

Can you clarify what you mean by an inclusive approach? Also these names are pretty much baked into wireguard, the wg(8) tools rely on these names being constant in order for wg show to work. If other information is added later that would be if there's a new version of wireguard that added some new information, and in that case we can just modify this code when we update the wireguard dependency.

erikschul commented 2 months ago

My idea is that you can specify an address with CheckAlive, and wireproxy would send ICMP pings over wireguard to the CheckAlive address at the interval specified by CheckAliveInterval.

I agree that this is probably the best solution.

When using third-party VPN services, e.g. ProtonVPN, and the server's local ip is unknown, perhaps it would work just as well by pinging a public address like 1.1.1.1, as long as it's guaranteed to be done over the wireguard connection.

leros1337 commented 2 months ago

You can research this repo if you are familiar rust - https://github.com/kbknapp/wireguard_exporter, its all about prometheus metrics getting from wg show. The idea that your application gives metrics without any external exporters

pufferffish / wireproxy

Expose health status #96