prometheus-community / pro-bing

A library for creating continuous probers

Return error instead of logging and continuing in runLoop() #40

Closed floatingstatic closed 1 year ago

floatingstatic commented 1 year ago

I noticed that when running pro-bing under a condition that triggers an error, such as a "no route to host" scenario, the run loop logs the error and continues. This can be misleading to callers and should probably return an error instead. There are two sections in runLoop() with FIXME comments that describe this behavior. It would be more correct to return an error here.

This PR replaces logger.Fatalf() with returning a wrapped error to remediate this. We use %w to wrap the error so callers can unwrap and inspect it with something like:

if err := pinger.RunWithContext(ctx); err != nil {
    if neterr, ok := errors.Unwrap(err).(*net.OpError); ok {
        // do something interesting here with the error
    }
}
SuperQ commented 1 year ago

""no route to host"" should not stop the loop.

floatingstatic commented 1 year ago

Hi @SuperQ, so without this patch log.Fatalf() gets called (but does not actually exit; the loop just continues). Here is output from a ping to an IPv6 address with an empty IPv6 routing table (no default):

2023/05/24 21:25:53 FATAL: sending packet: write udp [::]:0->[2001:db::1]:0: sendto: no route to host
2023/05/24 21:25:54 FATAL: sending packet: write udp [::]:0->[2001:db::1]:0: sendto: no route to host
2023/05/24 21:25:55 FATAL: sending packet: write udp [::]:0->[2001:db::1]:0: sendto: no route to host
2023/05/24 21:25:56 FATAL: sending packet: write udp [::]:0->[2001:db::1]:0: sendto: no route to host
2023/05/24 21:25:57 FATAL: sending packet: write udp [::]:0->[2001:db::1]:0: sendto: no route to host

I'm not sure if you meant that you did not think this would occur (it does), or that you do not think the loop should exit in this circumstance. Can you clarify?

There are several cases besides "no route to host" where we get errors sending, and it seems preferable to exit the loop in these cases since they are generally not recoverable. One would be after introducing #39: with do-not-fragment set, sending a packet larger than the MTU produces a send error (net.OpError): sendto: message too long. There is no point in retrying/continuing in this case. Another case would be sending a packet through an interface that is down, which looks something like this:

write udp 0.0.0.0:47784->203.0.113.255:0: sendto: invalid argument

I can understand that this may change the expected behavior for consumers of this library. If that is the concern, I wonder if you would be open to some alternatives: for my use cases it is preferable to abort immediately when encountering errors sending (or receiving), but perhaps this does not apply to everyone. Some alternatives that I can see:

Lastly I apologize for the high number of PRs the past few days. My company has an internal application that uses a fork of go-ping/ping with internal patches from about 5 years ago. Given go-ping is no longer maintained we want to pull in pro-bing with a number of our internal patches upstreamed, at least the ones that make sense to share for general use. Thanks for your time reviewing these, I appreciate it.

SuperQ commented 1 year ago

Normal /usr/bin/ping also returns this error without stopping the loop.

$ /usr/bin/ping 2001:db::1
PING 2001:db::1(2001:db::1) 56 data bytes
From X::Y icmp_seq=2 Destination unreachable: No route
From X::Y icmp_seq=12 Destination unreachable: No route
^C
--- 2001:db::1 ping statistics ---
12 packets transmitted, 0 received, +2 errors, 100% packet loss, time 11239ms
SuperQ commented 1 year ago

Lastly I apologize for the high number of PRs the past few days. My company has an internal application that uses a fork of go-ping/ping with internal patches from about 5 years ago. Given go-ping is no longer maintained we want to pull in pro-bing with a number of our internal patches upstreamed, at least the ones that make sense to share for general use. Thanks for your time reviewing these, I appreciate it.

More patches are great, thanks for contributing.

floatingstatic commented 1 year ago

Normal /usr/bin/ping also returns this error without stopping the loop

Right, that makes sense, but I need a way to signal that this occurred in the ping response. It's obvious via stdout but not in the struct returned by a pinger. I need to be able to distinguish 100% packet loss caused by some underlying condition reported by Linux from an actual path problem. Are you open to adding a boolean to selectively enable returning an error? I'll submit another commit that does this, and you can let me know what you think of the implementation. Thanks!

SuperQ commented 1 year ago

Sounds like we need another packet handler function registry. Something like:

if handler := p.OnPacketError; handler != nil {
    handler()
}
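The handler-registry pattern above can be sketched in isolation. Pinger here is a cut-down stand-in for pro-bing's Pinger, and OnPacketError is the proposed (not yet existing) field; the handler is given the error, matching the later refinement of this idea in the discussion below:

```go
package main

import (
	"errors"
	"fmt"
)

// Pinger is a cut-down stand-in for pro-bing's Pinger, showing only the
// proposed error-handler registry.
type Pinger struct {
	OnPacketError func(err error)
}

// sendICMP simulates one send attempt; on failure it notifies the
// registered handler instead of logging, then the loop can continue.
func (p *Pinger) sendICMP() {
	err := errors.New("sendto: no route to host")
	if handler := p.OnPacketError; handler != nil {
		handler(err)
	}
}

func main() {
	p := &Pinger{}
	p.OnPacketError = func(err error) {
		fmt.Println("send failed:", err)
	}
	p.sendICMP()
}
```

This keeps the default behavior unchanged (a nil handler is simply skipped) while letting callers observe, count, or act on send errors.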
floatingstatic commented 1 year ago

I did this: https://github.com/prometheus-community/pro-bing/pull/40/commits/7268dfe273a5496cd328971cb8ebdbd75a7ec667

... but seeing your latest comment, perhaps that is a better approach. Happy to go that route if it's preferable.

SuperQ commented 1 year ago

I need to think about this, and how other error cases are handled.

floatingstatic commented 1 year ago

One brief thought, say we add something like this to the pinger:

OnPacketError func(*Packet, error)

One problem with this is that it really only works for sending packets, as we do not have packet info on the receiving side if we hit an error after icmp.ParseMessage() (for example).

We could also do this to make it more generic to both sending and receiving:

OnPacketError func(error)

but then we miss out on the packet details in the case where we have that info.

Maybe a handler for sending and another for receiving makes more sense?

OnSendError func(*Packet, error)
OnRecvError func(error)

For my use case, in either of these functions I could call pinger.Stop() in a closure when I hit an error, which meets my needs.

floatingstatic commented 1 year ago

Proposing #44 as an alternative to this PR, based on the recent discussion.

floatingstatic commented 1 year ago

Closing in favor of #44