vishvananda / netlink

Simple netlink library for go.
Apache License 2.0
2.76k stars 723 forks source link

*Subscribe* family of functions leaks goroutines #542

Open jncornett opened 4 years ago

jncornett commented 4 years ago

Contingent on how long it takes to actually start reading from the socket fd, the following code will never return:

package main

import (
    "time"

    "github.com/vishvananda/netlink"
)

func main() {
    updates := make(chan netlink.AddrUpdate)
    done := make(chan struct{})
    netlink.AddrSubscribe(updates, done)
    time.Sleep(time.Second)
    close(done)
    <-updates
}

This is apparently because:

  1. there is no cancellation mechanism in the call to s.Receive()1.
  2. the initialization code does not set a read timeout for the socket (via, e.g. setsockopt).

This isn't a big deal for short-lived programs (as the goroutines will get cleaned up on exit), but I was writing a long lived daemon that creates and removes subscriptions based on configuration file updates. With the current behavior, it is not possible to write such a program without leaking goroutines via the above mechanism.

jncornett commented 4 years ago

I don't have the context to know if this makes sense for all use cases, but a simple fix would be to duplicate nl.NetlinkSocket.Receive to a new method nl.NetlinkSocket.ReceiveWithTimeout. Inside of this new method set a userspace timeout via unix.Select(...) immediately prior to the call to unix.Recvfrom(...). The *Subscribe* family of functions could use this new ReceiveWithTimeout method in place of Receive. If the return value of the error is Timeout (I know, checking error return values is anti-pattern), then the loop would just continue. As far as I know, this seems like the least invasive change to "cancel" a recvfrom()/recvmsg() call.

vincentbernat commented 2 years ago

unix.RecvFrom() can be unblocked by calling unix.Shutdown() with unix.SHUT_RD (or unix.SHUT_RDWR if you also want to unblock unix.SendTo()).

vincentbernat commented 2 years ago

I didn't test, but I think this patch would help:

diff --git a/nl/nl_linux.go b/nl/nl_linux.go
index 600b942b1785..72104b93ef2b 100644
--- a/nl/nl_linux.go
+++ b/nl/nl_linux.go
@@ -734,6 +734,7 @@ func SubscribeAt(newNs, curNs netns.NsHandle, protocol int, groups ...uint) (*Ne

 func (s *NetlinkSocket) Close() {
        fd := int(atomic.SwapInt32(&s.fd, -1))
+       unix.Shutdown(fd, unix.SHUT_RDWR)
        unix.Close(fd)
 }

In fact, it seems this is not even needed on recent kernel anymore. Just close(done) seems to do the trick.

Hendrik-H commented 2 years ago

I noticed the same issue on Fedora 35 with kernel 5.18.6-100.fc35.x86_64. Shouldn't that be recent enough?

maxiloEmmmm commented 8 months ago

SHUT_RDWR netlink return operation not supported

method of setting recv timeout is also ok

image image

before edit, seeing that some goroutines did not exit

image

after edit, look good

image