netlink socket created with pid of the process

stolowski commented 6 years ago

Hi! I've an issue when using go-udev in the unit tests of our project; we have multiple tests that instantiate our components and then destroy them etc; every time a go-udev monitor is created internally and then destroyed. The problem is that despite the fact that I stop the monitor via the "quit" channel and then close the netlink socket via its Close() method, I cannot create a new instance of UEventConn and Connect gives me "address aleady in use error". I'm not too familiar with netlink sockets, but by doing some experiments I found out that the problem is in assigning Pid of the process when creating the socket:

c.Addr = syscall.SockaddrNetlink{
        Family: syscall.AF_NETLINK,
        Groups: 1, // TODO: demistify this field because msg receive if Groups != 0
        Pid:    uint32(os.Getpid()),
}

This effectively prevents your application (at the kernel level) from having more than one netlink socket opened at a time . I think that what I'm seeing in my tests is a race where even though the structures are destroyed in the tests, the socket is still reserved for that pid in the kernel when I'm already trying to create and connect a new one. Does that make sense?

FWIW, here is an article https://medium.com/@mdlayher/linux-netlink-and-go-part-1-netlink-4781aaeeaca8 that says: "In my experience, it is much easier to just let netlink assign all PIDs itself, and make sure you keep track of which numbers it assigns for each socket." So as a quick test I removed Pid assignment from the above call and voilla, all works. However, if this explanation sounds viable, then I suspect there some more work needed to prevent Connect from being called multiple times on same UEventConn instance.

What do you think? Any idea how to avoid the above race, or am I missing something more obvious? Thanks!

stolowski commented 6 years ago

Upon further investigation (https://people.redhat.com/nhorman/papers/netlink.pdf) setting the Pid to own pid may be correct and actually good and the article I mentioned earlier might be wrong... Not sure, digging in...

pilebones commented 6 years ago

Good question, I had read a lot of documentations about the Pid (port id) field in a SockaddrNetlink struct and the sense (the usage) of this field is a little bit opaque to netlink core. Besides, that is what is said in the manual page of netlink http://man7.org/linux/man-pages/man7/netlink.7.html about nlmsg_seq and nlmsg_pid fields. I was looking some implementation of netlink usage inside kernel like busybox, see: https://elixir.bootlin.com/busybox/1.27.2/source/util-linux/uevent.c#L56 In this case, the pid of the process is also used.

I had also read this paper (ie: netlink.pdf) to be more precise, this part :

nl_pid - This field is PID of the process that should receive the frame being sent, or the process which sent the frame being received. Set this value to the PID of the process you wish to recieve the frame, or to zero for a multicast message or to have the kernel process the message

But the manual of netlink said : Note that there isn't a 1:1 relationship between nlmsg_pid and the PID of the process if the message originated from a netlink socket.

So as you can see, the answer is not so easy.

And until your request, the project had no need to subscribe to the netlink channel at multiple times. I'm not sure that this have a sense (outside a unit test) to have multiple subscriber inside the same application / binary. But I can find a way to pass the port Id of the netlink addr, maybe by splitting the code of Connect() or externalize bind() method. I will think about it. I'll keep you informed soon.

Best

stolowski commented 6 years ago

Thanks for reply. Yeah I see the answer may be complicated, so I think it's wise not to rush any changes in this area, perhaps it is ok the way it is now. This problem is currently only affecting our tests. I've no use case for multiple netlink sockets, it just turned out hard to prevent them in the tests. I'm sure I can tweak our test code, will keep investigating it.

pilebones commented 6 years ago

Did you manage to re-implement or find another solution to avoid this race in your unit tests ?

At this time, according to your answer, I will not apply changes to fix this behavior because I'm not convinced that have a sense to have many subscriber at the same time inside the same executable. But I keep in mind this limitation to enhance this usability in case of future unit test.

stolowski commented 6 years ago

Yes, I did manage to overcome the problem by mocking my monitor code in the tests, which makes sense anyway.. I think you can close the issue for now, thank you!

pilebones commented 6 years ago

Thanks for replied, I'm thrilled you have found a way to solve this problem.

debfx commented 5 years ago

Please reconsider fixing this.

I'm not convinced that have a sense to have many subscriber at the same time inside the same executable.

This affects not just NETLINK_KOBJECT_UEVENT but also makes it very hard to use other kinds of netlink sockets in the same process.

The kernel documentation is quite clear about the nl_pid field. From netlink(7):

There are two ways to assign nl_pid to a netlink socket. If the application sets nl_pid before calling bind(2), then it is up to the application to make sure that nl_pid is unique. If the application sets it to 0, the kernel takes care of assigning it.

The kernel can take care of assigning a unique nl_pid so I don't see the advantage of the current approach. libudev / udevadm also set nl_pid=0 so it certainly wouldn't be the first library to do so.

pilebones commented 3 years ago

Fixed thanks to @debfx

pilebones / go-udev

netlink socket created with pid of the process #3