Open tnqn opened 2 months ago
cc @aboch @kuroa-me
Apologies for my sloppy implementation, this was such a basic mistake...
@tnqn Do you mind fixing the implementation after the revert? Would be a great opportunity for me to learn.
Sure, I'm working on a fix and will ping you once I open a PR. Sorry for reverting the commit; I'm not sure how long the fix will take to land, and we need some changes in a recent commit to unblock our project.
My colleague @hongliangl tried to upgrade netlink to a recent commit to pick up a required change. However, our CI became flaky when validating the new netlink version. We traced the flake to the commit merged via https://github.com/vishvananda/netlink/pull/941.
The above commit changes the socket created by `Subscribe` to non-blocking when groups are provided. However, it didn't change how the message was received from the socket, causing the receiver goroutine to run into a busy loop, taking 100% CPU.
https://github.com/vishvananda/netlink/blob/856e190dd707c02002dcdf6434424ef8af375ada/addr_linux.go#L367-L372
The issue can be reproduced by the following code:
The process would always take 100%+ CPU:
To fix it, all subscribers' receiver goroutines should use poll or select to wait for events before receiving messages from the socket. I could take a stab at fixing the implementation, but I wonder if we could first revert the commit that introduced the bug, to unblock projects that need other changes in the library.