vishvananda / netlink

Simple netlink library for go.
Apache License 2.0
2.76k stars 719 forks source link

Subscribe and its variants run into a busy loop taking 100% CPU when groups is not empty #975

Open tnqn opened 2 months ago

tnqn commented 2 months ago

My colleage @hongliangl tried to upgrade netlink to a recent commit to pick up a required change. However, our CI became flaky when validating the new netlink version. We identified the flake started from the commit merged via https://github.com/vishvananda/netlink/pull/941.

The above commit changes the socket created by Subscribe to non-blocking when groups are provided. However, it didn't change how the message was received from the socket, causing the receiver goroutine to run into a busy loop, taking 100% CPU. https://github.com/vishvananda/netlink/blob/856e190dd707c02002dcdf6434424ef8af375ada/addr_linux.go#L367-L372

The issue can be reproduced by the following code:

// netlink-reproducer.go
package main

import (
    "fmt"
    "os"

    "github.com/vishvananda/netlink"
)

func main() {
    ch := make(chan netlink.AddrUpdate, 1000)
    stopCh := make(chan struct{})
    defer close(stopCh)
    if err := netlink.AddrSubscribeWithOptions(ch, stopCh, netlink.AddrSubscribeOptions{
        ErrorCallback: func(err error) {
            fmt.Fprintf(os.Stderr, "Received err: %v\n", err)
        },
    }); err != nil {
        fmt.Fprintf(os.Stderr, "Failed to subscribe: %v\n", err)
    }
    for {
        select {
        case update := <-ch:
            fmt.Fprintf(os.Stderr, "Received: %v\n", update)
        }
    }
}

The process would always take 100%+ CPU:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 412687 root      20   0 1968136   8172   2660 S 103.6   0.0   0:15.53 netlink-reprodu

To fix it, all subscribers' receiver goroutines should use poll or select to wait for events first before receiving messages from the socket. I could take a stab at fixing the implementation, but I wonder if we could revert the commit that introduces the bug first to unblock projects requiring other changes of the library.

tnqn commented 2 months ago

cc @aboch @kuroa-me

kuroa-me commented 2 months ago

Apologies for my sluppy implementation, this was such a basic mistake...

@tnqn Do you mind fixing the implementation after the revert? Would be a great opportunity for me to learn.

tnqn commented 2 months ago

@tnqn Do you mind fixing the implementation after the revert? Would be a great opportunity for me to learn.

Sure, I'm working on a fix, will ping you once I open a PR. Sorry for reverting the commit, I'm just not sure how long it takes to land the fix but we need some changes in the recent commit to unblock our project.