lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License

signal: data-race during shutdown #6137

Open Crypt-iQ opened 2 years ago

Crypt-iQ commented 2 years ago

It failed to restore the stack, so I'm not sure the output is very useful.

==================
WARNING: DATA RACE
Read at 0x00000381a6c0 by goroutine 21:
  github.com/lightningnetwork/lnd/signal.(*Interceptor).mainInterruptHandler()
      /Users/nsa/lnd/signal/signal.go:177 +0x2de
  github.com/lightningnetwork/lnd/signal.Intercept·dwrap·1()
      /Users/nsa/lnd/signal/signal.go:139 +0x39

Previous write at 0x00000381a6c0 by main goroutine:
  [failed to restore the stack]

Goroutine 21 (running) created at:
  github.com/lightningnetwork/lnd/signal.Intercept()
      /Users/nsa/lnd/signal/signal.go:139 +0x2d0
  main.main()
      /Users/nsa/lnd/cmd/lnd/main.go:14 +0x3a
==================
==================
WARNING: DATA RACE
Read at 0x00c00079a6c0 by goroutine 21:
  github.com/lightningnetwork/lnd.(*replaceableLogger).Infof()
      <autogenerated>:1 +0x64
  github.com/lightningnetwork/lnd/signal.(*Interceptor).mainInterruptHandler()
      /Users/nsa/lnd/signal/signal.go:177 +0x311
  github.com/lightningnetwork/lnd/signal.Intercept·dwrap·1()
      /Users/nsa/lnd/signal/signal.go:139 +0x39

Previous write at 0x00c00079a6c0 by main goroutine:
  [failed to restore the stack]

Goroutine 21 (running) created at:
  github.com/lightningnetwork/lnd/signal.Intercept()
      /Users/nsa/lnd/signal/signal.go:139 +0x2d0
  main.main()
      /Users/nsa/lnd/cmd/lnd/main.go:14 +0x3a
==================
==================
WARNING: DATA RACE
Read at 0x00c00000e660 by goroutine 21:
  github.com/lightningnetwork/lnd/build.(*ShutdownLogger).Infof()
      <autogenerated>:1 +0x64
  github.com/lightningnetwork/lnd.(*replaceableLogger).Infof()
      <autogenerated>:1 +0x92
  github.com/lightningnetwork/lnd/signal.(*Interceptor).mainInterruptHandler()
      /Users/nsa/lnd/signal/signal.go:177 +0x311
  github.com/lightningnetwork/lnd/signal.Intercept·dwrap·1()
      /Users/nsa/lnd/signal/signal.go:139 +0x39

Previous write at 0x00c00000e660 by main goroutine:
  [failed to restore the stack]

Goroutine 21 (running) created at:
  github.com/lightningnetwork/lnd/signal.Intercept()
      /Users/nsa/lnd/signal/signal.go:139 +0x2d0
  main.main()
      /Users/nsa/lnd/cmd/lnd/main.go:14 +0x3a
==================
l0k18 commented 2 years ago

Are there any specific hints about the conditions that trigger this, or does it always happen once everything is up and running?

Crypt-iQ commented 2 years ago

It happens when everything is up and running and then trying to stop lnd via ctrl-c. To build with the race detector, I call make install, grab the line that begins with go install -v ..., and add a -race after the -v. So for me that looks like:

go install -v -race -tags="" -ldflags " -s -w -X github.com/lightningnetwork/lnd/build.Commit=v0.15.0-beta-185-g167d3be37 -X github.com/lightningnetwork/lnd/build.CommitHash=167d3be3793948dd740c09f1fff4f32f7f7b55d9 -X github.com/lightningnetwork/lnd/build.GoVersion=go1.18.2 -X github.com/lightningnetwork/lnd/build.RawTags=" github.com/lightningnetwork/lnd/cmd/lnd
l0k18 commented 2 years ago

Looking over the Makefile, I have just become aware that there is no option for non-test builds with the race flag on. But I can also see immediately from the race report you posted that the race is in the logger. I am going to experiment with running a standard build with race detection turned on to see what is going on. I also have a fully synced btcd here on my LAN.

l0k18 commented 2 years ago

I have now replicated the problem, and I can see that indeed the issue comes from there being only one logger created in the signal package and two goroutines using it simultaneously.

// Intercept starts the interception of interrupt signals and returns an `Interceptor` instance.
// Note that any previous active interceptor must be stopped before a new one can be created.
func Intercept() (Interceptor, error) {
    if !atomic.CompareAndSwapInt32(&started, 0, 1) {
        return Interceptor{}, errors.New("intercept already started")
    }

    channels := Interceptor{
        interruptChannel:       make(chan os.Signal, 1),
        shutdownChannel:        make(chan struct{}),
        shutdownRequestChannel: make(chan struct{}),
        quit:                   make(chan struct{}),
    }

    signalsToCatch := []os.Signal{
        os.Interrupt,
        os.Kill,
        syscall.SIGTERM,
        syscall.SIGQUIT,
    }
    signal.Notify(channels.interruptChannel, signalsToCatch...)
    go channels.mainInterruptHandler()

    return channels, nil
}

My guess is that the solution is to instantiate another logger for use inside the goroutine running channels.mainInterruptHandler(). The 'log' variable is being accessed by this goroutine and the main goroutine at the same time. It is not a critical issue, and it is also very simple to resolve. The correct place to instantiate the second logger would be within that handler function.
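To make that concrete, here is a minimal, self-contained sketch of the pattern (illustrative names only, not lnd's actual types): the handler goroutine captures its own logger value when it starts, so later writes to any shared package-level logger variable by the main goroutine can no longer race with reads inside the goroutine.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Logger is a stand-in for the replaceable logger interface.
type Logger interface {
	Infof(format string, a ...interface{})
}

// bufLogger records output so the behavior is observable; it locks
// internally, so it is safe to share.
type bufLogger struct {
	mu  sync.Mutex
	buf strings.Builder
}

func (b *bufLogger) Infof(format string, a ...interface{}) {
	b.mu.Lock()
	defer b.mu.Unlock()
	fmt.Fprintf(&b.buf, format+"\n", a...)
}

func (b *bufLogger) String() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.String()
}

// startHandler captures its own logger value at start-up, so the main
// goroutine cannot race with this goroutine on a shared logger variable.
func startHandler(log Logger, events <-chan string, done chan<- struct{}) {
	go func() {
		defer close(done)
		for ev := range events {
			log.Infof("Received %v", ev)
		}
	}()
}

func main() {
	private := &bufLogger{}
	events := make(chan string)
	done := make(chan struct{})
	startHandler(private, events, done)

	events <- "SIGTERM"
	close(events)
	<-done
	fmt.Print(private.String()) // Received SIGTERM
}
```

Running this under go run -race reports no races, because the handler goroutine never touches a variable the main goroutine writes to.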

I have dug a bit deeper and found that the logger usage is a little complicated here. My recommendation would be to not use the logger in mainInterruptHandler at all, but instead write to stderr directly. I will test whether disabling the log prints inside this goroutine removes the race.

An initial test, just commenting out all of the calls to the logger inside that function, shows it does indeed fix the problem. I am not sure how best to apply the fix, however.

This is the simplest option:

// mainInterruptHandler listens for SIGINT (Ctrl+C) signals on the
// interruptChannel and shutdown requests on the shutdownRequestChannel, and
// invokes the registered interruptCallbacks accordingly. It also listens for
// callback registration.
// It must be run as a goroutine.
func (c *Interceptor) mainInterruptHandler() {
    defer atomic.StoreInt32(&started, 0)
    // isShutdown is a flag which is used to indicate whether or not
    // the shutdown signal has already been received and hence any future
    // attempts to add a new interrupt handler should invoke them
    // immediately.
    var isShutdown bool

    // This is a fix for the race condition that occurs because the
    // shared logger is used by this goroutine and the main goroutine
    // simultaneously: print directly to stderr instead. The appended
    // newline matches the line-per-call behavior of the logger's Infof.
    log := struct {
        Infof func(format string, a ...interface{})
    }{
        Infof: func(format string, a ...interface{}) {
            fmt.Fprintf(os.Stderr, format+"\n", a...)
        },
    }

    // shutdown invokes the registered interrupt handlers, then signals the
    // shutdownChannel.
    shutdown := func() {
        // Ignore more than one shutdown signal.
        if isShutdown {
            log.Infof("Already shutting down...")
            return
        }
        isShutdown = true
        log.Infof("Shutting down...")
        c.Notifier.notifyStop()

        // Signal the main interrupt handler to exit, and stop accepting
        // post-facto requests.
        close(c.quit)
    }

    for {
        select {
        case signal := <-c.interruptChannel:
            log.Infof("Received %v", signal)
            shutdown()

        case <-c.shutdownRequestChannel:
            log.Infof("Received shutdown request.")
            shutdown()

        case <-c.quit:
            log.Infof("Gracefully shutting down.")
            close(c.shutdownChannel)
            signal.Stop(c.interruptChannel)
            return
        }
    }
}

It seems strange to me that this race doesn't show up elsewhere, since many packages spawn multiple goroutines. It may currently be concealed by the default 'info' level, under which the per-package loggers are rarely accessed by multiple goroutines at the same time. The obvious solution would be to put a mutex on the root logger variable and change all the print methods to take that lock when printing. Alternatively, extract the variable out of the shared slice so the slice itself isn't being raced on.
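The mutex idea could be sketched as below; everything here (safeLogger, Set, the Logger interface) is a hypothetical illustration, not lnd's actual logging API. An RWMutex-guarded wrapper lets one goroutine swap the logger while another prints through it, with every swap and every print taking the lock.

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

// Logger is a stand-in for the root logger's interface.
type Logger interface {
	Infof(format string, a ...interface{})
}

type stderrLogger struct{}

func (stderrLogger) Infof(format string, a ...interface{}) {
	fmt.Fprintf(os.Stderr, format+"\n", a...)
}

// memLogger records lines so the wrapper's behavior can be observed.
type memLogger struct {
	mu    sync.Mutex
	lines []string
}

func (m *memLogger) Infof(format string, a ...interface{}) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.lines = append(m.lines, fmt.Sprintf(format, a...))
}

// safeLogger guards a swappable Logger with an RWMutex, so replacing
// the logger and printing through it can happen concurrently from
// different goroutines without a data race.
type safeLogger struct {
	mu  sync.RWMutex
	log Logger
}

func (s *safeLogger) Set(l Logger) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.log = l
}

func (s *safeLogger) Infof(format string, a ...interface{}) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if s.log != nil {
		s.log.Infof(format, a...)
	}
}

func main() {
	root := &safeLogger{}
	root.Set(stderrLogger{})

	done := make(chan struct{})
	go func() {
		defer close(done)
		root.Infof("Shutting down...")
	}()
	// The main goroutine may swap the logger concurrently; the RWMutex
	// serializes the accesses, which is what the race detector checks.
	root.Set(stderrLogger{})
	<-done
}
```

Compared with the stderr-only fix above, this keeps log messages flowing through the normal logging pipeline, at the cost of a lock on every print.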