sasha-s / go-deadlock

Online deadlock detection in go (golang)
Apache License 2.0

Question: potential deadlock with goroutine stuck on internal lock? #15

Open Thegaram opened 5 years ago

Thegaram commented 5 years ago

Hey, first of all thanks for the hard work on this great lib!

I'm having trouble interpreting the output below. It suggests goroutine 77264 holds lock 0xc4202a60e0 for a long time, preventing others (like goroutine 77325 and many more) from acquiring it.

However, the output suggests that goroutine 77264 actually got stuck during the unlock: raft.go:688 is a deferred mu.Unlock(), and deadlock.go:330 is actually a lock-acquire statement inside this lib.

Does this mean that the (potential) deadlock is coming from this lib in this case? What would make goroutine 77264 get stuck on that internal lock? (I reproduced the same output with a 1m30s lock timeout.)

POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 77264 lock 0xc4202a60e0
../raft/raft.go:618 raft.(*Raft).AppendEntriesRPCHandler { rf.mu.Lock() } <<<<<
/usr/local/Cellar/go/1.9/libexec/src/runtime/asm_amd64.s:509 runtime.call32 { CALLFN(·call32, 32) }
/usr/local/Cellar/go/1.9/libexec/src/reflect/value.go:434 reflect.Value.call { call(frametype, fn, args, uint32(frametype.size), uint32(retOffset)) }
/usr/local/Cellar/go/1.9/libexec/src/reflect/value.go:302 reflect.Value.Call { return v.call("Call", in) }
../labrpc/labrpc.go:478 labrpc.(*Service).dispatch { function.Call([]reflect.Value{svc.rcvr, args.Elem(), replyv}) }
../labrpc/labrpc.go:402 labrpc.(*Server).dispatch { return service.dispatch(methodName, req) }
../labrpc/labrpc.go:229 labrpc.(*Network).ProcessReq.func1 { r := server.dispatch(req) }

Have been trying to lock it again for more than 30s
goroutine 77325 lock 0xc4202a60e0
../raft/raft.go:618 raft.(*Raft).AppendEntriesRPCHandler { rf.mu.Lock() } <<<<<
/usr/local/Cellar/go/1.9/libexec/src/runtime/asm_amd64.s:509 runtime.call32 { CALLFN(·call32, 32) }
/usr/local/Cellar/go/1.9/libexec/src/reflect/value.go:434 reflect.Value.call { call(frametype, fn, args, uint32(frametype.size), uint32(retOffset)) }
/usr/local/Cellar/go/1.9/libexec/src/reflect/value.go:302 reflect.Value.Call { return v.call("Call", in) }
../labrpc/labrpc.go:478 labrpc.(*Service).dispatch { function.Call([]reflect.Value{svc.rcvr, args.Elem(), replyv}) }
../labrpc/labrpc.go:402 labrpc.(*Server).dispatch { return service.dispatch(methodName, req) }
../labrpc/labrpc.go:229 labrpc.(*Network).ProcessReq.func1 { r := server.dispatch(req) }

Here is what goroutine 77264 doing now
goroutine 77264 [semacquire]:
sync.runtime_SemacquireMutex(0xc420084744, 0xa900000000)
    /usr/local/Cellar/go/1.9/libexec/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc420084740)
    /usr/local/Cellar/go/1.9/libexec/src/sync/mutex.go:134 +0x14c
github.com/sasha-s/go-deadlock.(*lockOrder).postUnlock(0xc420084740, 0x1288e00, 0xc4202a60e0)
    .../src/github.com/sasha-s/go-deadlock/deadlock.go:330 +0x3f
github.com/sasha-s/go-deadlock.postUnlock(0x1288e00, 0xc4202a60e0)
    .../src/github.com/sasha-s/go-deadlock/deadlock.go:167 +0x5f
github.com/sasha-s/go-deadlock.(*Mutex).Unlock(0xc4202a60e0)
    .../src/github.com/sasha-s/go-deadlock/deadlock.go:97 +0x7f
raft.(*Raft).AppendEntriesRPCHandler(0xc4202a60e0, 0xc420beb840, 0xc421834620)
    .../src/raft/raft.go:688 +0xcc8

...
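For context, the handler the dump above points at boils down to the shape below. This is only a sketch: the deadlock.Mutex field and the Lock / deferred Unlock pair correspond to raft.go:618 and raft.go:688 in the trace; all other names and types are placeholders.

package raft

import "github.com/sasha-s/go-deadlock"

// Placeholder argument/reply types; the real ones are not shown in this issue.
type AppendEntriesArgs struct{}
type AppendEntriesReply struct{}

type Raft struct {
    mu deadlock.Mutex // drop-in replacement for sync.Mutex
    // ... other Raft state elided ...
}

func (rf *Raft) AppendEntriesRPCHandler(args *AppendEntriesArgs, reply *AppendEntriesReply) {
    rf.mu.Lock()         // raft.go:618 in the report above
    defer rf.mu.Unlock() // the deferred unlock attributed to raft.go:688
    // ... handler body elided ...
}
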
sasha-s commented 5 years ago

Hi. Looking at the output, it seems you are right. Somehow l.mu got locked, so we are stuck at deadlock.go:330. I do not see how that could happen. I would look at a dump of the other goroutines to see if any of them are doing something in deadlock.go (under the l.mu lock). Or, if you have something reproducible, I'd love to look at it.
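For reference, one way to capture such a dump from a live process is to register a signal handler that writes all goroutine stacks. The snippet below is just a sketch (SIGUSR1 is an arbitrary choice, and this is not specific to go-deadlock); sending SIGQUIT to the process achieves the same thing at the cost of killing it.

package main

import (
    "os"
    "os/signal"
    "runtime/pprof"
    "syscall"
)

func main() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGUSR1)
    go func() {
        for range sigs {
            // debug=2 prints the full stack of every goroutine,
            // including anything parked inside deadlock.go under l.mu.
            pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
        }
    }()

    select {} // stand-in for the real program
}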

sasha-s commented 5 years ago

I found one bug in the library and hopefully fixed it in https://github.com/sasha-s/go-deadlock/commit/1595213edefa28ca5047b00340c63557f9c051d0. Not sure it will solve your problem, but could you pull and try again?
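(For anyone on Go modules: that commit resolves to the pseudo-version visible in the traces below, so it can be pinned with a go.mod line along these lines.)

require github.com/sasha-s/go-deadlock v0.2.1-0.20190427202633-1595213edefa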

zhiqiangxu commented 4 years ago

I just implemented a deadlock detector with easy-to-prove correctness:

https://github.com/zhiqiangxu/util/tree/master/deadlock

dunglas commented 4 years ago

I'm facing the same problem while debugging Mercure:

POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 5212824 lock 0xc0001aa780
hub/bolt_transport.go:120 hub.(*BoltTransport).Dispatch { t.Lock() } <<<<<
hub/subscribe.go:201 hub.(*Hub).dispatchSubscriptionUpdate { h.transport.Dispatch(u) }
hub/subscribe.go:180 hub.(*Hub).shutdown { h.dispatchSubscriptionUpdate(s, false) }
hub/subscribe.go:49 hub.(*Hub).SubscribeHandler { return }
/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2012 http.HandlerFunc.ServeHTTP { f(w, r) }
../go/pkg/mod/github.com/gorilla/mux@v1.7.4/mux.go:210 mux.(*Router).ServeHTTP { handler.ServeHTTP(w, req) }
../go/pkg/mod/github.com/gorilla/handlers@v1.4.2/cors.go:54 handlers.(*cors).ServeHTTP { ch.h.ServeHTTP(w, r) }
../go/pkg/mod/github.com/unrolled/secure@v1.0.8/secure.go:177 secure.(*Secure).Handler.func1 { h.ServeHTTP(w, r) }
/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2012 http.HandlerFunc.ServeHTTP { f(w, r) }
../go/pkg/mod/github.com/gorilla/handlers@v1.4.2/logging.go:45 handlers.loggingHandler.ServeHTTP { h.handler.ServeHTTP(logger, req) }
../go/pkg/mod/github.com/gorilla/handlers@v1.4.2/recovery.go:78 handlers.recoveryHandler.ServeHTTP { h.handler.ServeHTTP(w, req) }
../go/pkg/mod/github.com/gorilla/mux@v1.7.4/mux.go:210 mux.(*Router).ServeHTTP { handler.ServeHTTP(w, req) }
/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2807 http.serverHandler.ServeHTTP { handler.ServeHTTP(rw, req) }
/usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:1895 http.(*conn).serve { serverHandler{c.server}.ServeHTTP(w, w.req) }

Have been trying to lock it again for more than 30s
goroutine 5373639 lock 0xc0001aa780

Here is what goroutine 5212824 doing now
goroutine 5212824 [semacquire]:
sync.runtime_SemacquireMutex(0xc00000ee24, 0x0, 0x1)
    /usr/local/Cellar/go/1.14.3/libexec/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00000ee20)
    /usr/local/Cellar/go/1.14.3/libexec/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
    /usr/local/Cellar/go/1.14.3/libexec/src/sync/mutex.go:81
github.com/sasha-s/go-deadlock.(*lockOrder).postUnlock(0xc00000ee20, 0x16feaa0, 0xc0001aa780)
    /Users/dunglas/workspace/go/pkg/mod/github.com/sasha-s/go-deadlock@v0.2.1-0.20190427202633-1595213edefa/deadlock.go:330 +0xa6
github.com/sasha-s/go-deadlock.postUnlock(...)
    /Users/dunglas/workspace/go/pkg/mod/github.com/sasha-s/go-deadlock@v0.2.1-0.20190427202633-1595213edefa/deadlock.go:167
github.com/sasha-s/go-deadlock.(*RWMutex).Unlock(0xc0001aa780)
    /Users/dunglas/workspace/go/pkg/mod/github.com/sasha-s/go-deadlock@v0.2.1-0.20190427202633-1595213edefa/deadlock.go:129 +0x66
github.com/dunglas/mercure/hub.(*BoltTransport).Dispatch(0xc0001aa780, 0xc001522600, 0x0, 0x0)
    /Users/dunglas/workspace/mercure/hub/bolt_transport.go:133 +0x251
github.com/dunglas/mercure/hub.(*Hub).dispatchSubscriptionUpdate(0xc000298420, 0xc002c1f5e0, 0xc000f81000)
    /Users/dunglas/workspace/mercure/hub/subscribe.go:201 +0x2f2
github.com/dunglas/mercure/hub.(*Hub).shutdown(0xc000298420, 0xc002c1f5e0)
    /Users/dunglas/workspace/mercure/hub/subscribe.go:180 +0x4c
github.com/dunglas/mercure/hub.(*Hub).SubscribeHandler(0xc000298420, 0x2bb4068, 0xc00102f260, 0xc000fae900)
    /Users/dunglas/workspace/mercure/hub/subscribe.go:49 +0x66e
net/http.HandlerFunc.ServeHTTP(0xc000280ea0, 0x2bb4068, 0xc00102f260, 0xc000fae900)
    /usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2012 +0x44
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0001dc180, 0x2bb4068, 0xc00102f260, 0xc002d4f500)
    /Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/mux@v1.7.4/mux.go:210 +0xe2
github.com/gorilla/handlers.(*cors).ServeHTTP(0xc0001942d0, 0x2bb4068, 0xc00102f260, 0xc002d4f500)
    /Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/handlers@v1.4.2/cors.go:54 +0x102a
github.com/unrolled/secure.(*Secure).Handler.func1(0x2bb4068, 0xc00102f260, 0xc002d4f500)
    /Users/dunglas/workspace/go/pkg/mod/github.com/unrolled/secure@v1.0.8/secure.go:177 +0xdf
net/http.HandlerFunc.ServeHTTP(0xc00000ff80, 0x2bb4068, 0xc00102f260, 0xc002d4f500)
    /usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2012 +0x44
github.com/gorilla/handlers.loggingHandler.ServeHTTP(0x18849a0, 0xc000010020, 0x1885560, 0xc00000ff80, 0x17b87b0, 0x18907a0, 0xc002c1f500, 0xc002d4f500)
    /Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/handlers@v1.4.2/logging.go:45 +0x243
github.com/gorilla/handlers.recoveryHandler.ServeHTTP(0x18850a0, 0xc000298de0, 0x1883d00, 0xc0002a1b20, 0x1, 0x18907a0, 0xc002c1f500, 0xc002d4f500)
    /Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/handlers@v1.4.2/recovery.go:78 +0xce
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0001dc0c0, 0x18907a0, 0xc002c1f500, 0xc000a19e00)
    /Users/dunglas/workspace/go/pkg/mod/github.com/gorilla/mux@v1.7.4/mux.go:210 +0xe2
net/http.serverHandler.ServeHTTP(0xc0001c62a0, 0x18907a0, 0xc002c1f500, 0xc000a19e00)
    /usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2807 +0xa3
net/http.(*conn).serve(0xc0035ebcc0, 0x1892f60, 0xc002980200)
    /usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:1895 +0x86c
created by net/http.(*Server).Serve
    /usr/local/Cellar/go/1.14.3/libexec/src/net/http/server.go:2933 +0x35c

I'm using the same version of the library. Thanks for writing such a useful tool, btw!

ghostiam commented 3 years ago

v0.2.1 (1595213edefa)

This blocking occurs in the caching layer, which is heavily loaded (about 100,000 requests per second). The blocking does not happen immediately: sometimes it takes up to a week, and sometimes it shows up the day after a restart.

POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 1595446618 lock 0xc00048e7d0
<project>/pkg/stracing/mutex.go:48 stracing.(*Mutex).Lock ??? <<<<<
...

Have been trying to lock it again for more than 5m0s
goroutine 1595163670 lock 0xc00048e7d0

Here is what goroutine 1595446618 doing now
goroutine 1595446618 [semacquire]:
sync.runtime_SemacquireMutex(0xc0001108a4, 0xd00, 0x1)
 runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0001108a0)
 sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
 sync/mutex.go:81
github.com/sasha-s/go-deadlock.(*lockOrder).postUnlock(0xc0001108a0, 0xc54900, 0xc00048e7d0)
 github.com/sasha-s/go-deadlock@v0.2.1-0.20190427202633-1595213edefa/deadlock.go:330 +0xa6
github.com/sasha-s/go-deadlock.postUnlock(...)
 github.com/sasha-s/go-deadlock@v0.2.1-0.20190427202633-1595213edefa/deadlock.go:167
github.com/sasha-s/go-deadlock.(*Mutex).Unlock(0xc00048e7d0)
 github.com/sasha-s/go-deadlock@v0.2.1-0.20190427202633-1595213edefa/deadlock.go:97 +0x65
...
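Reading those frames, every Unlock of a deadlock.Mutex goes through postUnlock, which grabs a single detector-internal sync.Mutex (the lockOrder's l.mu). The snippet below is only a schematic paraphrase of the stack above, not the library's actual source; in particular, the exact point at which the user-visible mutex is released is assumed, not visible in the trace.

package sketch

import "sync"

// One process-wide bookkeeping structure shared by every deadlock.Mutex.
type lockOrder struct {
    mu sync.Mutex // the l.mu the stuck goroutine is waiting on (deadlock.go:330)
    // ... lock-order bookkeeping elided ...
}

var order lockOrder

type Mutex struct {
    mu sync.Mutex // the user-visible mutex
}

func (m *Mutex) Unlock() {
    m.mu.Unlock() // release the user-visible mutex (ordering assumed, see above)

    // deadlock.go:167 -> deadlock.go:330: update the bookkeeping under the
    // shared mutex. Until this completes, the watchdog still lists this
    // goroutine as the holder, which is what the report above shows.
    order.mu.Lock()
    // ... clear this lock from the bookkeeping ...
    order.mu.Unlock()
}
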
ghostiam commented 3 years ago

But in this case the timeout is very small (1 second in the code below), while in my real code it is 5 minutes.

package main

import (
    "time"

    "github.com/sasha-s/go-deadlock"
)

func main() {
    deadlock.Opts.DeadlockTimeout = 1 * time.Second

    var mx deadlock.Mutex

    fn := func() {
        mx.Lock()
        mx.Unlock()
    }

    for i := 0; i < 100_000; i++ {
        go func() {
            for {
                fn()
            }
        }()
    }

    select {}
}

Perhaps the problem is not in this library but somewhere else. In my real case, the mutex protects taking the length of a slice and appending to it.

// code copied from https://github.com/jaegertracing/jaeger/blob/748bf213d1e5a03ebc4da63601473be3bbe193f4/examples/hotrod/pkg/tracing/mutex.go with my fixes
type Mutex struct {
    ...
    waiters     []string
    waitersLock deadlock.Mutex // <--
}

func (sm *Mutex) Lock(ctx context.Context, name string) {
    activeSpan := opentracing.SpanFromContext(ctx) // "activeSpan == noopSpan{}" in production

    sm.waitersLock.Lock() // <-- "Previous place where the lock was grabbed"
    if waiting := len(sm.waiters); waiting > 0 && activeSpan != nil {
        activeSpan.LogFields(...) // <-- func (n noopSpan) LogFields(fields ...log.Field) {}
    }
    sm.waiters = append(sm.waiters, name)
    sm.waitersLock.Unlock() // <-- "Have been trying to lock it again for more than 5m0s"

    sm.realLock.Lock()
    sm.holder = name

    sm.waitersLock.Lock()
    behindLen := len(sm.waiters) - 1
    sm.waitersLock.Unlock()

    if activeSpan != nil {
        activeSpan.LogFields(...)
    }
}