nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0
15.49k stars 1.38k forks source link

Jetstream server panic #2644

Closed minight closed 2 years ago

minight commented 2 years ago

When publishing messages to jetstream, we noticed our nats jetstream nodes would occasionally crash with the following panic.

[fatal error: sync: Unlock of unlocked RWMutex](<fatal error: sync: Unlock of unlocked RWMutex

goroutine 4066 [running]:
runtime.throw(0xb0a29a, 0x20)
        /home/travis/.gimme/versions/go1.16.8.linux.amd64/src/runtime/panic.go:1117 +0x72 fp=0xc001702f50 sp=0xc001702f20 pc=0x437d12
sync.throw(0xb0a29a, 0x20)
        /home/travis/.gimme/versions/go1.16.8.linux.amd64/src/runtime/panic.go:1103 +0x35 fp=0xc001702f70 sp=0xc001702f50 pc=0x46c3f5
sync.(*RWMutex).Unlock(0xc001640a80)
        /home/travis/.gimme/versions/go1.16.8.linux.amd64/src/sync/rwmutex.go:142 +0xc6 fp=0xc001702fa8 sp=0xc001702f70 pc=0x489ba6
github.com/nats-io/nats-server/server.(*consumer).processNextMsgReq.func3(0xc001640a80, 0xc0024d5a10, 0xc00280b540, 0xc000412000)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/consumer.go:1950 +0x50 fp=0xc001702fc0 sp=0xc001702fa8 pc=0x9a2150
runtime.goexit()
        /home/travis/.gimme/versions/go1.16.8.linux.amd64/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc001702fc8 sp=0xc001702fc0 pc=0x471341
created by github.com/nats-io/nats-server/server.(*consumer).processNextMsgReq
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/consumer.go:1947 +0x669>)

Looking at the source, it appears we have an extra mutex unlock.

The mutex has already a deferred unlock: https://github.com/nats-io/nats-server/blob/v2.6.1/server/consumer.go#L1862-L1863 which is manually unlocked again here https://github.com/nats-io/nats-server/blob/v2.6.1/server/consumer.go#L1940

Running: nats:2.6.1-alpine on k8s

Config:

    jetstream {
      max_mem: 1Gi
      store_dir: /data

      max_file:10Gi
    }

Not sure what message or subscriber triggers this as we are sending a few thousand messages a second.

derekcollison commented 2 years ago

IIRC this is fixed in 2.6.2.

minight commented 2 years ago

I've updated to nats 2.6.2, and i'm still receiving a panic with a slightyl different stack trace

[80] 2021/10/24 20:51:28.288830 [INF] Starting nats-server
[80] 2021/10/24 20:51:28.288861 [INF]   Version:  2.6.2
[80] 2021/10/24 20:51:28.288864 [INF]   Git:      [f7c3ac5]
[80] 2021/10/24 20:51:28.288867 [INF]   Name:     nats-js-2
[80] 2021/10/24 20:51:28.288871 [INF]   Node:     CsHLmbQ9
[80] 2021/10/24 20:51:28.288873 [INF]   ID:       NBYV5M3RLYSJX2Q535WL2TAVNMEJVHBLACPACM27TBIUQVRHVGDO6XPH
[80] 2021/10/24 20:51:28.288876 [WRN] Plaintext passwords detected, use nkeys or bcrypt
[80] 2021/10/24 20:51:28.288879 [INF] Using configuration file: /etc/nats-config/nats.conf
[80] 2021/10/24 20:51:28.291675 [INF] Starting JetStream
[80] 2021/10/24 20:51:28.292969 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[80] 2021/10/24 20:51:28.293146 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[80] 2021/10/24 20:51:28.293183 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[80] 2021/10/24 20:51:28.293239 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[80] 2021/10/24 20:51:28.293271 [INF]
[80] 2021/10/24 20:51:28.293289 [INF]          https://docs.nats.io/jetstream
[80] 2021/10/24 20:51:28.293331 [INF]
[80] 2021/10/24 20:51:28.293363 [INF] ---------------- JETSTREAM ----------------
[80] 2021/10/24 20:51:28.293392 [INF]   Max Memory:      1.00 GB
[80] 2021/10/24 20:51:28.293455 [INF]   Max Storage:     10.00 GB
[80] 2021/10/24 20:51:28.293472 [INF]   Store Directory: "/data/jetstream"
[80] 2021/10/24 20:51:28.293560 [INF] -------------------------------------------

.....
<snip>
...
[80] 2021/10/24 20:53:09.018619 [WRN] JetStream API limit exceeded: 1025 calls outstanding
fatal error: sync: Unlock of unlocked RWMutex

goroutine 157 [running]:
runtime.throw(0xb0e8a7, 0x20)
        /home/travis/.gimme/versions/go1.16.9.linux.amd64/src/runtime/panic.go:1117 +0x72 fp=0xc0021d1158 sp=0xc0021d1128 pc=0x437d12
sync.throw(0xb0e8a7, 0x20)
        /home/travis/.gimme/versions/go1.16.9.linux.amd64/src/runtime/panic.go:1103 +0x35 fp=0xc0021d1178 sp=0xc0021d1158 pc=0x46c435
sync.(*RWMutex).Unlock(0xc000e8aa80)
        /home/travis/.gimme/versions/go1.16.9.linux.amd64/src/sync/rwmutex.go:142 +0xc6 fp=0xc0021d11b0 sp=0xc0021d1178 pc=0x489be6
github.com/nats-io/nats-server/server.(*consumer).processNextMsgReq(0xc000e8aa80, 0xc002057a40, 0xc0001ab300, 0xc0000c8d80, 0xc003326300, 0x40, 0xc000a5d4c0, 0x1d, 0xc002d15422, 0x20, ...)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/consumer.go:1944 +0x61d fp=0xc0021d12c0 sp=0xc0021d11b0 pc=0x7e951d
github.com/nats-io/nats-server/server.(*consumer).processNextMsgReq-fm(0xc002057a40, 0xc0001ab300, 0xc0000c8d80, 0xc003326300, 0x40, 0xc000a5d4c0, 0x1d, 0xc002d15422, 0x20, 0xebde)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/consumer.go:1860 +0xb4 fp=0xc0021d1328 sp=0xc0021d12c0 pc=0x9c5cf4
github.com/nats-io/nats-server/server.(*client).deliverMsg(0xc0001ab300, 0xc002057a40, 0xc0000c8d80, 0xc002d153bf, 0x40, 0xec41, 0xc002d15400, 0x1d, 0xec00, 0xc0001ac781, ...)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3147 +0xa7c fp=0xc0021d1540 sp=0xc0021d1328 pc=0x7d107c
github.com/nats-io/nats-server/server.(*client).processMsgResults(0xc0001ab300, 0xc0000c8d80, 0xc002df4510, 0xc002d15422, 0x22, 0xebde, 0x0, 0x0, 0x0, 0xc002d153bf, ...)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:4122 +0x5ca fp=0xc0021d1a08 sp=0xc0021d1540 pc=0x7d732a
github.com/nats-io/nats-server/server.(*client).processInboundRoutedMsg(0xc0001ab300, 0xc002d15422, 0x22, 0xebde)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/route.go:443 +0x173 fp=0xc0021d1ac0 sp=0xc0021d1a08 pc=0x9517b3
github.com/nats-io/nats-server/server.(*client).processInboundMsg(0xc0001ab300, 0xc002d15422, 0x22, 0xebde)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3460 +0x6c fp=0xc0021d1af8 sp=0xc0021d1ac0 pc=0x7d2bec
github.com/nats-io/nats-server/server.(*client).parse(0xc0001ab300, 0xc002d14000, 0x10000, 0x10000, 0x10000, 0x0)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/parser.go:477 +0x2525 fp=0xc0021d1d88 sp=0xc0021d1af8 pc=0x924565
github.com/nats-io/nats-server/server.(*client).readLoop(0xc0001ab300, 0x0, 0x0, 0x0)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:1174 +0x5f8 fp=0xc0021d1fb0 sp=0xc0021d1d88 pc=0x7c25d8
github.com/nats-io/nats-server/server.(*Server).createRoute.func1()
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/route.go:1375 +0x3b fp=0xc0021d1fe0 sp=0xc0021d1fb0 pc=0x9b8a3b
runtime.goexit()
        /home/travis/.gimme/versions/go1.16.9.linux.amd64/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc0021d1fe8 sp=0xc0021d1fe0 pc=0x471381
created by github.com/nats-io/nats-server/server.(*Server).startGoRoutine
derekcollison commented 2 years ago

Apologies it was legit. Fix coming. Thanks!