liftbridge-io / liftbridge

Lightweight, fault-tolerant message streams.
https://liftbridge.io
Apache License 2.0

SIGBUS error causes liftbridge panic #315

Closed hl3w22bupt closed 2 years ago

hl3w22bupt commented 3 years ago

I launched a liftbridge server several days ago and have been writing data to it continuously through the nats-pub client. Unfortunately, the server panicked and exited; the call stack dump shows fatal error: fault with signal SIGBUS. I have hit panics before, such as fatal error: runtime: out of memory, which is intuitive and easy to understand: I could avoid running out of memory by limiting the write QPS. This time, however, the SIGBUS error may imply some unsafe memory access, so I suspect there may be a bug in the unsafe mmap handling when processing the commit log. Has anyone else encountered this?

By the way, in my case the liftbridge cluster contains only one server, which bootstraps as the Raft seed via the --raft-bootstrap-seed command-line option.

goroutine 317597 [chan receive, 36 minutes]:
github.com/nats-io/nats%2ego.(*Conn).flusher(0xc000244100)
    /opt/run/compile_path/pkg/mod/github.com/nats-io/nats.go@v1.10.0/nats.go:2392 +0xef
created by github.com/nats-io/nats%2ego.(*Conn).processConnectInit
    /opt/run/compile_path/pkg/mod/github.com/nats-io/nats.go@v1.10.0/nats.go:1490 +0x1b9
unexpected fault address 0x7f6e81372020
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7f6e81372020 pc=0x472d6a]

goroutine 183 [running]:
runtime.throw(0x1072163, 0x5)
    /usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc0002b8b40 sp=0xc0002b8b10 pc=0x439412
runtime.sigpanic()
    /usr/local/go/src/runtime/signal_unix.go:739 +0x485 fp=0xc0002b8b70 sp=0xc0002b8b40 pc=0x44fa05
runtime.memmove(0x7f6e81372000, 0xc021a2ea00, 0x5000)
    /usr/local/go/src/runtime/memmove_amd64.s:365 +0x42a fp=0xc0002b8b78 sp=0xc0002b8b70 pc=0x472d6a
github.com/liftbridge-io/liftbridge/server/commitlog.(*index).writeAt(0xc000253110, 0xc021a2ea00, 0x5000, 0x53ec, 0x172000, 0xc01f5caac0, 0x0)
    /opt/run/compile_path/src/github.com/liftbridge-io/liftbridge/server/commitlog/index.go:213 +0xbf fp=0xc0002b8c08 sp=0xc0002b8b78 pc=0xcd401f
github.com/liftbridge-io/liftbridge/server/commitlog.(*index).writeEntries(0xc000253110, 0xc01e5c4000, 0x400, 0x400, 0x0, 0x0)
    /opt/run/compile_path/src/github.com/liftbridge-io/liftbridge/server/commitlog/index.go:141 +0x24f fp=0xc0002b8ca0 sp=0xc0002b8c08 pc=0xcd392f
github.com/liftbridge-io/liftbridge/server/commitlog.(*segment).WriteMessageSet(0xc00020fc30, 0xc021fbc000, 0x233000, 0x23b750, 0xc01e5c4000, 0x400, 0x400, 0x0, 0x0)
    /opt/run/compile_path/src/github.com/liftbridge-io/liftbridge/server/commitlog/segment.go:224 +0x14e fp=0xc0002b8d20 sp=0xc0002b8ca0 pc=0xcdca2e
github.com/liftbridge-io/liftbridge/server/commitlog.(*commitLog).append(0xc0005cc200, 0xc00020fc30, 0xc021fbc000, 0x233000, 0x23b750, 0xc01e5c4000, 0x400, 0x400, 0xc01e5c4000, 0x400, ...)
    /opt/run/compile_path/src/github.com/liftbridge-io/liftbridge/server/commitlog/commitlog.go:253 +0x85 fp=0xc0002b8d98 sp=0xc0002b8d20 pc=0xccbe85
github.com/liftbridge-io/liftbridge/server/commitlog.(*commitLog).Append(0xc0005cc200, 0xc000356000, 0x400, 0x400, 0xc01e34a001, 0x400, 0x400, 0x0, 0x0)
    /opt/run/compile_path/src/github.com/liftbridge-io/liftbridge/server/commitlog/commitlog.go:233 +0x15c fp=0xc0002b8e20 sp=0xc0002b8d98 pc=0xccbb9c
github.com/liftbridge-io/liftbridge/server.(*partition).messageProcessingLoop(0xc0005d6480, 0xc0005c45a0, 0xc000172a80, 0x8)
    /opt/run/compile_path/src/github.com/liftbridge-io/liftbridge/server/partition.go:837 +0x343 fp=0xc0002b8f78 sp=0xc0002b8e20 pc=0xe0a8a3
github.com/liftbridge-io/liftbridge/server.(*partition).becomeLeader.func1()
    /opt/run/compile_path/src/github.com/liftbridge-io/liftbridge/server/partition.go:480 +0x4d fp=0xc0002b8fb0 sp=0xc0002b8f78 pc=0xe1e84d
github.com/liftbridge-io/liftbridge/server.(*Server).startGoroutine.func1(0xc00000f960, 0xc000492000)
    /opt/run/compile_path/src/github.com/liftbridge-io/liftbridge/server/server.go:945 +0x27 fp=0xc0002b8fd0 sp=0xc0002b8fb0 pc=0xe1f9c7
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc0002b8fd8 sp=0xc0002b8fd0 pc=0x471b41
created by github.com/liftbridge-io/liftbridge/server.(*Server).startGoroutine
    /opt/run/compile_path/src/github.com/liftbridge-io/liftbridge/server/server.go:944 +0x9a

Again, my liftbridge server is v1.5.0 and the Go version is 1.15.5.

tylertreat commented 3 years ago

After a cursory investigation, it appears this may be due to limited disk space while flushing mmap. My guess is this is the same issue. Is it possible your node is low on disk space when this error occurs?

hl3w22bupt commented 3 years ago

> After a cursory investigation, it appears this may be due to limited disk space while flushing mmap. My guess is this is the same issue. Is it possible your node is low on disk space when this error occurs?

Hmm, I'm not sure, but it's possible. I will keep an eye on the disk space if it happens again.
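
For anyone who wants to check the disk-space hypothesis, here is a minimal sketch of a free-space check on the volume that holds the data directory. It is not part of liftbridge. The helper names freeBytes and hasHeadroom, the 1 GiB threshold, and the default path "." are all assumptions made for illustration, and it relies on syscall.Statfs, so it only builds on Unix-like systems such as Linux.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// freeBytes (hypothetical helper) reports how many bytes are available to
// unprivileged users on the filesystem that contains path.
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	// Available blocks multiplied by the block size gives bytes.
	return uint64(st.Bavail) * uint64(st.Bsize), nil
}

// hasHeadroom (hypothetical helper) reports whether free meets the minimum.
func hasHeadroom(free, minFree uint64) bool {
	return free >= minFree
}

func main() {
	// Pass the server's data directory as the first argument; "." is only
	// a placeholder default.
	dir := "."
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}

	// Arbitrary example threshold: 1 GiB.
	const minFree = uint64(1) << 30

	free, err := freeBytes(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "statfs failed:", err)
		os.Exit(1)
	}
	fmt.Printf("%s: %d bytes free\n", dir, free)
	if !hasHeadroom(free, minFree) {
		fmt.Println("warning: low disk space; mmap-backed writes may fault")
	}
}
```

Running this periodically (for example from cron) against the data directory and comparing the timestamps with the next crash would help confirm or rule out low disk space as the cause.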

tylertreat commented 2 years ago

Closing this as I believe it is due to disk space. Please re-open if this is not the case.