mochi-mqtt / server

The fully compliant, embeddable high-performance Go MQTT v5 server for IoT, smarthome, and pubsub
MIT License
1.3k stars 223 forks source link

badger crashed on server Close #298

Closed fangdingjun closed 1 year ago

fangdingjun commented 1 year ago

Hi,

I use badger to store the data, when I try to restart the server, sometimes crashed.

I make a demo that can reproduce the issue.

test step

  1. start the server

  2. start some clients connect to server and keep sending messages I use jmeter with mqtt plugin, 10 threads, infinite loop

  3. press CTRL+C to close the server

after ctrl+c the server crashed and print the stack

version

golang 1.20.4  linux/amd64
mochi-mqtt  v2.3.0

this is the demo code

package main

import (
        "os"
        "os/signal"
        "syscall"

        "github.com/mochi-mqtt/server/v2"
        "github.com/mochi-mqtt/server/v2/hooks/auth"
        "github.com/mochi-mqtt/server/v2/hooks/storage/badger"
        "github.com/mochi-mqtt/server/v2/listeners"
)

func main() {
        server := mqtt.New(nil)
        server.AddListener(listeners.NewTCP("t1", ":1883", nil))
        server.AddHook(new(auth.AllowHook), nil)
        os.MkdirAll("./data", 0755)
        server.AddHook(new(badger.Hook), &badger.Options{Path: "./data"})
        server.Serve()

        sigCh := make(chan os.Signal, 1)
        signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)

        <-sigCh

        server.Close()
}

this is the trace

  2:01PM INF stopping hook hook=allow-all-auth
  2:01PM INF stopping hook hook=badger-db
  2:01PM ERR failed to upsert client data error="Writes are blocked, possibly due to DropAll or Close" data={"clean":true,"id":"conn_0735f94a355d46d9b1","listener":"t1","properties":{"authenticationData":null,"authenticationMethod":"","maximumPacketSize":0,"receiveMaximum":0,"requestProblemInfo":0,"requestProblemInfoFlag":false,"requestResponseInfo":0,"sessionExpiryInterval":0,"sessionExpiryIntervalFlag":false,"topicAliasMaximum":0,"user":null},"protocolVersion":4,"remote":"192.168.56.1:50083","t":"CL","username":null,"will":{"flag":0,"payload":null,"qos":0,"retain":false,"topicName":"","user":null,"willDelayInterval":0}} hook=badger-db
  2:01PM ERR failed to upsert client data error="Writes are blocked, possibly due to DropAll or Close" data={"clean":true,"id":"conn_770e263a612e4a919f","listener":"t1","properties":{"authenticationData":null,"authenticationMethod":"","maximumPacketSize":0,"receiveMaximum":0,"requestProblemInfo":0,"requestProblemInfoFlag":false,"requestResponseInfo":0,"sessionExpiryInterval":0,"sessionExpiryIntervalFlag":false,"topicAliasMaximum":0,"user":null},"protocolVersion":4,"remote":"192.168.56.1:50090","t":"CL","username":null,"will":{"flag":0,"payload":null,"qos":0,"retain":false,"topicName":"","user":null,"willDelayInterval":0}} hook=badger-db
  2:01PM ERR failed to delete client data error="Writes are blocked, possibly due to DropAll or Close" data=conn_e9e94fb9cdf2402d81 hook=badger-db
  2:01PM ERR failed to delete client data error="Writes are blocked, possibly due to DropAll or Close" data=conn_a002bdaba2c242998a hook=badger-db
  2:01PM ERR failed to upsert client data error="Writes are blocked, possibly due to DropAll or Close" data={"clean":true,"id":"conn_4fe9764ad54c4ee587","listener":"t1","properties":{"authenticationData":null,"authenticationMethod":"","maximumPacketSize":0,"receiveMaximum":0,"requestProblemInfo":0,"requestProblemInfoFlag":false,"requestResponseInfo":0,"sessionExpiryInterval":0,"sessionExpiryIntervalFlag":false,"topicAliasMaximum":0,"user":null},"protocolVersion":4,"remote":"192.168.56.1:50082","t":"CL","username":null,"will":{"flag":0,"payload":null,"qos":0,"retain":false,"topicName":"","user":null,"willDelayInterval":0}} hook=badger-db
  2:01PM ERR failed to delete client data error="Writes are blocked, possibly due to DropAll or Close" data=conn_17ea77a48b234097ae hook=badger-db
  2:01PM ERR failed to delete client data error="Writes are blocked, possibly due to DropAll or Close" data=conn_919aa03c251c4ff790 hook=badger-db
  2:01PM ERR failed to delete client data error="Writes are blocked, possibly due to DropAll or Close" data=conn_b2377fadb8d242bd92 hook=badger-db
  panic: runtime error: invalid memory address or nil pointer dereference
  [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x864725]

  goroutine 1605 [running]:
  github.com/dgraph-io/badger/skl.(*Skiplist).IncrRef(...)
          /home/dingjun/source/go/pkg/mod/github.com/dgraph-io/badger@v1.6.0/skl/skl.go:86
  github.com/dgraph-io/badger.(*DB).getMemTables(0xc00021a380)
          /home/dingjun/source/go/pkg/mod/github.com/dgraph-io/badger@v1.6.0/db.go:481 +0xe5
  github.com/dgraph-io/badger.(*DB).get(0xc00021a380, {0xc00a0c0000, 0x2c, 0x2c})
          /home/dingjun/source/go/pkg/mod/github.com/dgraph-io/badger@v1.6.0/db.go:513 +0x7d
  github.com/dgraph-io/badger.(*Txn).Get(0xc009effa80, {0xc009c45fb0, 0x24, 0x30})
          /home/dingjun/source/go/pkg/mod/github.com/dgraph-io/badger@v1.6.0/txn.go:407 +0x247
  github.com/timshannon/badgerhold.(*Store).TxDelete(0x88a67a?, 0xc00007caf0?, {0x9155e0, 0xc00a0b8cb0}, {0x9415c0, 0xc00a09b200})
          /home/dingjun/source/go/pkg/mod/github.com/timshannon/badgerhold@v1.0.0/delete.go:32 +0x118
  github.com/timshannon/badgerhold.(*Store).Delete.func1(0xc00021a380?)
          /home/dingjun/source/go/pkg/mod/github.com/timshannon/badgerhold@v1.0.0/delete.go:17 +0x33
  github.com/dgraph-io/badger.(*DB).Update(0xc00004ac00?, 0xc009b7f338)
          /home/dingjun/source/go/pkg/mod/github.com/dgraph-io/badger@v1.6.0/txn.go:696 +0x83
  github.com/timshannon/badgerhold.(*Store).Delete(0xc009cdd95c?, {0x9155e0?, 0xc00a0b8cb0?}, {0x9415c0?, 0xc00a09b200?})
          /home/dingjun/source/go/pkg/mod/github.com/timshannon/badgerhold@v1.0.0/delete.go:16 +0x6f
  github.com/mochi-mqtt/server/v2/hooks/storage/badger.(*Hook).OnDisconnect(0xc000208cc0, 0xc009bf8780, {0x0?, 0x0?}, 0x1)
          /home/dingjun/source/go/pkg/mod/github.com/mochi-mqtt/server/v2@v2.3.0/hooks/storage/badger/badger.go:189 +0x115
  github.com/mochi-mqtt/server/v2.(*Hooks).OnDisconnect(0xc000072420?, 0x9c2c45?, {0x0, 0x0}, 0x0?)
          /home/dingjun/source/go/pkg/mod/github.com/mochi-mqtt/server/v2@v2.3.0/hooks.go:257 +0x13c
  github.com/mochi-mqtt/server/v2.(*Server).attachClient(0xc00007c640, 0xc009bf8780, {0x9bec2d, 0x2})
          /home/dingjun/source/go/pkg/mod/github.com/mochi-mqtt/server/v2@v2.3.0/server.go:381 +0x7c5
  github.com/mochi-mqtt/server/v2.(*Server).EstablishConnection(0x9?, {0x9bec2d, 0x2}, {0xa8daa0?, 0xc0099e3ac0?})
          /home/dingjun/source/go/pkg/mod/github.com/mochi-mqtt/server/v2@v2.3.0/server.go:311 +0x65
  github.com/mochi-mqtt/server/v2/listeners.(*TCP).Serve.func1()
          /home/dingjun/source/go/pkg/mod/github.com/mochi-mqtt/server/v2@v2.3.0/listeners/tcp.go:84 +0x47
  created by github.com/mochi-mqtt/server/v2/listeners.(*TCP).Serve
          /home/dingjun/source/go/pkg/mod/github.com/mochi-mqtt/server/v2@v2.3.0/listeners/tcp.go:83 +0x145
fangdingjun commented 1 year ago

may be this is caused by continue to call hook methods after the hook.Stop() called. after I comment out the h.db.Close on hook.Stop(), the badger not crashed, but the db.Close never called.

may be we should wait all tasks done before we stop the hook.

werbenhu commented 1 year ago

@fangdingjun Just see if this change #299 can fix it, and if it is, I think I should know the reason.

fangdingjun commented 1 year ago

I apply this change to v2.3.0 and add some log like this

if cl.StopCause() == packets.ErrServerShuttingDown {
    h.Log.Error().Msg("OnDisconnect, server is shutting down...")
    return
}

this change has effect, but not fix the issue, still crash on same place.

I noticed there are many errors like this

Writes are blocked, possibly due to DropAll or Close

maybe when we call h.db.Close, the prevous h.db.Delete is stll in progress and cause the problem

here is the log

9:11AM INF mochi mqtt starting version=2.3.0
9:11AM INF mochi mqtt server started
^C9:11AM INF stopping hook hook=allow-all-auth
9:11AM INF stopping hook hook=badger-db
9:11AM ERR failed to delete client data error="Writes are blocked, possibly due to DropAll or Close" data=conn_1fa50dfc8124445bba hook=badger-db
9:11AM ERR failed to upsert client data error="Writes are blocked, possibly due to DropAll or Close" data={"clean":true,"id":"conn_dde13fe3fa4d4d8b81","listener":"t1","properties":{"authenticationData":null,"authenticationMethod":"","maximumPacketSize":0,"receiveMaximum":0,"requestProblemInfo":0,"requestProblemInfoFlag":false,"requestResponseInfo":0,"sessionExpiryInterval":0,"sessionExpiryIntervalFlag":false,"topicAliasMaximum":0,"user":null},"protocolVersion":4,"remote":"192.168.56.1:53190","t":"CL","username":null,"will":{"flag":0,"payload":null,"qos":0,"retain":false,"topicName":"","user":null,"willDelayInterval":0}} hook=badger-db
9:11AM ERR failed to upsert client data error="Writes are blocked, possibly due to DropAll or Close" data={"clean":true,"id":"conn_0467947652764efe88","listener":"t1","properties":{"authenticationData":null,"authenticationMethod":"","maximumPacketSize":0,"receiveMaximum":0,"requestProblemInfo":0,"requestProblemInfoFlag":false,"requestResponseInfo":0,"sessionExpiryInterval":0,"sessionExpiryIntervalFlag":false,"topicAliasMaximum":0,"user":null},"protocolVersion":4,"remote":"192.168.56.1:53192","t":"CL","username":null,"will":{"flag":0,"payload":null,"qos":0,"retain":false,"topicName":"","user":null,"willDelayInterval":0}} hook=badger-db
9:11AM ERR failed to upsert client data error="Writes are blocked, possibly due to DropAll or Close" data={"clean":true,"id":"conn_7d717531c23e455daf","listener":"t1","properties":{"authenticationData":null,"authenticationMethod":"","maximumPacketSize":0,"receiveMaximum":0,"requestProblemInfo":0,"requestProblemInfoFlag":false,"requestResponseInfo":0,"sessionExpiryInterval":0,"sessionExpiryIntervalFlag":false,"topicAliasMaximum":0,"user":null},"protocolVersion":4,"remote":"192.168.56.1:53191","t":"CL","username":null,"will":{"flag":0,"payload":null,"qos":0,"retain":false,"topicName":"","user":null,"willDelayInterval":0}} hook=badger-db
9:11AM ERR failed to upsert client data error="Writes are blocked, possibly due to DropAll or Close" data={"clean":true,"id":"conn_4c0f947f19a442e79c","listener":"t1","properties":{"authenticationData":null,"authenticationMethod":"","maximumPacketSize":0,"receiveMaximum":0,"requestProblemInfo":0,"requestProblemInfoFlag":false,"requestResponseInfo":0,"sessionExpiryInterval":0,"sessionExpiryIntervalFlag":false,"topicAliasMaximum":0,"user":null},"protocolVersion":4,"remote":"192.168.56.1:53193","t":"CL","username":null,"will":{"flag":0,"payload":null,"qos":0,"retain":false,"topicName":"","user":null,"willDelayInterval":0}} hook=badger-db
9:11AM ERR Ondisconnect, server is shutting down... hook=badger-db
9:11AM ERR OnUnsubscribed, server is shutting down... hook=badger-db
9:11AM ERR Ondisconnect, server is shutting down... hook=badger-db
9:11AM ERR OnUnsubscribed, server is shutting down... hook=badger-db
9:11AM ERR Ondisconnect, server is shutting down... hook=badger-db
9:11AM ERR OnUnsubscribed, server is shutting down... hook=badger-db
9:11AM ERR Ondisconnect, server is shutting down... hook=badger-db
9:11AM ERR OnUnsubscribed, server is shutting down... hook=badger-db
9:11AM ERR Ondisconnect, server is shutting down... hook=badger-db
9:11AM ERR OnUnsubscribed, server is shutting down... hook=badger-db
9:11AM ERR Ondisconnect, server is shutting down... hook=badger-db
9:11AM ERR OnUnsubscribed, server is shutting down... hook=badger-db
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x865125]

goroutine 2457 [running]:
github.com/dgraph-io/badger/skl.(*Skiplist).IncrRef(...)
        /home/dingjun/source/go/pkg/mod/github.com/dgraph-io/badger@v1.6.0/skl/skl.go:86
github.com/dgraph-io/badger.(*DB).getMemTables(0xc0001d6380)
        /home/dingjun/source/go/pkg/mod/github.com/dgraph-io/badger@v1.6.0/db.go:481 +0xe5
github.com/dgraph-io/badger.(*DB).get(0xc0001d6380, {0xc008dd1440, 0x2c, 0x2c})
        /home/dingjun/source/go/pkg/mod/github.com/dgraph-io/badger@v1.6.0/db.go:513 +0x7d
github.com/dgraph-io/badger.(*Txn).Get(0xc008a9dc80, {0xc008dd1410, 0x24, 0x30})
        /home/dingjun/source/go/pkg/mod/github.com/dgraph-io/badger@v1.6.0/txn.go:407 +0x247
github.com/timshannon/badgerhold.(*Store).TxDelete(0x88b07a?, 0xc0000b4be0?, {0x9155e0, 0xc008cf8e30}, {0x9415c0, 0xc008cd5000})
        /home/dingjun/source/go/pkg/mod/github.com/timshannon/badgerhold@v1.0.0/delete.go:32 +0x118
github.com/timshannon/badgerhold.(*Store).Delete.func1(0xc0001d6380?)
        /home/dingjun/source/go/pkg/mod/github.com/timshannon/badgerhold@v1.0.0/delete.go:17 +0x33
github.com/dgraph-io/badger.(*DB).Update(0xc00004ac00?, 0xc008ca7320)
        /home/dingjun/source/go/pkg/mod/github.com/dgraph-io/badger@v1.6.0/txn.go:696 +0x83
github.com/timshannon/badgerhold.(*Store).Delete(0xc008c1042c?, {0x9155e0?, 0xc008cf8e30?}, {0x9415c0?, 0xc008cd5000?})
        /home/dingjun/source/go/pkg/mod/github.com/timshannon/badgerhold@v1.0.0/delete.go:16 +0x6f
github.com/mochi-mqtt/server/v2/hooks/storage/badger.(*Hook).OnDisconnect(0xc0001c4cc0, 0xc00865ac80, {0x0?, 0x0?}, 0x1)
        /home/dingjun/source/go/src/github.com/mochi-mqtt/server/hooks/storage/badger/badger.go:201 +0x186
github.com/mochi-mqtt/server/v2.(*Hooks).OnDisconnect(0xc0000a23c0?, 0x9c2c45?, {0x0, 0x0}, 0x0?)
        /home/dingjun/source/go/src/github.com/mochi-mqtt/server/hooks.go:257 +0x13c
github.com/mochi-mqtt/server/v2.(*Server).attachClient(0xc0000b45f0, 0xc00865ac80, {0x9bec2d, 0x2})
        /home/dingjun/source/go/src/github.com/mochi-mqtt/server/server.go:381 +0x7c5
github.com/mochi-mqtt/server/v2.(*Server).EstablishConnection(0x0?, {0x9bec2d, 0x2}, {0xa8dbe0?, 0xc008afe4b0?})
        /home/dingjun/source/go/src/github.com/mochi-mqtt/server/server.go:311 +0x65
github.com/mochi-mqtt/server/v2/listeners.(*TCP).Serve.func1()
        /home/dingjun/source/go/src/github.com/mochi-mqtt/server/listeners/tcp.go:85 +0x47
created by github.com/mochi-mqtt/server/v2/listeners.(*TCP).Serve
        /home/dingjun/source/go/src/github.com/mochi-mqtt/server/listeners/tcp.go:84 +0x145
werbenhu commented 1 year ago

@fangdingjun I made some changes in #299 , see if they are effective.

fangdingjun commented 1 year ago

with latest commit, work fine with fresh start, the server not crash on close.

but with prevous commit's badger data store, the server can not be closed, can only be ended by kill -9, I don't know if I do something wrong.

werbenhu commented 1 year ago

@fangdingjun you mean the 'crash on close' issue has been resolved, but now the program cannot exit?

werbenhu commented 1 year ago

@fangdingjun I conducted some tests:

  1. Exiting the program using Ctrl+C while establishing 2000 connections.
  2. Restarting the program using the previous badger data.
  3. Exiting the program again while establishing 2000 connections.
  4. Repeating steps 2 and 3.

However, I did not encounter the issue of the program being unable to exit. Perhaps you can provide more details about your testing, preferably with steps that can reproduce this issue.

fangdingjun commented 1 year ago

I can reproduce it by this step

  1. use v2.3.0 with badger store, connect some clients run for a while, call server.Close let it panic
  2. switch to latest commit version, use same data store, start the server, call server.Close, server will hang up
  3. switch back to v2.3.0 use same data store, do not connect clients, server.Close will end normal

if the data store not closed properly, with latest commit the server.Close unable to exit

werbenhu commented 1 year ago

@fangdingjun Thanks, I was able to reproduce this issue, and I've made some changes #299 to fix it. Can you please check if it's effective?

fangdingjun commented 1 year ago

with latest commit, the "crash on close" issue is back.

werbenhu commented 1 year ago

@fangdingjun Could you please provide the crash logs? I merged my changes into both v2.4.0 and v2.3.0, and I tested them using the steps you mentioned, and everything was OK. Did you merge my changes into your v2.3.0 codebase? Is there anything missing?

You can directly pull my repository at werbenhu/mqserver for testing.

If you want to test the scenario after the crash in v2.3.0, just copy the 'cmd/data' directory from your crashed v2.3.0 into the 'cmd' directory of v2.4.0.

In the context of the crash you experienced in v2.3.0, I suggest checking if the port you are currently listening on is still in use. You can add the following check in your main.go file.

err := server.AddListener(listeners.NewTCP("t1", ":1883", nil))
    if err != nil {
        panic(err)
    }
fangdingjun commented 1 year ago

301 solve this issue,

299 in some condition can still cause panic.

mochi-co commented 1 year ago

Resolved in v2.4.1, sorry for the delay!