lni / dragonboat

A feature complete and high performance multi-group Raft library in Go.
Apache License 2.0

Snapshot save error #228

Open uber42 opened 2 years ago

uber42 commented 2 years ago
panic: /home/user/repos/storage-69397425/3/raft/dev/00000000000000000001/snapshot-part-1/snapshot-1-3 doesn't exist when creating /home/user/repos/storage-69397425/3/raft/dev/00000000000000000001/snapshot-part-1/snapshot-1-3/snapshot-00000000000003E9-3.generating

goroutine 350 [running]:
github.com/lni/dragonboat/v3/internal/fileutil.Mkdir({0xc004358280, 0x97}, {0x1639918, 0x1d79520})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/internal/fileutil/utils.go:122 +0x2dc
github.com/lni/dragonboat/v3/internal/server.(*SSEnv).createDir(0xc01f9486f0, {0xc004358280, 0x97})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/internal/server/snapshotenv.go:251 +0x86
github.com/lni/dragonboat/v3/internal/server.(*SSEnv).CreateTempDir(0xc01f9486f0)
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/internal/server/snapshotenv.go:200 +0x45
github.com/lni/dragonboat/v3.(*snapshotter).Save(_, {_, _}, {0x3, 0x3e9, 0x169, 0x3e9, {0x0, 0x0, {0x0, ...}, ...}, ...})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/snapshotter.go:104 +0x125
github.com/lni/dragonboat/v3/internal/rsm.(*StateMachine).doSave(_, {0x3, 0x3e9, 0x169, 0x3e9, {0x0, 0x0, {0x0, 0x0}, 0x0, ...}, ...})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/internal/rsm/statemachine.go:802 +0x193
github.com/lni/dragonboat/v3/internal/rsm.(*StateMachine).concurrentSave(_, {_, _, {_, _}, _, _})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/internal/rsm/statemachine.go:758 +0x358
github.com/lni/dragonboat/v3/internal/rsm.(*StateMachine).Save(_, {_, _, {_, _}, _, _})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/internal/rsm/statemachine.go:509 +0x2a5
github.com/lni/dragonboat/v3.(*node).doSave(0xc000420800, {0x0, 0x0, {0x0, 0x0}, 0x0, 0x0})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/node.go:705 +0x2d6
github.com/lni/dragonboat/v3.(*node).save(0xc000420800, {0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0, 0x0, ...})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/node.go:684 +0x7b
github.com/lni/dragonboat/v3.(*ssWorker).save(0xc0003a9f60, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0, 0x0, ...}, ...})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/engine.go:296 +0x78
github.com/lni/dragonboat/v3.(*ssWorker).handle(0xc0003a9f60, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0, 0x0, ...}, ...})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/engine.go:279 +0xba
github.com/lni/dragonboat/v3.(*ssWorker).workerMain(0xc0003a9f60)
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/engine.go:265 +0x1bb
github.com/lni/dragonboat/v3.newSSWorker.func1()
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/engine.go:251 +0x25
github.com/lni/goutils/syncutil.(*Stopper).runWorker.func1()
        /home/user/go/pkg/mod/github.com/lni/goutils@v1.3.0/syncutil/stopper.go:79 +0x173
created by github.com/lni/goutils/syncutil.(*Stopper).runWorker
        /home/user/go/pkg/mod/github.com/lni/goutils@v1.3.0/syncutil/stopper.go:74 +0x133

Dragonboat version

v3.3.1

Steps to reproduce the behavior

I have not been able to reproduce it.

lni commented 2 years ago

Hi @uber42, thanks for reporting this issue.

Could you please confirm which filesystem was used? Is it a local file system or a networked file system such as NFS?

uber42 commented 2 years ago

Hi, I use ext4.

lni commented 2 years ago

@uber42 thanks for the info.

As you can see from the error log:

/home/user/repos/storage-69397425/3/raft/dev/00000000000000000001/snapshot-part-1/snapshot-1-3 doesn't exist when creating /home/user/repos/storage-69397425/3/raft/dev/00000000000000000001/snapshot-part-1/snapshot-1-3/snapshot-00000000000003E9-3.generating

The dir "/home/user/repos/storage-69397425/3/raft/dev/00000000000000000001/snapshot-part-1/snapshot-1-3" is missing when a new snapshot is about to be created inside it.

This dir is created when the node is started in NodeHost.startCluster(). I don't think there is any code that would delete the dir.

Is there any chance that it was deleted by some of your code?

uber42 commented 2 years ago

The root raft directory cannot be deleted by our code. This result was obtained while testing our project with various fault injections, including network partitions between nodes. Perhaps a leader change could trigger this behavior. Unfortunately, the logs were lost :(

lni commented 2 years ago

@uber42 thanks for the info.

I have the feeling that this issue is highly unlikely to be caused by Dragonboat's code. If you check the source code, a node's snapshot dir is never deleted; Dragonboat only deletes what's in the directory. Large-scale fault injection tests have been a part of Dragonboat's development process for years, and this has never come up in any of those tests.

Could you please try to re-run your tests and provide the full log when you can reproduce the issue? I really want to help you get to the bottom of this. Thanks.

uber42 commented 2 years ago

I will try to reproduce it, but so far this is an isolated case across a very large number of tests.

lni commented 2 years ago

@uber42 did you manage to reproduce this?