sofastack / sofa-jraft

A production-grade java implementation of RAFT consensus algorithm.
https://www.sofastack.tech/projects/sofa-jraft/
Apache License 2.0

Creation of Snapshot fails #707

Open samirvb opened 2 years ago

samirvb commented 2 years ago

Your question

On one of my existing nodes, snapshot creation fails with the following exception stack trace:

2021-11-05 17:56:29.200 [ ] [JRaft-Closure-Executor-4] [init-64] ERROR c.a.s.j.s.s.l.LocalSnapshotWriter - Fail to create directory /node/data//sofajraft/stacs/snapshot/temp.
2021-11-05 17:56:29.201 [ ] [JRaft-Closure-Executor-4] [create-285] ERROR c.a.s.j.s.s.l.LocalSnapshotStorage - Fail to init snapshot writer.
2021-11-05 17:56:29.202 [ ] [JRaft-FSMCaller-Disruptor-0] [onError-72] ERROR c.a.s.j.c.StateMachineAdapter - Encountered an error=Status[EIO<1014>: Fail to create snapshot writer.] on StateMachine io.stacs.nav.consensus.sofajraft.config.SofajraftStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_SNAPSHOT
    at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.reportError(SnapshotExecutorImpl.java:691)
    at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.doSnapshot(SnapshotExecutorImpl.java:346)
    at com.alipay.sofa.jraft.core.NodeImpl.doSnapshot(NodeImpl.java:3098)
    at com.alipay.sofa.jraft.core.NodeImpl.lambda$handleSnapshotTimeout$0(NodeImpl.java:607)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

Note that the location already contains snapshot folders, so it's not a permissions issue. There is also no shortage of disk space. Any idea what might be happening? This error occurs on different nodes that have been running fine using sofajraft 1.3.5.

fengjiachun commented 2 years ago

Does /node/data//sofajraft/stacs/snapshot/temp already exist as something other than a directory?
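One way to answer this question from the JVM's point of view is sketched below. It is a minimal diagnostic, not part of JRaft: the class name `SnapshotPathCheck` and the helper `describe` are hypothetical, and the default path is simply the one from the log. It only uses standard `java.nio.file` calls and modifies nothing on disk. It also distinguishes a dangling symbolic link, which a plain existence check would report as missing.

```java
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SnapshotPathCheck {

    /** Describes what currently occupies the given path, without modifying anything. */
    public static String describe(Path p) {
        if (Files.isSymbolicLink(p)) {
            // Files.exists follows links, so a link whose target is gone reports false here.
            return Files.exists(p) ? "symbolic link" : "dangling symbolic link";
        }
        if (!Files.exists(p, LinkOption.NOFOLLOW_LINKS)) {
            return "missing";
        }
        if (Files.isDirectory(p)) {
            return "directory";
        }
        return Files.isRegularFile(p) ? "regular file" : "other";
    }

    public static void main(String[] args) {
        // Default path is taken from the log; pass another path as the first argument.
        Path temp = Paths.get(args.length > 0 ? args[0] : "/node/data//sofajraft/stacs/snapshot/temp");
        Path parent = temp.getParent();
        System.out.println(temp + ": " + describe(temp));
        System.out.println(parent + ": " + describe(parent) + ", writable=" + Files.isWritable(parent));
    }
}
```

Running it as the same user as the Java process matters, since the writable check reflects the permissions of whoever runs it.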

samirvb commented 2 years ago

> Does /node/data//sofajraft/stacs/snapshot/temp already exist as something other than a directory?

This location doesn't exist, either as a directory or as a file.

fengjiachun commented 2 years ago

Can you show the ls -lsh result for /stacs/snapshot/?

samirvb commented 2 years ago

Unfortunately, I no longer have the old node (we had to restore it). I had run "ls -la" on the location and found no "temp" folder or file there. Attached is a screenshot of the restored node:

image

Is there any way we can reproduce this issue? This is quite important: our cluster goes down when it happens, so we need to fix it.

fengjiachun commented 2 years ago

Most likely it was a permission issue, but the logger did not print the exception message. I fixed the logging in #708.
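To illustrate why the missing exception message matters, here is a minimal sketch using only the JDK. It does not show how JRaft itself creates the directory; the class name `MkdirDiagnostics` and the helper `tryCreate` are hypothetical. It contrasts `File.mkdirs()`, which reports only a boolean, with `Files.createDirectories`, which throws an `IOException` whose type and message name the cause and can therefore be logged.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MkdirDiagnostics {

    /** Creates the directory; returns null on success, or a message naming the cause on failure. */
    public static String tryCreate(Path dir) {
        try {
            Files.createDirectories(dir);
            return null;
        } catch (IOException e) {
            // The exception type and message carry the reason,
            // e.g. FileAlreadyExistsException or AccessDeniedException.
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("snapshot-demo");
        Path temp = base.resolve("temp");
        Files.createFile(temp); // occupy the name "temp" with a regular file

        // java.io.File reports only a boolean, with no cause attached.
        System.out.println("File.mkdirs -> " + new File(temp.toString()).mkdirs());
        // java.nio.file.Files throws, so the cause can be included in the log line.
        System.out.println("Files.createDirectories -> " + tryCreate(temp));
    }
}
```

The demo forces one known failure mode (the name is taken by a regular file) in a scratch directory. Whether that is what happened on the affected node is not established in this thread.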

killme2008 commented 2 years ago

I think it's a permission problem. Which user do you run the Java program as? In the screenshot above, the snapshot directory belongs to the root user.

samirvb commented 2 years ago

> I think it's a permission problem. Which user do you run the Java program as? In the screenshot above, the snapshot directory belongs to the root user.

Hi, all processes are run as the root user. See the screenshot below:

image

The process runs as the "root" user. I was able to create a directory in the same location using the mkdir command.

Can you let me know if there is any way to reproduce the snapshot creation (and, hopefully, this issue)?

fengjiachun commented 2 years ago

The failing step only creates a single directory and does nothing else, so we couldn't find a good way to reproduce it.

killme2008 commented 2 years ago

We will release a new version with more logging. If the problem reproduces in the future, we can then find the root cause.