sofastack / sofa-jraft

A production-grade java implementation of RAFT consensus algorithm.
https://www.sofastack.tech/projects/sofa-jraft/
Apache License 2.0

Standalone PD deployment: addReplica fails with: Fail to [addReplica], Status[ECATCHUP<10003>: Peer 127.0.0.1:8182 failed to catch up.]. #1078

Closed zxuanhong closed 4 months ago

zxuanhong commented 4 months ago

Your question

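  1. Adding a partition replica through PD does not succeed; it fails with "failed to catch up".
  2. Adding the peer was triggered, but the peer was then taken offline again. (The newly added peer 8182 works normally in another partition, and port 8182 is reachable via telnet.)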
    
    2024-03-06 06:49:01  INFO 93561 --- [flow-demo] [rpc-executor #4] com.alipay.sofa.jraft.core.NodeImpl      : Adding learners: [].
    2024-03-06 06:49:01  INFO 93561 --- [flow-demo] [rpc-executor #4] com.alipay.sofa.jraft.core.NodeImpl      : Adding peers: [127.0.0.1:8182].
    2024-03-06 06:49:01  INFO 93561 --- [flow-demo] [rpc-executor #4] com.alipay.sofa.jraft.core.Replicator    : Replicator [group: pd_test-2, peer: 127.0.0.1:8182, type: Follower] is started
    2024-03-06 06:49:01  WARN 93561 --- [flow-demo] [es-Thread-Send4] com.alipay.sofa.jraft.core.Replicator    : Fail to issue RPC to 127.0.0.1:8182, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 127.0.0.1:8182, group: pd_test-2], groupId=pd_test-2
    2024-03-06 06:49:01  WARN 93561 --- [flow-demo] [es-Thread-Send4] com.alipay.sofa.jraft.core.Replicator    : Fail to issue RPC to 127.0.0.1:8182, consecutiveErrorTimes=11, error=Status[ENOENT<1012>: Peer id not found: 127.0.0.1:8182, group: pd_test-2], groupId=pd_test-2
    2024-03-06 06:49:02  WARN 93561 --- [flow-demo] [ure-Executor-11] com.alipay.sofa.jraft.core.NodeImpl      : Node <pd_test-2/127.0.0.1:8181> caughtUp failed, status=Status[ETIMEDOUT<1010>: ETIMEDOUT], peer=127.0.0.1:8182.
    2024-03-06 06:49:02  WARN 93561 --- [flow-demo] [ure-Executor-11] com.alipay.sofa.jraft.core.NodeImpl      : Node <pd_test-2/127.0.0.1:8181> fail to catch up peer 127.0.0.1:8182 when trying to change peers from [127.0.0.1:8181, 127.0.0.1:8183] to [127.0.0.1:8181, 127.0.0.1:8183, 127.0.0.1:8182].
    2024-03-06 06:49:02  INFO 93561 --- [flow-demo] [ure-Executor-11] c.a.sofa.jraft.core.ReplicatorGroupImpl  : Stop replicator to 127.0.0.1:8182, group id pd_test-2.

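  3. In a later test, the replica node was already part of the partition when the cluster was initialized. After a successful startup I first ran removeReplica on that node and then addReplica, and that worked normally. Do all partition replicas have to be configured at startup? Shouldn't it be possible to add and remove them dynamically?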
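
To make the order of failures in the trace above easier to see, the status codes can be pulled out of the log lines. The sketch below is a hypothetical helper (`StatusExtractor` is not part of SOFAJRaft) and assumes only the `Status[NAME<code>: message]` format visible in the log. Run over the trace, it shows the replicator RPCs being rejected with ENOENT for group pd_test-2 before the catch-up ends in ETIMEDOUT, which reads as if 127.0.0.1:8182 had no node registered for that group when the RPCs arrived.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of SOFAJRaft: extracts "Status[NAME<code>: message]" from log lines.
final class StatusExtractor {
    // Matches e.g. "Status[ENOENT<1012>: Peer id not found: 127.0.0.1:8182, group: pd_test-2]".
    private static final Pattern STATUS =
        Pattern.compile("Status\\[([A-Z]+)<(\\d+)>: ([^\\]]*)\\]");

    // Returns one "NAME(code): message" entry per status found, in log order.
    static List<String> extract(List<String> logLines) {
        List<String> out = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = STATUS.matcher(line);
            while (m.find()) {
                out.add(m.group(1) + "(" + m.group(2) + "): " + m.group(3));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two lines shortened from the trace above.
        List<String> lines = Arrays.asList(
            "WARN Replicator : Fail to issue RPC to 127.0.0.1:8182, consecutiveErrorTimes=1, "
                + "error=Status[ENOENT<1012>: Peer id not found: 127.0.0.1:8182, group: pd_test-2], groupId=pd_test-2",
            "WARN NodeImpl : Node <pd_test-2/127.0.0.1:8181> caughtUp failed, "
                + "status=Status[ETIMEDOUT<1010>: ETIMEDOUT], peer=127.0.0.1:8182.");
        for (String status : extract(lines)) {
            System.out.println(status);
        }
    }
}
```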

### Environment

- SOFAJRaft version: 1.3.14
- JVM version (e.g. `java -version`): 1.8
- OS version (e.g. `uname -a`): mac
- Maven version: 3.9.5
- IDE version: idea 2023.3.4

fengjiachun commented 4 months ago

failed to catch up

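Log catch-up failed at the Raft layer. You will need to check the logs for the specific cause; it may have been a timeout. Adding a replica means copying the snapshot file from the leader and then catching up with the leader's log. If snapshots are not enabled, the new replica has to catch up on the log from the very beginning. In most cases, adding a replica cannot finish quickly.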
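
As a rough illustration of why a snapshot matters here, the toy model below estimates how many log entries a new replica has to replay and whether that fits into a catch-up window. It is only a sketch: `CatchUpEstimate`, its methods, and all numbers are made up for illustration and are not SOFAJRaft internals, and it ignores the time needed to transfer the snapshot itself.

```java
// Toy model only: names and numbers are illustrative assumptions, not SOFAJRaft internals.
final class CatchUpEstimate {

    // Entries the new replica must replay from the leader's log.
    // snapshotIndex is the last log index covered by the installed snapshot (0 if snapshots are disabled).
    static long entriesToReplay(long leaderLastIndex, long snapshotIndex) {
        return Math.max(0, leaderLastIndex - snapshotIndex);
    }

    // True if replaying `entries` at the given rate finishes within `timeoutMs`.
    static boolean catchesUpInTime(long entries, long entriesPerSecond, long timeoutMs) {
        if (entriesPerSecond <= 0) {
            return false;
        }
        // Ceiling division so that a partial millisecond counts as a full one.
        long neededMs = (entries * 1000 + entriesPerSecond - 1) / entriesPerSecond;
        return neededMs <= timeoutMs;
    }

    public static void main(String[] args) {
        long leaderLastIndex = 1_000_000L; // assumed log length on the leader
        long rate = 50_000L;               // assumed replay rate, entries per second
        long timeoutMs = 1_000L;           // assumed catch-up window

        long withSnapshot = entriesToReplay(leaderLastIndex, 990_000L);
        long withoutSnapshot = entriesToReplay(leaderLastIndex, 0L);

        System.out.println("with snapshot:    " + withSnapshot + " entries, in time = "
            + catchesUpInTime(withSnapshot, rate, timeoutMs));
        System.out.println("without snapshot: " + withoutSnapshot + " entries, in time = "
            + catchesUpInTime(withoutSnapshot, rate, timeoutMs));
    }
}
```

Under these assumed numbers the replica that starts from a recent snapshot fits into the window, while the one replaying from index 0 does not. In the trace from the question the failure arrived about one second after the replicator started, so there the window was short either way.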