hyperledger-archives / fabric

THIS IS A READ-ONLY historic repository. Current development is at https://gerrit.hyperledger.org/r/#/admin/projects/fabric . Pull requests are not accepted.

ledger data become inconsistent across nodes after restarting a peer (using PBFT batch) #1331

Open · sachikoy opened this issue 8 years ago

sachikoy commented 8 years ago

I am testing in an environment with 4 VPs (vp0, ..., vp3) with PBFT batch and security enabled. When I stop one peer and restart it, the peer's state becomes inconsistent with the other peers, yet all peers keep reaching consensus.

Here are the precise steps I took:

  1. invoke several transactions on all of the peers (vp0, ..., vp3)
  2. stop vp3, and make sure the other nodes can still process transactions
  3. start vp3, and make sure that vp3 (and all the other peers) can process transactions
  4. vp3's block height (from the /chain REST API endpoint) is then different from the other peers' (a sketch for comparing heights follows this list)
  5. if I query the state, vp3's result is different from the other peers'
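
For step 4, here is a minimal sketch of how the heights can be compared across peers using the 0.x REST API's GET /chain endpoint, which reports the chain height as JSON. The peer addresses and port are assumptions, not part of the report; adjust them to your CORE_REST_ADDRESS settings:

```go
// compare_heights.go — a minimal sketch for spotting a diverging ledger.
// Assumption: each peer exposes the Fabric 0.x REST API (GET /chain) at
// the addresses below; adjust to your deployment.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type chainInfo struct {
	Height uint64 `json:"height"`
}

func main() {
	peers := []string{ // hypothetical addresses
		"http://vp0:5000", "http://vp1:5000",
		"http://vp2:5000", "http://vp3:5000",
	}
	for _, p := range peers {
		resp, err := http.Get(p + "/chain")
		if err != nil {
			fmt.Printf("%s: unreachable: %v\n", p, err)
			continue
		}
		var ci chainInfo
		if err := json.NewDecoder(resp.Body).Decode(&ci); err != nil {
			fmt.Printf("%s: bad response: %v\n", p, err)
		} else {
			fmt.Printf("%s: height %d\n", p, ci.Height)
		}
		resp.Body.Close()
	}
}
```

A height that differs on exactly one peer after a restart matches the behavior described in steps 4 and 5.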
cophey commented 8 years ago

@cca88 As you said, "We understand the need to synchronize all nodes after more time passes, but this is not yet implemented." But when do all nodes synchronize? I am very confused about how and when to synchronize all nodes. I did the same as @sachikoy: at first, the data was not consistent. Then, after stopping peer2 or peer3 a few times, deploying and invoking chaincode a few times, and restarting the peers, all peers were consistent. Here is the log:

07:20:52.442 [peer] handleChat -> ERRO 082 Error handling message: Peer FSM failed while handling message (DISC_HELLO): current state: created, error: transition canceled with error: Error registering Handler: Duplicate Handler error: {name:"vp1" 172.0.0.55:30303 VALIDATOR }
07:21:12.117 [consensus/util] RegisterChannel -> WARN 083 Received duplicate connection from <nil>, switching to new connection
07:22:34.028 [consensus/obcpbft] recvViewChange -> INFO 084 Replica 3 received view-change from replica 1, v:1, h:10, |C|:1, |P|:6, |Q|:6
07:22:34.108 [consensus/obcpbft] recvViewChange -> INFO 085 Replica 3 received view-change from replica 4, v:1, h:10, |C|:1, |P|:6, |Q|:6
07:22:36.028 [consensus/obcpbft] processEvent -> INFO 086 Replica 3 view change timer expired, sending view change
07:22:36.028 [consensus/obcpbft] sendViewChange -> INFO 087 Replica 3 sending view-change, v:2, h:0, |C|:1, |P|:15, |Q|:16
07:22:36.029 [consensus/obcpbft] recvViewChange -> INFO 088 Replica 3 received view-change from replica 3, v:2, h:0, |C|:1, |P|:15, |Q|:16
07:22:36.029 [consensus/obcpbft] loop -> WARN 089 Attempting to stop an unfired idle timer
07:22:36.029 [consensus/obcpbft] recvViewChange -> INFO 08a Replica 3 received view-change from replica 0, v:2, h:0, |C|:2, |P|:16, |Q|:17
07:22:36.030 [consensus/obcpbft] recvViewChange -> INFO 08b Replica 3 received view-change from replica 2, v:2, h:0, |C|:1, |P|:15, |Q|:16
07:22:36.032 [consensus/obcpbft] recvViewChange -> INFO 08c Replica 3 received view-change from replica 1, v:2, h:10, |C|:1, |P|:6, |Q|:6
07:22:36.033 [consensus/obcpbft] recvViewChange -> INFO 08d Replica 3 received view-change from replica 4, v:2, h:10, |C|:1, |P|:6, |Q|:6
07:22:36.038 [consensus/obcpbft] recvNewView -> INFO 08e Replica 3 received new-view 2
07:22:36.039 [consensus/obcpbft] processNewView -> WARN 08f Replica 3 missing base checkpoint 10 (CAsSQO3LcF3sA1A+77mumM+7mvhGY6nHNVILPmp6P4PARCIK+clPn9K3Pi/lQ1tAUTkXY56YplcIuW6SZPiIVHywU6AaQNYhUlbq9E31Ot4mGTdb06KEgTViJ9/ZFONpBC4fElO/YTEEndAhYEctz/+MMC/HdUHzF+qkf0PIyAXIdmNSXPo=)
07:22:36.040 [consensus/handler] SkipTo -> WARN 090 State transfer is being called for, but the state has not been invalidated
07:22:36.040 [consensus/obcpbft] processNewView2 -> INFO 091 Replica 3 accepting new-view to view 2
07:22:36.341 [consensus/obcpbft] loop -> WARN 092 Attempting to stop an unfired idle timer
07:22:36.342 [consensus/obcpbft] loop -> WARN 093 Attempting to stop an unfired idle timer
07:22:36.343 [consensus/obcpbft] loop -> WARN 094 Attempting to stop an unfired idle timer
07:22:36.345 [consensus/obcpbft] loop -> WARN 095 Attempting to stop an unfired idle timer
07:22:36.346 [consensus/obcpbft] loop -> WARN 096 Attempting to stop an unfired idle timer
07:22:36.348 [consensus/obcpbft] loop -> WARN 097 Attempting to stop an unfired idle timer
07:22:36.349 [consensus/obcpbft] loop -> WARN 098 Attempting to stop an unfired idle timer
07:22:36.350 [consensus/obcpbft] loop -> WARN 099 Attempting to stop an unfired idle timer
07:22:37.554 [consensus/obcpbft] processEvent -> INFO 09a Replica 3 application caught up via state transfer, lastExec now 10
07:22:37.555 [consensus/obcpbft] executeOne -> INFO 09b Replica 3 executing/committing request for view=2/seqNo=11 and digest VNgkd+K5LG9ZXm4SowbIfSky4irYa7gvRqy7Uv0YOPsCsTfQRSsCOWBV9gn46RR6RH8lmY2CUeCH0JW9dD/ECg==
07:22:55.341 [consensus/obcpbft] execDoneSync -> INFO 09c Replica 3 finished execution 11, trying next
07:22:55.342 [consensus/obcpbft] executeOne -> INFO 09d Replica 3 executing/committing request for view=2/seqNo=12 and digest wjOnFNHPJElPfUnULjFwotEvZ103D5L9mkSe0kNPjHxXUEQ3sqFwjZMISKDJAJYH4tiz4eg168xXoOvXRbOxzw==
07:22:55.346 [consensus/obcpbft] execDoneSync -> INFO 09e Replica 3 finished execution 12, trying next
07:22:55.347 [consensus/obcpbft] executeOne -> INFO 09f Replica 3 executing/committing request for view=2/seqNo=13 and digest 9LZq13v9biLJgEGNVb9yrHCh8NBw3n6KqR4giL66sxq4jqjr1MVXOKZSGS6fioUNqmU7hODkTtWCNsDev6k/xA==
07:22:55.888 [consensus/obcpbft] execDoneSync -> INFO 0a0 Replica 3 finished execution 13, trying next
07:22:55.888 [consensus/obcpbft] executeOne -> INFO 0a1 Replica 3 executing/committing request for view=2/seqNo=14 and digest GCK9aJ7UgbH4Fq0aF25NRszCZ8JDFEsjrnis+KOh65FMSS3GxexOu0fVXbZrN4InqPkPl9FTsQpBLtSW9uoX3w==
07:23:07.285 [consensus/obcpbft] execDoneSync -> INFO 0a2 Replica 3 finished execution 14, trying next
07:23:07.285 [consensus/obcpbft] executeOne -> INFO 0a3 Replica 3 executing/committing request for view=2/seqNo=15 and digest kqP4gKcNvepiStEF5bJa8M4/7jqj/Gc4OuWY1IQYOi5ACDdKTchjcmR3xEFHpC+3rbUb3v0ziALKo3yVVYDJFA==
07:23:26.229 [consensus/obcpbft] execDoneSync -> INFO 0a4 Replica 3 finished execution 15, trying next
07:23:26.229 [consensus/obcpbft] executeOne -> INFO 0a5 Replica 3 executing/committing request for view=2/seqNo=16 and digest CpcJQd1nxUMunUAKytSbryWKOcqXXuK552qyM9Y6rI1EvXXhptMPEJd1BXrfRqEcJsjE8+seZM/MB8UDNOFXrw==
07:23:27.711 [consensus/obcpbft] execDoneSync -> INFO 0a6 Replica 3 finished execution 16, trying next
07:23:27.711 [consensus/obcpbft] executeOne -> INFO 0a7 Replica 3 executing/committing request for view=2/seqNo=17 and digest em1uHEGa5nS+1mamx89vMRIsmu2RM+b3kEqZhJqtIkcuB/JE5YDEI3ZzdHFixuxO5vXa1SxfnQh2ccCNrleUBg==
07:23:28.565 [consensus/obcpbft] execDoneSync -> INFO 0a8 Replica 3 finished execution 17, trying next
07:23:28.568 [consensus/obcpbft] executeOne -> INFO 0a9 Replica 3 executing/committing request for view=2/seqNo=18 and digest ptvCPHdlSxwbhE9urbjVU3ny0iXT45B/tk7BdJQkeAi9k8IhjV8GmExJCJHYLOtpsvfRL6UJuEJLkc/hX+8nDQ==
07:23:28.577 [consensus/obcpbft] execDoneSync -> INFO 0aa Replica 3 finished execution 18, trying next

cophey commented 8 years ago

I then did exactly what I had done before, but the peers could not synchronize again. I have only managed to synchronize all peers that one time.

corecode commented 8 years ago

@cophey Please clearly explain your expectations and what you observe.

cophey commented 8 years ago

@corecode I am sorry that my description is not clear.

Here are the precise steps I took:

  1. I start 5 VPs (vp0, ..., vp4) with CORE_PEER_VALIDATOR_CONSENSUS_PLUGIN=pbft and CORE_PBFT_GENERAL_N=5.
  2. invoke several transactions on all of the peers (vp0, ..., vp4); all peers are consistent.
  3. stop vp3, and invoke several transactions on the other peers; all peers (except vp3) reach consensus.
  4. start vp3, and find that vp3's block height is not the same as the others'.
  5. stop vp2, and invoke several transactions on vp3; vp3 does not record them in its ledger, while the others (except vp2 and vp3) stay consistent and their block height increases.
  6. start vp2, and find that vp2 and vp3 are inconsistent with the others (see the note after this list).
  7. at this point all peers are running, but vp2 and vp3 are not consistent. I invoke several transactions on all of the peers; after about 10 transactions (or after some time), vp4 stops itself and I restart it. Then some other peers stop themselves, and I restart each of them.
  8. finally I restart all peers, and all peers are consistent, producing the log above.
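
For context (this is not stated in the thread): classic PBFT tolerates at most f = ⌊(N−1)/3⌋ faulty replicas and needs 2f+1 matching messages to commit, so with CORE_PBFT_GENERAL_N=5 only one peer may safely be down at a time, and steps 5-6 take two peers out of service. A minimal sketch of the arithmetic:

```go
// pbft_tolerance.go — a back-of-the-envelope check, assuming the classic
// PBFT bounds: f = (N-1)/3 faults tolerated, 2f+1 messages for a quorum.
package main

import "fmt"

func main() {
	N := 5           // CORE_PBFT_GENERAL_N
	f := (N - 1) / 3 // faults tolerated: 1
	quorum := 2*f + 1
	fmt.Printf("N=%d: tolerates f=%d crashed peer(s); quorum of %d needed\n", N, f, quorum)
	// Stopping vp2 while vp3 is still behind leaves only a bare quorum of
	// healthy, up-to-date replicas, so consensus can keep advancing while
	// the restarted peers lag behind — consistent with steps 5-6 above.
}
```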

I don't know why all the peers were consistent in the end. I repeated the above steps, but the result was different: this time the peers would not become consistent at all. It confuses me. When does a peer catch up? (see the note after the log below)

In addition, during these steps the logs show the following warnings:

01:01:53.606 [consensus/util] RegisterChannel -> INFO 01e Registering connection from <nil>
01:01:56.328 [consensus/util] RegisterChannel -> WARN 01f Received duplicate connection from <nil>, switching to new connection
01:02:19.612 [consensus/util] RegisterChannel -> WARN 020 Received duplicate connection from <nil>, switching to new connection
01:02:20.525 [consensus/util] RegisterChannel -> WARN 021 Received duplicate connection from <nil>, switching to new connection
01:02:20.541 [peer] handleChat -> ERRO 022 Error handling message: Peer FSM failed while handling message (DISC_HELLO): current state: created, error: transition canceled with error: Error registering Handler: Duplicate Handler error: {name:"vp3" 172.0.0.52:30303 VALIDATOR }
01:03:39.616 [consensus/util] RegisterChannel -> WARN 023 Received duplicate connection from <nil>, switching to new connection
01:03:40.521 [consensus/util] RegisterChannel -> WARN 024 Received duplicate connection from <nil>, switching to new connection
01:03:40.522 [peer] handleChat -> ERRO 025 Error handling message: Peer FSM failed while handling message (DISC_HELLO): current state: created, error: transition canceled with error: Error registering Handler: Duplicate Handler error: {name:"vp4" 172.0.0.43:30303 VALIDATOR }
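
On "when does a peer catch up?": the earlier log suggests the mechanism, though the following is an inference rather than anything stated in the thread. In the obcpbft plugin, a lagging replica is pulled forward by state transfer only when it learns of a stable checkpoint it is missing, and checkpoints are taken every K requests (the h:10 watermarks in the log are consistent with the default K of 10 in consensus/obcpbft/config.yaml). A back-of-the-envelope sketch, assuming the default config values:

```go
// Watermark arithmetic for obcpbft, assuming the default
// consensus/obcpbft/config.yaml values for K and logmultiplier.
package main

import "fmt"

func main() {
	K := 10            // assumed checkpoint period: a stable checkpoint every 10 requests
	logMultiplier := 4 // assumed log size as a multiple of K
	L := K * logMultiplier
	fmt.Printf("checkpoint every %d requests; watermark window of %d seqNos\n", K, L)
	// Until the network commits past the lagging replica's next checkpoint,
	// nothing triggers a state transfer, which may be why a freshly restarted
	// peer appears stuck for a while (or indefinitely, if traffic stops).
}
```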

kchristidis commented 8 years ago

@cophey please attach a zip file with the logs from all peers. In step 4, can you say how many transactions you invoke and approximately how much time you spend in that step?

cophey commented 8 years ago

@kchristidis I am sorry, but I cannot reproduce it. I only synchronized all the peers successfully that one time, and I have no logs from it. Sorry about that.

In step 4, I invoked about 6 times, and it took about one or two minutes. I am sorry, I do not remember it very clearly.

Anyway, what I want is:

  1. when one peer restarts, it can catch up with the ledger, no matter how much time has passed.
  2. when a new peer is added, it can catch up with the ledger, no matter how much time has passed.

What should I do? (see the polling sketch below)
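
As far as I can tell, the 0.x code base has no command that forces a catch-up; a lagging peer is synced by state transfer once consensus moves past a checkpoint it is missing. What can be scripted is a wait. Below is a hypothetical helper (the addresses, port, and timeout are assumptions, not Fabric APIs) that polls a restarted peer's REST /chain endpoint until its height reaches that of a healthy peer:

```go
// wait_caught_up.go — a hypothetical helper, not part of Fabric: poll a
// restarted peer's 0.x REST /chain endpoint until its block height
// reaches that of the rest of the network.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func chainHeight(peer string) (uint64, error) {
	resp, err := http.Get(peer + "/chain") // 0.x REST API; address is an assumption
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var ci struct {
		Height uint64 `json:"height"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&ci); err != nil {
		return 0, err
	}
	return ci.Height, nil
}

func waitCaughtUp(peer string, target uint64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if h, err := chainHeight(peer); err == nil && h >= target {
			return nil
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("%s did not reach height %d within %v", peer, target, timeout)
}

func main() {
	// Hypothetical addresses; adjust to your CORE_REST_ADDRESS settings.
	target, err := chainHeight("http://vp0:5000")
	if err != nil {
		panic(err)
	}
	if err := waitCaughtUp("http://vp3:5000", target, 5*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```

Note that if no new transactions arrive, a lagging peer may never reach the target height on its own, so keep some traffic flowing while waiting.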
christo4ferris commented 8 years ago

@sachikoy this thread has gone silent... is it still an issue? Do we have tests that can assure expected behavior?