vmware / concord-bft

Concord Byzantine fault tolerant state machine replication library

Remove the CRE (possibly infinite) loop from the end of state transfer. #2995

Open yontyon opened 1 year ago

yontyon commented 1 year ago

After carefully analyzing the problem, it appears we can keep the two mechanisms as long as we synchronize between them.

We have two options:

  1. The replica receives the reconfiguration update before state transfer ends. In this case, the replica adopts the new configuration and restarts state transfer (because it has not finished yet). Then, once state transfer actually completes in the new configuration, we will not execute the scale command again, since it was already written to the configurations file.
  2. The replica does not receive the update via CRE but does complete state transfer. In that case we simply execute the reconfiguration update like any other reconfiguration command and, as before, record the new configuration in the configurations file (see the sketch after this list).
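
For illustration only, here is a minimal sketch of how the two paths could converge on the configurations file. All names (ScaleCommand, ConfigurationsFile, executeReconfiguration) are hypothetical placeholders, not actual concord-bft APIs.

  // Hypothetical sketch: decide, after state transfer completes, whether the
  // scale command still has to be executed or was already applied (option 1).
  #include <string>

  struct ScaleCommand {
    std::string targetConfiguration;  // descriptor of the new configuration
  };

  class ConfigurationsFile {
   public:
    // True if this configuration was already persisted, i.e. the scale command
    // was applied before state transfer restarted in the new configuration.
    bool contains(const std::string& configuration) const;
    void persist(const std::string& configuration);
  };

  void onStateTransferComplete(const ScaleCommand& cmd, ConfigurationsFile& configs) {
    if (configs.contains(cmd.targetConfiguration)) {
      return;  // Option 1: already recorded, do not execute the scale command again.
    }
    // Option 2: ST finished before CRE fetched the update; execute the scale
    // command like any other reconfiguration command and record it.
    // executeReconfiguration(cmd);  // placeholder for the actual execution path
    configs.persist(cmd.targetConfiguration);
  }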

Notes:

  1. Removing the synchronization phase (the loop) means that every new reconfiguration command has to be developed with the above problem in mind (there is no generic solution).
  2. The proposed change does not handle some (possible) atomicity issues (such as crashing in the middle of the state transfer callbacks). These problems should be analyzed and handled in a different PR.
eyalrund commented 1 year ago

Looks risky. We need to test state transfer scenarios.

yontyon commented 1 year ago

Looks risky. We need to test state transfer scenarios.

There are Apollo tests that cover this, but scheduling a deterministic test for this specific scenario is not supported at the moment.

glevkovich commented 1 year ago

The proposed change does not handle some (possible) atomicity issues (such as crashing in the middle of the state transfer callbacks). These problems should be analyzed and handled in a different PR.

@yontyon , if you do not handle all the above side cases, then my understanding is that it is not a complete solution. If a replica does crash, will it be able to continue somehow, or is it "game over" for the replica? If the latter, are you planning to implement the additional PR ASAP?

From your words, some of the persistency tests will fail for sure. I suggest disabling them; replicas are killed there tens to hundreds of times per test, so they will certainly fail on CI.

glevkovich commented 1 year ago

@yontyon

Are the current tests good enough, or do we need to add new tests to cover the new side cases introduced by these modifications?

yontyon commented 1 year ago

The proposed change does not handle some (possible) atomicity issues (such as crashing in the middle of the state transfer callbacks). These problems should be analyzed and handled in a different PR.

@yontyon , if you do not handle all the above side cases, then my understanding is that it is not a complete solution. If a replica does crash, will it be able to continue somehow, or is it "game over" for the replica? If the latter, are you planning to implement the additional PR ASAP?

From your words, some of the persistency tests will fail for sure. I suggest disabling them; replicas are killed there tens to hundreds of times per test, so they will certainly fail on CI.

@glevkovich , I'm referring to this piece of code:

  // Mark ST final completion to "outside world" - After this call any incoming messages are not dropped.
  setExternalCollectingState(false);

  // Now after external state have changed, invoke the 2nd group of registered callbacks
  // TODO - This next line should be converted to an internal message. There is no need for a callback here. ST is done.
  // worst case we will have some few incoming messages dropped, until key-exchange is done.
  // on_fetching_state_change_cb_registry_ should be removed
  auto size = on_fetching_state_change_cb_registry_.size();
  LOG_INFO(
      logger_,
      "Starting to invoke all registered calls (on_fetching_state_change_cb_registry_):" << KVLOG(size, checkpointNum));
  on_fetching_state_change_cb_registry_.invokeAll(checkpointNum);
  LOG_INFO(logger_, "Done invoking all registered calls (on_fetching_state_change_cb_registry_)");

AFAIU, if the replica crashes during on_fetching_state_change_cb_registry_, it won't execute the callbacks again on restart (because state transfer has already finished).

In addition, this is rather a general warning: once you rely on two different mechanisms and one of them is implemented in a state transfer callback, the developer has to keep in mind that its logic might be executed twice (in case the replica crashed during the callback execution). Until now there was a generic solution (the CRE loop), but now it is the developer's responsibility to handle all these edge cases.
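
As an illustration of the kind of guard each such callback would now need, here is a minimal sketch under that assumption; MarkerStore, makeIdempotentCallback and the "scale-applied-" key are hypothetical placeholders, not concord-bft APIs.

  // Hypothetical idempotency guard for a state-transfer-completion callback:
  // check a durable marker before re-executing the callback's effect.
  #include <cstdint>
  #include <functional>
  #include <string>

  struct MarkerStore {  // stand-in for any durable key/value storage
    bool has(const std::string& key) const;
    void put(const std::string& key, uint64_t checkpointNum);
  };

  std::function<void(uint64_t)> makeIdempotentCallback(MarkerStore& markers,
                                                       std::function<void(uint64_t)> effect) {
    return [&markers, effect](uint64_t checkpointNum) {
      const std::string key = "scale-applied-" + std::to_string(checkpointNum);
      if (markers.has(key)) return;     // already ran before a crash - skip
      effect(checkpointNum);            // the actual reconfiguration logic
      markers.put(key, checkpointNum);  // a crash between these two lines still
                                        // re-runs the effect, so the effect itself
                                        // must be idempotent or persisted atomically
    };
  }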

yontyon commented 1 year ago

@yontyon

Are the current tests good enough, or do we need to add new tests to cover the new side cases introduced by these modifications?

To test the specific scenario, we need to be able to crash a replica exactly where the loop used to be, and before its CRE has fetched the update. AFAIK, we don't have this capability at the moment. So a good test (IMO) is simply to see that the relevant tests do not crash (this PR has run 4 times already and all was good).

teoparvanov commented 1 year ago

Hi @yontyon, can you please try the following (currently disabled) test with your PR? https://github.com/vmware/concord-bft/blob/master/tests/apollo/test_skvbc_restart_recovery.py#L495

I'd suggest running it in a loop, as confirmation that your changes here address the underlying test instability.

Thanks!

glevkovich commented 1 year ago

The proposed change does not handle some (possible) atomicity issues (such as crashing in the middle of the state transfer callbacks). These problems should be analyzed and handled in a different PR.

@yontyon , if you do not handle all the above side cases, then my understanding is that it is not a complete solution. If a replica does crash, will it be able to continue somehow, or is it "game over" for the replica? If the latter, are you planning to implement the additional PR ASAP?

From your words, some of the persistency tests will fail for sure. I suggest disabling them; replicas are killed there tens to hundreds of times per test, so they will certainly fail on CI.

@glevkovich , I'm referring to this piece of code:

  // Mark ST final completion to "outside world" - After this call any incoming messages are not dropped.
  setExternalCollectingState(false);

  // Now after external state have changed, invoke the 2nd group of registered callbacks
  // TODO - This next line should be converted to an internal message. There is no need for a callback here. ST is done.
  // worst case we will have some few incoming messages dropped, until key-exchange is done.
  // on_fetching_state_change_cb_registry_ should be removed
  auto size = on_fetching_state_change_cb_registry_.size();
  LOG_INFO(
      logger_,
      "Starting to invoke all registered calls (on_fetching_state_change_cb_registry_):" << KVLOG(size, checkpointNum));
  on_fetching_state_change_cb_registry_.invokeAll(checkpointNum);
  LOG_INFO(logger_, "Done invoking all registered calls (on_fetching_state_change_cb_registry_)");

AFAIU, if the replica crashes during on_fetching_state_change_cb_registry_, it won't execute the callbacks again on restart (because state transfer has already finished).

In addition, this is rather a general warning: once you rely on two different mechanisms and one of them is implemented in a state transfer callback, the developer has to keep in mind that its logic might be executed twice (in case the replica crashed during the callback execution). Until now there was a generic solution (the CRE loop), but now it is the developer's responsibility to handle all these edge cases.

@yontyon Yes, ST is done at this stage. I wonder if we still need this callback here at all? After all, it is not part of ST (ST is already done). In the current master code I've added a TODO: // TODO - This next line should be converted to an internal message. There is no need for a callback here. ST is done.
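
Purely as an illustration of the direction that TODO suggests, not concord-bft's actual internal-message API: a rough sketch with hypothetical names (StateTransferDoneMsg, InternalMsgQueue), where ST posts a message and the replica's main loop reacts to it instead of running the callbacks inline.

  // Hypothetical shape of replacing the synchronous callback registry with an
  // internal message posted once state transfer completes.
  #include <cstdint>
  #include <queue>

  struct StateTransferDoneMsg {
    uint64_t checkpointNum;
  };

  struct InternalMsgQueue {  // stand-in for the replica's internal message path
    std::queue<StateTransferDoneMsg> msgs;
    void push(const StateTransferDoneMsg& m) { msgs.push(m); }
  };

  // Instead of invoking on_fetching_state_change_cb_registry_ inline, ST would
  // post a message; the replica's main loop handles it (e.g. triggers key exchange),
  // at worst dropping a few incoming messages until that handling completes.
  void onStateTransferFinished(InternalMsgQueue& queue, uint64_t checkpointNum) {
    queue.push(StateTransferDoneMsg{checkpointNum});
  }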