imzhenyu / rDSN

Robust Distributed System Nucleus (rDSN) is an open framework for quickly building and managing high performance and robust distributed systems.
MIT License

leave replica in error state after incomplete state during loading #536

Closed: imzhenyu closed this 8 years ago

imzhenyu commented 8 years ago

Fix issue https://github.com/imzhenyu/rDSN.dist.service/issues/13

imzhenyu commented 8 years ago

@shengofsun @qinzuoyan please take a look and test if possible. The idea is to delete the replica whenever there is an error during loading.
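
For illustration only, a minimal sketch of the proposed policy; the names `load_replica_state` and `open_replica` are hypothetical stand-ins, not the actual rDSN code paths. The point it shows is just that any error during loading drops the whole on-disk state, so the replica is later re-created and re-learned instead of being repaired in place.

#include <cstdio>
#include <filesystem>
#include <string>

// Hypothetical stand-in for the replica's load path; the real rDSN code differs.
enum class load_result { ok, incomplete_state, io_error };

load_result load_replica_state(const std::string& dir)
{
    // ... open the app state and private log, validate them, etc. ...
    return load_result::incomplete_state; // pretend the on-disk state is broken
}

bool open_replica(const std::string& dir)
{
    if (load_replica_state(dir) != load_result::ok) {
        // Error during loading: drop the whole state instead of keeping the
        // replica around in an error state; learning will rebuild it later.
        std::printf("load failed for %s, removing local state\n", dir.c_str());
        std::filesystem::remove_all(dir);
        return false;
    }
    return true;
}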

shengofsun commented 8 years ago

I think a replica is very likely to crash during checkpointing, so an error state will lead to a large amount of data to transfer. As far as I can see, the fix looks more like hiding a bug than fixing one.

imzhenyu commented 8 years ago

Why very likely?

imzhenyu commented 8 years ago

If it happens often, we should certainly handle it carefully. But if it happens rarely, we can handle it in a way that may not be the most efficient but is simple and correct.

shengofsun commented 8 years ago

Learning happens not only because of machine failures, but also because of load balancing and (potential) secondary timeouts. Taking this into account, I don't think a process crash while generating a checkpoint is rare. The root cause is that the learned logs are abandoned after being applied on the learner, so what about trying to keep these logs?

imzhenyu commented 8 years ago

Learning is definitely not rare, but learning failure, or more specifically a failure between update_init_info and the checkpoint, should be rare. On the other hand, keeping the learned logs and/or checkpoints for each learning round to ensure the state is contiguous on disk is doable but too costly. Finally, even the on-disk state may fail for other reasons (e.g., human misoperation). We can therefore simplify the handling of this problem by removing the whole state if an error happens during loading. Later on, we can optimize by keeping the contiguous part of the hard state to reduce the subsequent learning cost; however, that should be done only when we have evidence that this becomes a real issue.

shengofsun commented 8 years ago

I may have misunderstood "update_init_info", but what is it for?

imzhenyu commented 8 years ago

update_init_info truncates the logs. Usually we need to do the checkpoint before update_init_info. But in learning, we don't want to do the checkpoint within replication there, so we have an inverted-order problem that can introduce an incomplete on-disk state. See the following lines of code in replica_learn.cpp.

// reset log positions for later mutations
// WARNING: it still requires checkpoint operation in later
// on_copy_remote_state_completed to ensure the state is completed
// if there is a failure in between, our checking
// during app::open_internal will invalidate the logs
// appended by the mutations AFTER current position
err = _app->update_init_info(
    this,
    _stub->_log->on_partition_reset(get_gpid(), _app->last_committed_decree()),
    _private_log->on_partition_reset(get_gpid(), _app->last_committed_decree()),
    _app->last_committed_decree()
    );

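To make the ordering concern concrete, here is an illustrative sketch, not the real rDSN APIs: `do_checkpoint` and `update_init_info_stub` are empty placeholders that only contrast the normal order with the learning path discussed above.

// Illustrative only; these are empty placeholders, not the real rDSN calls.
void do_checkpoint() {}          // writes a durable app checkpoint
void update_init_info_stub() {}  // records log positions / truncates old logs

// Normal path: checkpoint first, so truncating the logs afterwards is safe.
void normal_order()
{
    do_checkpoint();
    update_init_info_stub();
}

// Learning path (as in the excerpt above): truncation happens first, and the
// checkpoint only comes later in on_copy_remote_state_completed. A crash in
// between leaves an incomplete on-disk state, which this PR handles by
// deleting the replica when loading fails.
void learning_order()
{
    update_init_info_stub();
    do_checkpoint();
}
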
shengofsun commented 8 years ago

What I mean is: why do we need to truncate the log after learning? Can't the mutations just be appended to the old log files, since the coming mutations should have a larger (ballot, decree)? I think we could just move the learned log into the plog directory if there were no truncation.

imzhenyu commented 8 years ago

We did not do that for performance reasons; see apply_learned_state_from_private_log.

imzhenyu commented 8 years ago

@shengofsun Let's merge this for now and improve it with a better solution in the future.