Open shengofsun opened 8 years ago
@shengofsun This has been considered before. It happens when the checkpoint is done or an existing checkpoint is somehow missing or in a wrong state. In that case, we get an INCOMPLETE_STATE error from app::open_internal. It seems that our handling in replica::init_app_and_prepare_list and replica_stub::initialize may have some issues (e.g., forgetting to clear out the private log). You may want to check. Generating a checkpoint in every stage won't fix this issue.
@imzhenyu, there is surely some problem in the handling of INCOMPLETE_STATE. Would you have time to review and fix the code together with us?
Sure. I'll work on it then and you guys may need to help do the review and test later. Should be done today.
@imzhenyu Generating a checkpoint in every stage fixes this, because the log will be continuous when the learner starts to write mutations to the log while learning from cache. Otherwise we need to abandon the extra mutations when initializing, which is not easy.
BTW, I don't quite understand why the log needs to be truncated by the "init_info". Is it only an optimization, or is it necessary for correctness?
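To make the continuity point concrete, here is a minimal sketch of deciding which logged mutations can be replayed on top of a checkpoint and which "extra" ones would have to be abandoned. Everything in it (`replay_plan`, `plan_replay`, the use of plain integers for decrees) is a hypothetical illustration, not rDSN code, and it assumes decrees must extend the checkpoint one by one with no gaps.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type, not an rDSN structure.
struct replay_plan {
    int64_t last_applied;   // decree reached after replaying the continuous part
    std::size_t abandoned;  // log entries from the first gap onward
};

// Assumption: `logged` holds the decrees found in the log, in log order,
// and `durable_decree` is the last decree covered by the checkpoint.
replay_plan plan_replay(int64_t durable_decree, const std::vector<int64_t>& logged) {
    replay_plan p{durable_decree, 0};
    std::size_t i = 0;
    for (; i < logged.size(); ++i) {
        if (logged[i] <= p.last_applied) {
            continue;  // already covered by the checkpoint or an earlier entry
        }
        if (logged[i] != p.last_applied + 1) {
            break;     // gap: the log is not continuous with the state so far
        }
        p.last_applied = logged[i];
    }
    p.abandoned = logged.size() - i;  // everything from the gap onward
    return p;
}
```

With a checkpoint at decree 10 and logged decrees 9, 10, 11, 12, 15, 16, this sketch replays up to 12 and marks the last two entries as abandoned.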
Being continuous is not easy when certain failures happen (e.g., disk failure), even if you generate checkpoints at every stage. So in any case we have to handle incomplete data, which invalidates the later private or shared logs. A simple approach, as in the above PRs, is to clear out the replicas whenever there is an error during load. Previously I wanted to do certain optimizations, which introduced some problems.
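As a rough sketch of the "clear out on any load error" approach: replicas that fail to open cleanly are dropped up front, so their private and shared log entries are never replayed against incomplete state. The types and names below (`load_result`, `replica_dir`, `load_all`) are hypothetical stand-ins for illustration and do not mirror the actual rDSN code or the PRs mentioned.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are rDSN types.
enum class load_result { ok, incomplete_state };

struct replica_dir {
    std::string name;
    load_result state;  // what opening the app reported for this replica
};

struct load_summary {
    std::vector<std::string> loaded;   // replicas kept for log replay
    std::vector<std::string> cleared;  // replicas discarded, to be re-learned later
};

// Keep only replicas that opened cleanly. Any error means the whole replica
// is cleared rather than partially recovered.
load_summary load_all(const std::vector<replica_dir>& dirs) {
    load_summary s;
    for (const auto& d : dirs) {
        if (d.state == load_result::ok) {
            s.loaded.push_back(d.name);
        } else {
            s.cleared.push_back(d.name);
        }
    }
    return s;
}
```

The trade-off in this sketch is simplicity over recovery speed: a cleared replica has to be rebuilt from scratch, but no later log entry can be applied on top of incomplete data.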
0 0x00007ff01fc385c9 in raise () from /lib64/libc.so.6
1 0x00007ff01fc39cd8 in abort () from /lib64/libc.so.6
2 0x00007ff020f1c5af in dsn_coredump () at /home/work/pegasus/rDSN/src/core/core/service_api_c.cpp:307
3 0x00007ff01ef76259 in dsn::replication::prepare_list::commit (this=0x7fef3c0da440, d=27651077, ct=dsn::replication::COMMIT_TO_DECREE_HARD)
4 0x00007ff01ef75ea9 in dsn::replication::prepare_list::prepare (this=0x7fef3c0da440, mu=..., status=dsn::replication::partition_status::PS_INACTIVE)
5 0x00007ff01ef4672c in dsn::replication::replica::replay_mutation (this=0x7fef3c0aa1a0, mu=..., is_private=false) at /home/work/pegasus/rDSN/src/dist/replication/lib/replica_init.cpp:460
6 0x00007ff01efc1140 in dsn::replication::replica_stub::lambda12::operator() (closure=0x7fef5c81efa0, mu=...) at /home/work/pegasus/rDSN/src/dist/replication/lib/replica_stub.cpp:253
7 0x00007ff01efcaf9a in std::_Function_handler<bool(dsn::ref_ptr&), dsn::replication::replica_stub::initialize(const dsn::replication::replication_options&, bool)::__lambda12>::_M_invoke(const std::_Any_data &, dsn::ref_ptrdsn::replication::mutation &) (functor=..., args#0=...) at /usr/include/c++/4.8.2/functional:2057
8 0x00007ff01effd3a9 in std::function<bool (dsn::ref_ptr&)>::operator()(dsn::ref_ptrdsn::replication::mutation&) const (this=0x7fef5c83ba98, __args#0=...)
9 0x00007ff01efed142 in dsn::replication::mutation_log::lambda14::operator() (closure=0x7fef5c83ba90, mu=...) at /home/work/pegasus/rDSN/src/dist/replication/lib/mutation_log.cpp:681
10 0x00007ff01eff667f in std::_Function_handler<bool(dsn::ref_ptr&), dsn::replication::mutation_log::open(dsn::replication::mutation_log::replay_callback, dsn::replication::mutation_log::io_failure_callback, const std::map<dsn::gpid, long int>&)::__lambda14>::_M_invoke(const std::_Any_data &, dsn::ref_ptrdsn::replication::mutation &) (
11 0x00007ff01effd3a9 in std::function<bool (dsn::ref_ptr&)>::operator()(dsn::ref_ptrdsn::replication::mutation&) const (this=0x7fef6eff9840, __args#0=...)
12 0x00007ff01efef9f6 in dsn::replication::mutation_log::replay(dsn::ref_ptrdsn::replication::log_file, std::function<bool (dsn::ref_ptr&)>, long&) (log=...,
Python Exception <type 'exceptions.IndexError'> list index out of range:
13 0x00007ff01eff0404 in dsn::replication::mutation_log::replay(std::map<int, dsn::ref_ptr, std::less, std::allocator<std::pair<int const, dsn::ref_ptr > > >&, std::function<bool (dsn::ref_ptr&)>, long&) (logs=std::map with 4 elements, callback=...,
Python Exception <type 'exceptions.IndexError'> list index out of range:
14 0x00007ff01efee393 in dsn::replication::mutation_log::open(std::function<bool (dsn::ref_ptr&)>, std::function<void (dsn::error_code)>, std::map<dsn::gpid, long, std::less, std::allocator<std::pair<dsn::gpid const, long> > > const&) (this=0x7fef5c002fc0, read_callback=..., write_error_callback=...,
15 0x00007ff01efc22c5 in dsn::replication::replica_stub::initialize (this=0x20a3fa0, opts=..., clear=false) at /home/work/pegasus/rDSN/src/dist/replication/lib/replica_stub.cpp:262
16 0x00007ff01ef7f33d in dsn::replication::replication_service_app::start (this=0x20a3f60, argc=1, argv=0x7fef5c001800)
Cause of this:
Possible fix: What about generating a checkpoint in every learn stage? @imzhenyu @qinzuoyan