imzhenyu / rDSN.dist.service

Service frameworks targeting high availability, reliability, scalability, and consistency for micro services and storages
MIT License

assertion failure due to crashing when generating checkpoint in learning #13

Open shengofsun opened 8 years ago

shengofsun commented 8 years ago

#0  0x00007ff01fc385c9 in raise () from /lib64/libc.so.6

#1  0x00007ff01fc39cd8 in abort () from /lib64/libc.so.6

#2  0x00007ff020f1c5af in dsn_coredump () at /home/work/pegasus/rDSN/src/core/core/service_api_c.cpp:307

#3  0x00007ff01ef76259 in dsn::replication::prepare_list::commit (this=0x7fef3c0da440, d=27651077, ct=dsn::replication::COMMIT_TO_DECREE_HARD) at /home/work/pegasus/rDSN/src/dist/replication/lib/prepare_list.cpp:169

#4  0x00007ff01ef75ea9 in dsn::replication::prepare_list::prepare (this=0x7fef3c0da440, mu=..., status=dsn::replication::partition_status::PS_INACTIVE) at /home/work/pegasus/rDSN/src/dist/replication/lib/prepare_list.cpp:136

#5  0x00007ff01ef4672c in dsn::replication::replica::replay_mutation (this=0x7fef3c0aa1a0, mu=..., is_private=false) at /home/work/pegasus/rDSN/src/dist/replication/lib/replica_init.cpp:460

#6  0x00007ff01efc1140 in dsn::replication::replica_stub::__lambda12::operator() (__closure=0x7fef5c81efa0, mu=...) at /home/work/pegasus/rDSN/src/dist/replication/lib/replica_stub.cpp:253

#7  0x00007ff01efcaf9a in std::_Function_handler<bool(dsn::ref_ptr<dsn::replication::mutation>&), dsn::replication::replica_stub::initialize(const dsn::replication::replication_options&, bool)::__lambda12>::_M_invoke(const std::_Any_data &, dsn::ref_ptr<dsn::replication::mutation> &) (__functor=..., __args#0=...) at /usr/include/c++/4.8.2/functional:2057

#8  0x00007ff01effd3a9 in std::function<bool (dsn::ref_ptr<dsn::replication::mutation>&)>::operator()(dsn::ref_ptr<dsn::replication::mutation>&) const (this=0x7fef5c83ba98, __args#0=...) at /usr/include/c++/4.8.2/functional:2464

#9  0x00007ff01efed142 in dsn::replication::mutation_log::__lambda14::operator() (__closure=0x7fef5c83ba90, mu=...) at /home/work/pegasus/rDSN/src/dist/replication/lib/mutation_log.cpp:681

#10 0x00007ff01eff667f in std::_Function_handler<bool(dsn::ref_ptr<dsn::replication::mutation>&), dsn::replication::mutation_log::open(dsn::replication::mutation_log::replay_callback, dsn::replication::mutation_log::io_failure_callback, const std::map<dsn::gpid, long int>&)::__lambda14>::_M_invoke(const std::_Any_data &, dsn::ref_ptr<dsn::replication::mutation> &) (__functor=..., __args#0=...) at /usr/include/c++/4.8.2/functional:2057

#11 0x00007ff01effd3a9 in std::function<bool (dsn::ref_ptr<dsn::replication::mutation>&)>::operator()(dsn::ref_ptr<dsn::replication::mutation>&) const (this=0x7fef6eff9840, __args#0=...) at /usr/include/c++/4.8.2/functional:2464

#12 0x00007ff01efef9f6 in dsn::replication::mutation_log::replay(dsn::ref_ptr<dsn::replication::log_file>, std::function<bool (dsn::ref_ptr<dsn::replication::mutation>&)>, long&) (log=..., callback=..., end_offset=@0x7fef6eff9a38: 49708101254) at /home/work/pegasus/rDSN/src/dist/replication/lib/mutation_log.cpp:915

Python Exception <type 'exceptions.IndexError'> list index out of range:

#13 0x00007ff01eff0404 in dsn::replication::mutation_log::replay(std::map<int, dsn::ref_ptr<dsn::replication::log_file>, std::less<int>, std::allocator<std::pair<int const, dsn::ref_ptr<dsn::replication::log_file> > > >&, std::function<bool (dsn::ref_ptr<dsn::replication::mutation>&)>, long&) (logs=std::map with 4 elements, callback=..., end_offset=@0x7fef6eff9a38: 49708101254) at /home/work/pegasus/rDSN/src/dist/replication/lib/mutation_log.cpp:1012

Python Exception <type 'exceptions.IndexError'> list index out of range:

#14 0x00007ff01efee393 in dsn::replication::mutation_log::open(std::function<bool (dsn::ref_ptr<dsn::replication::mutation>&)>, std::function<void (dsn::error_code)>, std::map<dsn::gpid, long, std::less<dsn::gpid>, std::allocator<std::pair<dsn::gpid const, long> > > const&) (this=0x7fef5c002fc0, read_callback=..., write_error_callback=..., replay_condition=std::map with 11 elements) at /home/work/pegasus/rDSN/src/dist/replication/lib/mutation_log.cpp:673

#15 0x00007ff01efc22c5 in dsn::replication::replica_stub::initialize (this=0x20a3fa0, opts=..., clear=false) at /home/work/pegasus/rDSN/src/dist/replication/lib/replica_stub.cpp:262

#16 0x00007ff01ef7f33d in dsn::replication::replication_service_app::start (this=0x20a3f60, argc=1, argv=0x7fef5c001800) at /home/work/pegasus/rDSN/src/dist/replication/lib/replication_service_app.cpp:76

Cause of this:

  1. A learner learns the private log from the primary, applies the log, and then starts to learn the cache.
  2. After learning the cache from the primary, it appends the cache content to its private log and shared log.
  3. The learner starts to generate a checkpoint, and crashes while doing so.
  4. When the learner restarts, it tries to replay the log. But the content of its log is not continuous with the checkpoint, so it crashes.
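The continuity problem in step 4 can be sketched in isolation. This is only a simplified model, not rDSN code: `decree` is reduced to a plain integer, and `log_is_continuous_with_checkpoint` is a hypothetical helper that stands in for the check that fails inside `prepare_list::commit` in the trace above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model: a decree is the sequence number of a mutation.
using decree = std::int64_t;

// Returns true if replaying `log_decrees` on top of a checkpoint that covers
// everything up to `checkpoint_decree` never has to skip a decree.
inline bool log_is_continuous_with_checkpoint(decree checkpoint_decree,
                                              const std::vector<decree>& log_decrees)
{
    decree last_committed = checkpoint_decree;
    for (decree d : log_decrees) {
        if (d <= last_committed)
            continue;             // already covered by the checkpoint
        if (d != last_committed + 1)
            return false;         // gap: this mutation cannot be committed
        last_committed = d;
    }
    return true;
}
```

With a checkpoint at decree 10, a log holding 9, 10, 11, 12 replays cleanly, while a log that only holds 15 and 16 (for example, cache content appended before the newer checkpoint was finished) leaves a gap and is rejected.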

Possible fix: What about generating a checkpoint at every learn stage? @imzhenyu @qinzuoyan

imzhenyu commented 8 years ago

@shengofsun This has been considered before. It happens when the checkpoint is done or an existing checkpoint is somehow missing or in a wrong state. In that case, we have an INCOMPLETE_STATE error for app::open_internal. It seems that our handling in replica::init_app_and_prepare_list and replica_stub::initialize may have some issues (e.g., forgetting to clear out the private log or something). You may want to have a check. Generating a checkpoint in every stage won't fix this issue.

qinzuoyan commented 8 years ago

@imzhenyu, surely there is some problem in the handling of INCOMPLETE_STATE. Would you have time to review and fix the code together with us?

imzhenyu commented 8 years ago

Sure. I'll work on it then and you guys may need to help do the review and test later. Should be done today.

shengofsun commented 8 years ago

@imzhenyu Generating a checkpoint in every stage fixes this, as the log will be continuous when the learner starts to write mutations to the log while learning from the cache. Otherwise we need to abandon the extra mutations when initializing, which is not easy.

BTW, I don't quite understand why the log needs to be truncated by the "init_info". Is it only an optimization or a necessity for correctness?

imzhenyu commented 8 years ago

Being continuous is not easy when certain failures happen (e.g., disk failure), even if you generate checkpoints at every stage. So in any case we have to handle incomplete data, which invalidates the later private or shared logs. A simple approach, as in the PRs above, is to clear out the replica whenever there is an error during load. Previously I wanted to do certain optimizations, which introduced some problems.
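As a rough illustration of the "clear out on load error" idea, here is a minimal sketch. Everything in it is an assumption made for the example: `replica_files`, `load_status`, `try_load`, and `load_or_clear` are hypothetical names, not the types or functions used by the actual fix.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins; none of these are rDSN types.
enum class load_status { ok, incomplete_state };

struct replica_files {
    std::int64_t checkpoint_decree = 0;        // last decree covered by the on-disk checkpoint
    std::vector<std::int64_t> logged_decrees;  // decrees found in the replica's log, in order
    bool must_relearn = false;                 // set when local state was discarded
};

// Replays the log on top of the checkpoint and reports any gap.
inline load_status try_load(const replica_files& r)
{
    std::int64_t last = r.checkpoint_decree;
    for (std::int64_t d : r.logged_decrees) {
        if (d <= last)
            continue;                          // already covered by the checkpoint
        if (d != last + 1)
            return load_status::incomplete_state;
        last = d;
    }
    return load_status::ok;
}

// On any load error, drop the checkpoint and logs instead of trying to
// salvage them, and mark the replica so it learns from scratch.
inline void load_or_clear(replica_files& r)
{
    if (try_load(r) != load_status::ok) {
        r = replica_files{};
        r.must_relearn = true;
    }
}
```

The trade-off in this sketch is that a replica with any inconsistency loses all local state and must learn again, in exchange for never asserting during replay.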