imzhenyu / rDSN

Robust Distributed System Nucleus (rDSN) is an open framework for quickly building and managing high performance and robust distributed systems.
MIT License

coredump at replica::update_local_configuration() #504

Closed qinzuoyan closed 8 years ago

qinzuoyan commented 8 years ago

found by kill_test.

last log:

D18:57:51.211 (1469905071211742378 8248) replica5.replica0.0800823c00010009: replica.stub:714:on_add_learner(): 1.8@10.108.187.19:34805: received add learner, primary = 10.108.187.19:34803, ballot = 2537, status = replication::partition_status::PS_POTENTIAL_SECONDARY, last_committed_decree = 3062357
D18:57:51.211 (1469905071211783001 8248) replica5.replica0.0800823c00010009: replica.learn:997:on_add_learner(): 1.8@10.108.187.19:34805: process add learner, primary = 10.108.187.19:34803, ballot = 2537, status = replication::partition_status::PS_POTENTIAL_SECONDARY, last_committed_decree = 3062357
F18:57:51.211 (1469905071211802025 8248) replica5.replica0.0800823c00010009: /home/work/pegasus/rDSN/src/dist/replication/lib/replica_config.cpp:654:update_local_configuration(): assertion expression: false
F18:57:51.211 (1469905071211818785 8248) replica5.replica0.0800823c00010009: /home/work/pegasus/rDSN/src/dist/replication/lib/replica_config.cpp:654:update_local_configuration(): invalid execution path

coredump stack:

(gdb) bt
#0  0x00007fc050b605c9 in raise () from /lib64/libc.so.6
#1  0x00007fc050b61cd8 in abort () from /lib64/libc.so.6
#2  0x00007fc051e721e3 in dsn_coredump () at /home/work/pegasus/rDSN/src/core/core/service_api_c.cpp:307
#3  0x00007fc04feaf8f6 in dsn::replication::replica::update_local_configuration (this=0x7fbf7c000d60, config=..., same_ballot=true)
    at /home/work/pegasus/rDSN/src/dist/replication/lib/replica_config.cpp:654
#4  0x00007fc04fee10e2 in dsn::replication::replica::on_add_learner (this=0x7fbf7c000d60, request=...) at /home/work/pegasus/rDSN/src/dist/replication/lib/replica_learn.cpp:1008
#5  0x00007fc04ff11578 in dsn::replication::replica_stub::on_add_learner (this=0x285a160, request=...) at /home/work/pegasus/rDSN/src/dist/replication/lib/replica_stub.cpp:719
#6  0x00007fc04ff22ec6 in bool dsn::serverlet<dsn::replication::replica_stub>::register_rpc_handler<dsn::replication::group_check_request>(int, char const*, void (dsn::replication::replica_stub::*)(dsn::replication::group_check_request const&), dsn_global_partition_id)::{lambda(void*, void*)#1}::operator()(void*, void*) const (__closure=0x0, request=0x7fbfac001754, param=0x7fbf8c345560)
    at /home/work/pegasus/rDSN/include/dsn/cpp/serverlet.h:154
#7  0x00007fc04ff22f25 in bool dsn::serverlet<dsn::replication::replica_stub>::register_rpc_handler<dsn::replication::group_check_request>(int, char const*, void (dsn::replication::replica_stub::*)(dsn::replication::group_check_request const&), dsn_global_partition_id)::{lambda(void*, void*)#1}::_FUN(void*, void*) (request=0x7fbfac001754, param=0x7fbf8c345560)
    at /home/work/pegasus/rDSN/include/dsn/cpp/serverlet.h:148
#8  0x00007fc051e0581b in dsn::rpc_handler_info::run (this=0x7fbf8c350ed0, req=0x7fbfac001754) at /home/work/pegasus/rDSN/include/dsn/tool-api/task.h:272
#9  0x00007fc051e058d8 in dsn::rpc_request_task::exec (this=0x7fbfac001818) at /home/work/pegasus/rDSN/include/dsn/tool-api/task.h:304
#10 0x00007fc051e027b3 in dsn::task::exec_internal (this=0x7fbfac001818) at /home/work/pegasus/rDSN/src/core/core/task.cpp:204
#11 0x00007fc051e6ed4b in dsn::task_worker::loop (this=0x27f3e00) at /home/work/pegasus/rDSN/src/core/core/task_worker.cpp:357
#12 0x00007fc051e6ec9b in dsn::task_worker::run_internal (this=0x27f3e00) at /home/work/pegasus/rDSN/src/core/core/task_worker.cpp:334
#13 0x00007fc051e715dd in std::_Mem_fn<void (dsn::task_worker::*)()>::operator()<, void>(dsn::task_worker*) const (this=0x2847c30, __object=0x27f3e00) at /usr/include/c++/4.8.2/functional:601
#14 0x00007fc051e7153c in std::_Bind<std::_Mem_fn<void (dsn::task_worker::*)()> (dsn::task_worker*)>::__call<void, , 0ul>(std::tuple<>&&, std::_Index_tuple<0ul>) (this=0x2847c30, 
    __args=<unknown type in /home/work/pegasus/rDSN/install/lib/libdsn.core.so, CU 0x19d1c6c, DIE 0x1a431c8>) at /usr/include/c++/4.8.2/functional:1296
#15 0x00007fc051e714a8 in std::_Bind<std::_Mem_fn<void (dsn::task_worker::*)()> (dsn::task_worker*)>::operator()<, void>() (this=0x2847c30) at /usr/include/c++/4.8.2/functional:1355
#16 0x00007fc051e7143a in std::_Bind_simple<std::_Bind<std::_Mem_fn<void (dsn::task_worker::*)()> (dsn::task_worker*)> ()>::_M_invoke<>(std::_Index_tuple<>) (this=0x2847c30)
    at /usr/include/c++/4.8.2/functional:1732
#17 0x00007fc051e71387 in std::_Bind_simple<std::_Bind<std::_Mem_fn<void (dsn::task_worker::*)()> (dsn::task_worker*)> ()>::operator()() (this=0x2847c30) at /usr/include/c++/4.8.2/functional:1720
#18 0x00007fc051e71320 in std::thread::_Impl<std::_Bind_simple<std::_Bind<std::_Mem_fn<void (dsn::task_worker::*)()> (dsn::task_worker*)> ()> >::_M_run() (this=0x2847c18)
    at /usr/include/c++/4.8.2/thread:115
#19 0x00007fc0514b8da0 in ?? () from /lib64/libstdc++.so.6
#20 0x00007fc05254fdf3 in start_thread () from /lib64/libpthread.so.0
#21 0x00007fc050c211ad in clone () from /lib64/libc.so.6
imzhenyu commented 8 years ago

Nice. Will look into this later. A first guess: is it about the primary-to-potential-secondary transition?

shengofsun commented 8 years ago

@imzhenyu @qinzuoyan Here is why the bug happens:

  1. A replica R crashed.
  2. The meta server detects R's crash and starts to remove primaries from R. First, meta updates the config on remote storage, while keeping the local copy at the old value.
  3. R comes back alive and starts to sync the config from meta. Meta returns the old config to it, so R still regards itself as primary.
  4. After meta updates the remote config, it updates the local one. Then another replica server S is reassigned as the new primary.
  5. Meta then orders S to add R as a new potential secondary.

So what about making primary -> potential_secondary a valid switch when it comes with a larger ballot? Or simply ignoring all query_config RPCs in meta while a config is syncing with remote storage?
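For illustration, here is a minimal sketch of the kind of transition check that fires in this scenario. This is hypothetical code, not the actual rDSN source in replica_config.cpp; it only shows why an unexpected primary -> potential_secondary switch ends in the "invalid execution path" assertion seen in the logs:

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical status enum mirroring the names in the logs above.
enum class partition_status { PS_INACTIVE, PS_ERROR, PS_PRIMARY,
                              PS_SECONDARY, PS_POTENTIAL_SECONDARY };

// Sketch of a transition table: every (old, new) pair must be handled
// explicitly; anything unexpected aborts the process.
void update_local_configuration(partition_status old_status,
                                partition_status new_status)
{
    switch (old_status) {
    case partition_status::PS_PRIMARY:
        switch (new_status) {
        case partition_status::PS_PRIMARY:   // e.g. ballot bump, same role
        case partition_status::PS_INACTIVE:  // demoted or partition closed
        case partition_status::PS_ERROR:     // local failure
            break;
        default:
            // A stale "primary" receiving add_learner lands here:
            // PS_PRIMARY -> PS_POTENTIAL_SECONDARY is not a handled path.
            assert(false && "invalid execution path");
        }
        break;
    default:
        break; // transitions from other states omitted in this sketch
    }
}

int main()
{
    // Reproduce the scenario from the steps above.
    update_local_configuration(partition_status::PS_PRIMARY,
                               partition_status::PS_POTENTIAL_SECONDARY);
    std::puts("not reached");
}
```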

imzhenyu commented 8 years ago

This is a violation of the perfect FD (failure detector). Making the switch valid still breaks things; ignoring query-config amplifies the failure. A better way may be:

shengofsun commented 8 years ago
imzhenyu commented 8 years ago
shengofsun commented 8 years ago
shengofsun commented 8 years ago

In addition, allowing the "ps_primary->ps_potential_secondary" switch does violate the perfect FD, but I think it is safe for our PacificA implementation: the old primary can't commit any new mutations, and can only serve reads for committed mutations.
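As a toy illustration of that safety argument (hypothetical code assuming PacificA-style ballot checks, not rDSN's actual prepare path): a secondary acknowledges a prepare only if the sender's ballot is current, so a stale primary can never assemble a quorum for a new mutation:

```cpp
// Toy PacificA-style ballot check (hypothetical, not rDSN source).
struct prepare_msg {
    long ballot; // the sending primary's ballot
    long decree; // the mutation being replicated
};

// A secondary running at `my_ballot` rejects prepares from a stale primary,
// so the old primary cannot commit any new mutation after reconfiguration.
bool on_prepare(long my_ballot, const prepare_msg& req)
{
    return req.ballot >= my_ballot;
}
```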

shengofsun commented 8 years ago

@qinzuoyan @imzhenyu I'm trying to fix this by ignoring a config sync while a partition is syncing with remote storage. You may want to review it.

imzhenyu commented 8 years ago

@shengofsun How about we only ignore the reply for that specific partition? I'm worried about the failure amplification issue for the other replicas - a common pattern we see that may lead to cascading failures in the end.

shengofsun commented 8 years ago

Currently my fix is to add a new RPC for the replica to do the config-sync, instead of the old query-configuration-by-node. So I don't see this as a big deal. Clients who want to query cluster information from meta still get a response immediately, regardless of whether it is stale; for replica servers, only the config-sync is impacted. On the other hand, we shouldn't respond with partial information to a config-sync, because the replica server will remove replicas that are not found on the meta server. To address your worry, I can optimize this by checking whether a syncing partition is related to a node. Say, we may respond to a config-sync when the partition is adding a new node or removing some other node :)
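A rough sketch of the per-node check suggested at the end (names are illustrative, not rDSN's): reply to a config-sync only when no partition involving the querying node is still writing its configuration to remote storage:

```cpp
#include <vector>

// Hypothetical per-partition state on the meta server.
struct partition_state {
    bool syncing_with_remote; // config write to remote storage in flight?
    bool involves_node;       // precomputed: does the querying node serve it?
};

// Answer a config-sync only if every partition touching this node is settled;
// a partial answer could make the replica server delete replicas it should keep.
bool can_reply_config_sync(const std::vector<partition_state>& partitions)
{
    for (const auto& p : partitions)
        if (p.syncing_with_remote && p.involves_node)
            return false;
    return true;
}
```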

imzhenyu commented 8 years ago

I mean that skipping the reply unnecessarily prolongs the unavailability of the other replicas, which may cause unnecessary fail-over. In certain cases, it can lead to cascading failures that take many replicas offline. A better remedy could be to specifically mark that particular replica as being under syncing, or to pre-calculate the result for that replica and send the calculated result to the sync clients.

qinzuoyan commented 8 years ago

Let me describe my understanding.

The cause of the coredump here:

Continuing from the latter case above: on_ping_internal() runs after check_all_records() and re-establishes the heartbeat connection. Once the heartbeat is up, the replica-server immediately issues a config_sync request. On receiving the config_sync request, the meta-server calls meta_service::on_query_configuration_by_node() to fetch the configuration. If this runs before the config update triggered by the heartbeat timeout has completed, the replica gets the stale state, still believes it is the primary, and the subsequent coredump follows.

When we run kill_test on the test servers, this problem is very hard to reproduce, but on a single machine it reproduces fairly easily. I think the reason is that RPCs on a single machine are so fast that re-establishing the heartbeat and sending config_sync happen very quickly, which raises the probability that the query_config operation executes before the config update triggered by the heartbeat timeout completes.

The idea for the fix is:

Then, on what happens if the meta-server ignores config_sync() requests:

qinzuoyan commented 8 years ago

"I mean skipping the reply unnecessarily make the time of unavailability of other replicas longer, which may cause unnecessary fail-over."

@imzhenyu , I don't think it's a big problem. For initialing replica-server, if config_sync is ignored, then another config_sync will be started soon; for normal replica-server, ignoring config_sync would not affect availability.

Or, we can make meta-server return some error code like ERR_BUSY to replica-server, but not depending on timeout.

qinzuoyan commented 8 years ago

@imzhenyu , we ran kill_test on @shengofsun's revised code, and the coredump has not reproduced for a whole night. How about we merge the change first and keep this issue open, and later see whether we can solve it more elegantly?

imzhenyu commented 8 years ago

@qinzuoyan Thanks. ERR_BUSY will be much better than simply ignoring the replica's request. The replica should retry very soon (e.g., after several hundred ms) to reduce the chance of unnecessary fail-over of the other replicas.
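A minimal sketch of that protocol (hypothetical names; rDSN's real error codes and RPC plumbing are not shown): the meta server answers ERR_BUSY while the partition's config is still being written to remote storage, and the replica retries after a short delay instead of waiting for a full RPC timeout:

```cpp
#include <chrono>
#include <thread>

enum class error_code { ERR_OK, ERR_BUSY };

// Meta side: refuse to answer with possibly-stale state while the remote
// storage write is in flight.
error_code meta_on_config_sync(bool syncing_with_remote)
{
    return syncing_with_remote ? error_code::ERR_BUSY : error_code::ERR_OK;
}

// Replica side: on ERR_BUSY, retry soon (a few hundred ms) so the other
// replicas are not kept unavailable long enough to trigger fail-over.
void replica_config_sync(bool meta_is_syncing)
{
    if (meta_on_config_sync(meta_is_syncing) == error_code::ERR_BUSY) {
        std::this_thread::sleep_for(std::chrono::milliseconds(300));
        // ... re-issue the config_sync request here ...
    }
}
```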

imzhenyu commented 8 years ago

@qinzuoyan @shengofsun Merged. There are two main points here that we can continue to discuss later:

shengofsun commented 8 years ago

@imzhenyu @qinzuoyan

imzhenyu commented 8 years ago

Closing since it is done; we can reopen later if we come up with a better improvement.