Nice. Will look into this later. A first guess: is it about the primary-to-potential-secondary switch?
@imzhenyu @qinzuoyan The reason the bug happens is as follows:
So what about making primary -> potential_sec a valid switch when it comes with a larger ballot? Or simply ignoring all query_config RPCs in meta while a config is syncing with remote storage?
This is a violation of the perfect FD. Making the switch valid still breaks it. Ignoring query-config amplifies the failure. A better way may be:
In addition, making the "ps_primary -> ps_potential_secondary" switch does violate the perfect FD, but I think it is safe for our PacificA implementation: the old primary can't commit any new mutations, and can only serve reads for already-committed mutations.
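To make that PacificA argument concrete, here is a minimal, self-contained sketch (plain C++, not the actual rDSN code; `prepare_msg`, `secondary_state`, and `accept_prepare` are made-up names) of the ballot check that keeps a stale primary from committing: secondaries reject any prepare carrying a smaller ballot, so the old primary can never gather the acks it needs.

```cpp
#include <cassert>
#include <cstdint>

struct prepare_msg {
    int64_t ballot;   // ballot of the (possibly stale) primary that sent the prepare
    int64_t decree;   // mutation id
};

struct secondary_state {
    int64_t current_ballot = 0;

    // A secondary only acks prepares coming from the primary of the newest
    // configuration it has seen; anything with a smaller ballot is rejected.
    bool accept_prepare(const prepare_msg& p)
    {
        if (p.ballot < current_ballot)
            return false;              // stale primary: no ack, so it can never commit
        current_ballot = p.ballot;
        return true;                   // normal two-phase path continues
    }
};

int main()
{
    secondary_state s;
    s.current_ballot = 5;                    // a reconfiguration already bumped the ballot
    assert(!s.accept_prepare({4, 100}));     // old primary (ballot 4) cannot gather acks
    assert(s.accept_prepare({5, 101}));      // current primary proceeds normally
    return 0;
}
```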
@qinzuoyan @imzhenyu I'm trying to fix this by ignoring a config-sync while a partition is syncing with remote storage. You may want to review.
@shengofsun How about we only ignore the reply for that specific partition? I'm worried about the failure amplification issue for the other replicas - a common pattern we see that may lead to cascading failures in the end.
Currently my fix is to add a new RPC for the replica server to do the config-sync, instead of reusing the old query-configuration-by-node, so I don't see this as a big deal. Clients who want to query cluster information from the meta server still get a response immediately, regardless of whether it is stale.
For replica servers, only the "config-sync" is impacted.
On the other hand, we shouldn't respond with partial information to a "config-sync", as the replica server will remove replicas that are not found on the meta server.
To address your concern, I can optimize this by checking whether a syncing partition is related to the requesting node; say, we may still respond to a "config-sync" when the syncing partition is only adding a new node or removing some other node :)
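A hedged sketch of what that check could look like, with hypothetical stand-in types (`partition_state`, `rpc_address`, and `can_reply_config_sync` are illustrative, not the real rDSN classes): the meta-server would only withhold the config-sync reply when a partition currently syncing to remote storage actually involves the requesting node.

```cpp
#include <string>
#include <vector>

struct rpc_address {
    std::string host;
    int port = 0;
    bool operator==(const rpc_address& o) const { return host == o.host && port == o.port; }
};

struct partition_state {
    bool syncing_to_remote = false;    // a config update for this partition is in flight to remote storage
    std::vector<rpc_address> members;  // nodes referenced by the pending configuration change
};

// true  -> safe to answer this node's config-sync immediately
// false -> the reply would expose a configuration that is still being written
bool can_reply_config_sync(const std::vector<partition_state>& partitions,
                           const rpc_address& node)
{
    for (const auto& p : partitions) {
        if (!p.syncing_to_remote)
            continue;                       // stable partition: never blocks the reply
        for (const auto& m : p.members) {
            if (m == node)
                return false;               // node is part of an in-flight change: hold the reply
        }
    }
    return true;
}
```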
I mean skipping the reply unnecessarily lengthens the unavailability of the other replicas, which may cause unnecessary fail-over. In certain cases it can lead to cascading failures that take many replicas offline. A better remedy could be to mark that particular partition as being under sync, or to pre-calculate the result for that replica and send the calculated result to the sync clients.
Let me lay out my understanding.
Here is why the coredump happens:
Continuing from the latter case above: on_ping_internal() runs after check_all_records() and re-establishes the heartbeat connection. Once the heartbeat is re-established, the replica-server immediately issues a config_sync request. On receiving it, the meta-server calls meta_service::on_query_configuration_by_node() to fetch the configuration. If this happens before the "config update triggered by heartbeat timeout" has finished, the replica gets the stale state, still believes it is the primary, and the subsequent coredump follows.
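A tiny illustration of that last step (plain C++, not the real replica code; the enum and `is_valid_switch` are simplified stand-ins): the state machine does not treat ps_primary -> ps_potential_secondary as a legal transition, so when the stale config-sync answer leaves the node in ps_primary and a newer configuration with a larger ballot then tries to turn it into a learner, the corresponding assert fires and produces the coredump.

```cpp
#include <cassert>

enum partition_status { ps_inactive, ps_potential_secondary, ps_secondary, ps_primary };

// Simplified transition table: a primary may step down, but jumping straight
// to potential secondary (learner) is not considered a legal switch.
bool is_valid_switch(partition_status from, partition_status to)
{
    return !(from == ps_primary && to == ps_potential_secondary);
}

int main()
{
    // Stale config-sync reply: the node still believes it is ps_primary.
    // A newer configuration (larger ballot) then asks it to become a
    // potential secondary; in the real replica this invalid switch trips
    // an assert, which is the coredump reported in this issue.
    assert(!is_valid_switch(ps_primary, ps_potential_secondary));
    return 0;
}
```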
When we run kill_test on the test servers, this problem is hard to reproduce; when running kill_test on a single machine, it reproduces much more easily. I think the reason is that RPCs on a single machine are so fast that re-establishing the heartbeat and sending config_sync happen very quickly, which increases the probability that the "query_config operation" executes before the "config update triggered by heartbeat timeout" completes.
The idea of the fix is:
Next, let's discuss the consequences if the meta-server ignores config_sync() requests:
"I mean skipping the reply unnecessarily make the time of unavailability of other replicas longer, which may cause unnecessary fail-over."
@imzhenyu, I don't think it's a big problem. For an initializing replica-server, if a config_sync is ignored, another config_sync will be issued soon; for a normally running replica-server, ignoring a config_sync does not affect availability.
Or, we can make the meta-server return an error code like ERR_BUSY to the replica-server, rather than relying on a timeout.
@imzhenyu, we ran kill_test against @shengofsun's updated code and the core has not reproduced for a whole night. How about we merge the change first and keep this issue open, then see later whether it can be solved more thoroughly?
@qinzuoyan Thanks. ERR_BUSY will be much better than simply ignoring the replica request. The replica should retry very soon (e.g., after several hundred milliseconds) to reduce the chance of unnecessary fail-over of the other replicas.
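A minimal sketch of that replica-side policy, with hypothetical names and values (`next_config_sync_delay` and the intervals are illustrative, not the actual rDSN API): on ERR_BUSY the replica re-issues the config-sync after a few hundred milliseconds instead of waiting out the regular periodic interval, so the window of staleness stays short.

```cpp
#include <chrono>

using namespace std::chrono_literals;

enum error_code { ERR_OK, ERR_BUSY, ERR_TIMEOUT };

// Pick the delay before the next config-sync based on what the meta-server
// answered last time. The concrete intervals are illustrative only.
std::chrono::milliseconds next_config_sync_delay(error_code last_result)
{
    switch (last_result) {
    case ERR_OK:
        return 30s;     // regular periodic config-sync interval
    case ERR_BUSY:
        return 300ms;   // meta is mid-update: retry soon to keep staleness short
    default:
        return 1s;      // other transient failures: back off a little
    }
}

int main()
{
    // After an ERR_BUSY reply the replica would retry after ~300ms
    // instead of waiting for the full periodic interval.
    return next_config_sync_delay(ERR_BUSY) == 300ms ? 0 : 1;
}
```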
@qinzuoyan @shengofsun Merged. There are two main points we can continue to discuss later:
@imzhenyu @qinzuoyan
Closing since it is done - we can reopen later if we come up with a better improvement.
found by kill_test.
last log:
coredump stack: