helium / blockchain-core

Apache License 2.0
214 stars 86 forks source link

properly record rejector height for deferred txns #1461

Closed andymck closed 2 years ago

andymck commented 2 years ago

Fixes a bug when dealing with deferred txns which resulted in the txn mgr crashing out and dropping txns.

When txn mgr receives a rejection from a CG member for a submitted txn, it checks the height of the CG member against its own height. If the txn mgr is behind the CG member then the rejection is added to a pending list called deferred rejections.

Each add block event this list is processed and any pending rejection which matches the current height of the block being handled by txn mgr is then processed and applied. As we are processing this list each block it was assumed the txn mgr would never run ahead of the height of a rejection. This assumption was implemented here:

https://github.com/helium/blockchain-core/blob/46f514bc9abaad91507d502e1a34d8f66ada9588/src/transactions/blockchain_txn_mgr.erl#L536

However the assumption was not valid. The log extract below shows the current height of txn mgr:

2022-08-10 18:51:58.036 [debug] <0.19661.1698>@blockchain_txn_mgr:handle_add_block_event:{512,13} received block height: 1472629, updated state block height: 1472629

There is then a crash at the above mentioned line 536 :

022-08-10 18:51:58.037 [debug] <0.19661.1698>@blockchain_txn_mgr:terminate:{464,5} terminating with reason {{badmatch,{[{<0.18651.1882>,<<"-576328476948906414">>......

the binary value <<"-576328476948906414">> is the key of the txn and other logs show this same txn in the deferred list with a height less than the current txn mgr height

blockchain.log:2022-08-10 18:51:58.037 [debug] <0.19661.1698>@blockchain_txn_mgr:terminate:{464,5} terminating with reason {{badmatch,{[{<0.18651.1882>,<<"-576328476948906414">>,{blockchain_txn_routing_v1_pb,1,<<0,241,20,68,146,24,117,226,239,116,53,81,58,29,31,27,15,164,158,50,66,149,106,36,56,57,18,236,93,79,25,64,119>>,{update_xor,{update_xor_pb,4,<<193,92,2,137,236,45,10,145,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,133,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0>>}},25000,23898,<<48,69,2,32,112,218,195,203,152,15,229,105,124,233,132,239,59,133,159,161,144,74,56,118,137,173,233,108,220,118,44,92,4,68,17,0,2,33,0,239,92,69,202,33,20,110,44,175,29,25,218,117,92,120,142,171,70,55,96,47,99,23,46,139,82,254,112,114,180,106,145>>,0},<<0,178,4,179,76,225,237,59,224,193,22,222,165,41,233,222,129,255,59,22,187,208,174,206,6,33,81,200,173,207,74,205,44>>,{blockchain_txn_info_v1_pb,<<"txn_rejected">>,<<"bad_nonce">>,<<131,100,0,9,117,110,100,101,102,105,110,101,100>>,1472628,0,0,3}},

Note the blockchain_txn_info_v1_pb record at the end and its height value of 1472628, which means the deferred txns were processed with a state height of 1472629 whilst there was a deferred txn with a reject height of 1472628. This should never happen.

However looking through the logs the root cause is clear:

2022-08-10 18:51:20.057 [debug] <0.19661.1698>@blockchain_txn_mgr:handle_add_block_event:{512,13} received block height: 1472627, updated state block height: 1472626 2022-08-10 18:51:52.803 [debug] <0.19661.1698>@blockchain_txn_mgr:handle_add_block_event:{512,13} received block height: 1472628, updated state block height: 1472626 2022-08-10 18:51:58.036 [debug] <0.19661.1698>@blockchain_txn_mgr:handle_add_block_event:{512,13} received block height: 1472629, updated state block height: 1472629

Here we see some add block events prior to the crash.

In the log extracts, the received block height is the height of the current block being processed ( it may or may not be a sync block )

The updated state block height: is that of the txn mgr state height. This state height is buggy and returning inconsistent values.

It is this state height which was being used by process_deferred_rejections and in the above events it was never called with height 1472627 and 1472628. This resulted in rejections with those heights being left in the deferred list which then broke the assumption on line 536 resulting in the crash.

The fix here is to always call process_deferred_rejections with the block height and not the State height and then to be overly cautious remove the assumption on line 536 and allow rejections with a height less than that of the txn mgr height. Any such rejections encountered will be acted up in the same way we act on the Current list