nanocurrency / nano-node

Nano is digital currency. Its ticker is: XNO and its currency symbol is: Ӿ
https://nano.org
BSD 3-Clause "New" or "Revised" License
3.47k stars 785 forks source link

RPC Callback sometimes out of order #1917

Open Srayman opened 5 years ago

Srayman commented 5 years ago

Description of bug:

When 2 transactions are processed by the node at the same time the RPC sometimes sends them in the wrong order. I've seen this happen multiple times but recently confirmed with another node operator that they saw the same issue for the blocks in this example.

Steps to reproduce the issue:

  1. 2 blocks are processed at the same time (ie. node local_timestamp is the same for both)
  2. RPC Callback is configured to process the transactions as they are provided. In my case, logged to a file or displayed on a visualization, it's apparent they were sent out of order.

Describe the results you received: These 2 block hashes came out of order from the account chain they belong too.

First F430D98F7E9916E2BEAF3D113951C31DD8B3B0765CBDF94E5636C263733BE013 Second 32505B6E66A688BB712FC1F7F43A31561645BB1BDF9D628411C439430CF2C50E

However the "second" hash is the "previous" of the first so it should have arrived first.

I checked with @meltingice to compare the timestamps he collects on his server for the callbacks and he recorded the following showing that his callback received them out of order also.

127.0.0.1:6379> get "block_timestamp/F430D98F7E9916E2BEAF3D113951C31DD8B3B0765CBDF94E5636C263733BE013"
"1556065934419"
127.0.0.1:6379> get "block_timestamp/32505B6E66A688BB712FC1F7F43A31561645BB1BDF9D628411C439430CF2C50E"
"1556065934420"

Describe the results you expected: Below is what I see from the blocks_info RPC and the height is correct so I would expect to see the blocks returned based on the height order with oldest first and newer second.

`{"blocks":{"32505B6E66A688BB712FC1F7F43A31561645BB1BDF9D628411C439430CF2C50E":{"block_account":"xrb_1r98z5i14ztss597ajy3y9no5kpm1bm3uj1gam4gstjhhy3othmus1mamxyz","amount":"80000000000000000000000000","balance":"60154423999586177472984088415927","height":"122183","local_timestamp":"1556065932","contents":"{\n \"type\": \"state\",\n \"account\": \"xrb_1r98z5i14ztss597ajy3y9no5kpm1bm3uj1gam4gstjhhy3othmus1mamxyz\",\n \"previous\": \"5AE0FC0CD5A388BCF8F9FEF0AC6DD32DFB3B3B8A2780B317D1E8C4A0E5F38455\",\n \"representative\": \"xrb_1r98z5i14ztss597ajy3y9no5kpm1bm3uj1gam4gstjhhy3othmus1mamxyz\",\n \"balance\": \"60154423999586177472984088415927\",\n \"link\": \"176AED0C125E8006139E636DCA686B6DE26E8A1AAEA155AAD34F009D24B9F23F\",\n \"link_as_account\": \"xrb_17ucxn836qn11rbswrufsbn8puh4ft73odo3cpof8mr1mnkdmwjzxz5ogghn\",\n \"signature\": \"0D77F439367B82891A7D958F5604296C461983B141C4C147BB1B9EA8D9C7911E84DF6C860EDDAB49FE7F9403B1BD4831AA53F87856C4A49B013914C98F41A408\",\n \"work\": \"c121b0fda160aed8\"\n}\n"},

"F430D98F7E9916E2BEAF3D113951C31DD8B3B0765CBDF94E5636C263733BE013":{"block_account":"xrb_1r98z5i14ztss597ajy3y9no5kpm1bm3uj1gam4gstjhhy3othmus1mamxyz","amount":"80000000000000000000000000","balance":"60154343999586177472984088415927","height":"122184","local_timestamp":"1556065932","contents":"{\n \"type\": \"state\",\n \"account\": \"xrb_1r98z5i14ztss597ajy3y9no5kpm1bm3uj1gam4gstjhhy3othmus1mamxyz\",\n \"previous\": \"32505B6E66A688BB712FC1F7F43A31561645BB1BDF9D628411C439430CF2C50E\",\n \"representative\": \"xrb_1r98z5i14ztss597ajy3y9no5kpm1bm3uj1gam4gstjhhy3othmus1mamxyz\",\n \"balance\": \"60154343999586177472984088415927\",\n \"link\": \"93C9AF4720096C46879EB0C7E96A65C7B098F28A110DC5B38E0E19A5CAEAA6C3\",\n \"link_as_account\": \"xrb_36ybox5k14deat5sxe89x7o8djxim5san6afrpsrw5isnq7gobp5o7157d91\",\n \"signature\": \"2DE6706226429FED09193A2F78736A79DB0F6F9DB0692925C1D6407E25FDA35C390660A7C3EDB4BDD3CD89BF578F50BA1C86A978B6CAB28D91421E30DE1F6804\",\n \"work\": \"f33b0d0e87fb735f\"\n}\n"}}}`

Additional information you deem important (e.g. issue happens only occasionally): This only seems to happen with higher transaction volume when the local timestamps could overlap.

Environment:

logs I did find in the node logs that both transactions were rebroadcasted to peers in the correct order.

[2019-04-23 20:32:12.671894]: Block 32505B6E66A688BB712FC1F7F43A31561645BB1BDF9D628411C439430CF2C50E was republished to peers
[2019-04-23 20:32:12.672895]: Block F430D98F7E9916E2BEAF3D113951C31DD8B3B0765CBDF94E5636C263733BE013 was republished to peers
zhyatt commented 5 years ago

With the callback being triggered immediately after confirmation there may not be any good way to guarantee order with the RPC callback with the random nature of network propagation. In the case above it is likely that after republishing these blocks to peers ~1ms apart, they were nearly simultaneously confirmed but the 2nd block votes got routed more quickly than the 1st one. Or perhaps some other combination of network propagation differences resulted in the node seeing confirmation for the 2nd block first, despite the fact that nodes won't confirm the 2nd one until they have confirmed the 1st one. Let me know if I am misunderstanding the issue here, but I don't think this is scenario is unexpected.

With that said, it seems like despite the title pointing to an issue with RPC callback order, it really is an issue with ordering of blocks_info RPC command ordering, sometimes not showing the frontier as the first block due to matching timestamps. Perhaps returning this descending based on block height makes sense.

Srayman commented 5 years ago

I provided the blocks_info details just for information, I don’t know if there is an issue with that RPC since I haven’t looked at it closely, I assume it would return the details in the same order the hashes are provided in the request, which would be fine and expected.

The issue is with the callback. You stated that the node won’t confirm blocks out of order, so would that imply that it should wait for confirmation on the first block, process the callback and then get confirmation on the 2nd block? I’m not very familiar with how the confirmation/callback process works technically.

Is there any type of array or batch where the callbacks are processed in a group? Or could the 0.5 second delay in vote by hash that’s being fixed in v19 be a factor where it might cause groups of blocks to be processed at the same time?

I mostly don’t understand how 2 blocks from the same account chain could be confirmed at the exact millisecond. I could see 2 blocks from different accounts running into this but considering the order of a single chain seems like it shouldn’t happen in that case.

zhyatt commented 5 years ago

@SergiySW Given the conditions above, is this expected behavior? I could see the argument that the callback should trigger in order (earlier block first) even if confirmation for the latter block was seen first because it should recognize the earlier block in active transactions and consider it confirmed by extension, due to V18 PR: https://github.com/nanocurrency/nano-node/pull/1438

clemahieu commented 5 years ago

Callbacks can come out of order for a lot of reasons, the callback is dispatched on a background thread and those could come out of order as decided by the operating system. Network packet loss could stall connections while data is being retransmitted, and on the receiving end calls may be serviced out of order even though this may not be happening in your case.

The only way to guarantee end-to-end callback ordering would be to hook ledger confirmation and hold up transaction processing until the callback is complete.

We wouldn't do this with http but we might provide this with IPC. Keep in mind that this could be a large performance hit if callbacks aren't serviced quickly.

Srayman commented 4 years ago

Going back through old issues, could this be related to confirmation height being set for "previous" blocks after the frontier has been confirmed? Could that maybe explain how the confirmation timestamp is slightly later on the "previous" block than the later block? I haven't looked into the code to see when inactive quorum blocks are broadcast through the callback and/or websocket but curious if that could be the cause.

wezrule commented 4 years ago

The confirmation height processor has been redesigned #2531, I did find some irregularities in the callback order when doing it. Would be good if you could try again when the next DB is out, thanks. And/or wait until https://github.com/nanocurrency/nano-node/pull/2544 makes it in too.

Srayman commented 4 years ago

Sounds great! I'll try to setup some test cases to see if I can easily recreate it on V20 and compare to V21 once it's released.

zhyatt commented 4 years ago

@wezrule Would #2730 also impact this ordering?

wezrule commented 4 years ago

So the ordering of callbacks per chain is unaffected. What that PR does is process the confirmed blocks in the order that they are added to the conf height processor, which for the most part should be the order they are confirmed by the network, but due to background thread processing among other things as Colin alluded to, this might not always be the case. But it should generally help yeh