Bridge continiously sending transactions with 'eth_sendTransaction timed out' message during stress testing

akolotov commented 6 years ago

This is the issue is the same as https://github.com/paritytech/parity-bridge/issues/149 but behavior is even worse due to automatic bridge restart implemented in POA bridge.

Network setup:

Home: a PoA testnet (Sokol)
Foreign: Ropsten

There are 1200 deposit transactions sent successfully to HomeBridge contract by a special python script. It took 8 blocks to validate all transactions. https://sokol-explorer.poa.network/block/1418808 ... https://sokol-explorer.poa.network/block/1418815

The bridge discovered part of these transactions, tried to relay some of them, lost the connection and restarted:

INFO:bridge::bridge::deposit_relay: got 7 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 7 deposits
INFO:bridge::bridge::deposit_relay: deposit relay completed
INFO:bridge::bridge::deposit_relay: got 129 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 129 deposits
INFO:bridge::bridge::deposit_relay: deposit relay completed
INFO:bridge::bridge::withdraw_relay: got 0 new signed withdraws to relay
INFO:bridge::bridge::withdraw_relay: fetching messages and signatures
INFO:bridge::bridge::withdraw_relay: fetching messages and signatures complete
INFO:bridge::bridge::withdraw_relay: relaying 0 withdraws
INFO:bridge::bridge::withdraw_relay: relaying withdraws complete
INFO:bridge::bridge::withdraw_relay: waiting for signed withdraws to relay
INFO:bridge::bridge::withdraw_confirm: got 0 new withdraws to sign
INFO:bridge::bridge::withdraw_confirm: signing
INFO:bridge::bridge::withdraw_confirm: signing complete
INFO:bridge::bridge::withdraw_confirm: submitting 0 signatures
INFO:bridge::bridge::withdraw_confirm: submitting signatures complete
INFO:bridge::bridge::withdraw_confirm: waiting for new withdraws that should get signed
INFO:bridge::bridge::deposit_relay: got 113 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 113 deposits
INFO:bridge::bridge::deposit_relay: deposit relay completed
INFO:bridge::bridge::deposit_relay: got 710 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 710 deposits
WARN:bridge: Bridge is down with Request eth_sendTransaction timed out, attempting to restart
WARN:<unknown>: Sending a response to deallocated channel: Ok([Ok(String("0xc47e8427c9eda913b9749bf0904cd8765b39f5448368194cb7aa4330bbe6a44d"))])
WARN:<unknown>: Sending a response to deallocated channel: Ok([Ok(String("0x4d96f7f39c50dd7b19aac9af708bca450fb6c9fb6b72024c7736d9883614845d"))])
...
WARN:<unknown>: Sending a response to deallocated channel: Ok([Ok(String("0x6b16c0ef9e34d65c4431c4e1e8c043564d53cf3ce440d7ea706164838904e209"))])

The database file was not updated so the restart of the bridge thread caused the same error.

INFO:bridge::bridge::deposit_relay: got 951 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 951 deposits
WARN:<unknown>: Sending a response to deallocated channel: Ok([Ok(String("0x9b034a82ebf6306933a35848f9d7b1552ababb8be5774e5501e4802013911a01"))])
WARN:<unknown>: Sending a response to deallocated channel: Ok([Ok(String("0x606b57af0b13dbeddf6b4c16833b0365dace777cc4feee82d2de14c5dde53c45"))])

So, the bridge is continuing to send transactions forever.

Even if the bridge process is killed manually, it is necessary to do manual modification of database but it causes lock of funds on HomeBridge contract side since incomplete amount of tokens is transfered by ForeignBridge contract.

akolotov commented 6 years ago

The first of transactions arrived on Ropsten: https://ropsten.etherscan.io/tx/0x42d204dbf87b8b53c00a1ce11c79ed6eee91e189ebb1106211db7f24ba20f4b6.

The last transaction: https://ropsten.etherscan.io/tx/0xb53ea50e62bfcfda6ba1441d243a4df56f472b2e7d9b895943dbd8dcb8257d47.

Totally it was about 17000 txs sent by the bridge till it was killed.

yrashk commented 6 years ago

On it

akolotov commented 6 years ago

config.toml - https://gist.github.com/akolotov/5c68b9438c991401df12450ad569a4f2

yrashk commented 6 years ago

Thank you Do you have a script that reproduces this issue?

akolotov commented 6 years ago

https://github.com/poanetwork/parity-bridge-research/blob/master/erc20/bridge/contracts/home_batch_deposit.py

akolotov commented 6 years ago

I think the root cause of the issue could relate to hardware on my testbed: it is an ordinary laptop with Intel i5-3317U CPU 1.70GHz, 4 cores, HDD (not SSD) and WiFi connection to Internet. It is not my main workstation - it is dedicated just for tests: centos linux without X.org installed, parity, bridge - no other processes is being run.

yrashk commented 6 years ago

So, from the log above we can see that parity is actually sending a response to those timed out requests (bridge doesn't lose the connection, it simply abandons it after a timeout is experienced). The way bridge is structured it effectively considers timed out transactions to "never happen".

There are perhaps a couple of measures we can take here:

Increase timeout length. This does not require changing bridge code, just the config. In your example config, it is set to 10 seconds. What's the longest reasonable timeout you can think of?
Persist every outgoing transaction in a durable queue and always flush the queue before starting normal operations.
Limit the number of transactions that can be sent out simultaneously to relieve the pressure on Parity and prevent the timeout counter from being started too early (timeout is per-transaction)

yrashk commented 6 years ago

I think the most efficient first step on my end here would be (3). After that, we can do (2).

akolotov commented 6 years ago

I completely agree with you thoughts. #1 is too platform specific and do not provide any guaranty that in some moment a system will not get in a state when the timeout is too short again.

yrashk commented 6 years ago

Would you mind trying https://github.com/yrashk/parity-bridge/commit/304a8431cc1dac4904616178cb451259dc5f9152 out on your hardware setup? This is a first draft. This change limits the size of the batch. The integration test pass. Let me know if this helps or not.

akolotov commented 6 years ago

I have tested the changes with 1K and 2K deposits (transactions). 1K deposits transferred successfully. The original issue was reproduced with 2K deposits.

Here is difference in bridge logs: 1K deposts:

INFO:bridge::bridge::deposit_relay: got 353 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 353 deposits
INFO:bridge::bridge::deposit_relay: deposit relay completed
INFO:bridge::bridge::deposit_relay: got 300 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 300 deposits
INFO:bridge::bridge::deposit_relay: deposit relay completed

2K deposts:

INFO:bridge::bridge::deposit_relay: got 710 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 710 deposits
WARN:bridge: Bridge is down with Request eth_sendTransaction timed out, attempting to restart
WARN:<unknown>: Sending a response to deallocated channel: Ok([Ok(String("0x6c881ae94549f05a09088e1ed017324eb4909e9957d61f5ad9a4967a2ac04ad3"))])

So, the bridge combined two sequential blocks in one batch for the second test.

It means that it is makes sense to understand why Parity behaves differently in these two cases. Most probably we will see a proper fix for the issue in that case.

yrashk commented 6 years ago

Do we still experience this issue in any severe form?

akolotov commented 6 years ago

Did not test it with new version of bridge supporting RPC. Are you able to do generate traffic (dozen of transaction in one block) and test it by yourself?

omni / poa-bridge

Bridge continiously sending transactions with 'eth_sendTransaction timed out' message during stress testing #33