akolotov opened this issue 6 years ago
The first of the transactions arrived on Ropsten: https://ropsten.etherscan.io/tx/0x42d204dbf87b8b53c00a1ce11c79ed6eee91e189ebb1106211db7f24ba20f4b6.
The last transaction: https://ropsten.etherscan.io/tx/0xb53ea50e62bfcfda6ba1441d243a4df56f472b2e7d9b895943dbd8dcb8257d47.
In total, about 17000 txs were sent by the bridge before it was killed.
On it
Thank you. Do you have a script that reproduces this issue?
I think the root cause of the issue could be related to the hardware of my testbed: it is an ordinary laptop with an Intel i5-3317U CPU @ 1.70GHz (4 cores), an HDD (not an SSD), and a WiFi connection to the Internet. It is not my main workstation; it is dedicated to tests: CentOS Linux without X.org installed, Parity, and the bridge, with no other processes running.
So, from the log above we can see that Parity actually sends a response to those timed-out requests (the bridge doesn't lose the connection; it simply abandons it once the timeout is hit). The way the bridge is structured, it effectively treats timed-out transactions as if they "never happened".
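To make the failure mode concrete, here is a minimal, hypothetical sketch (not the bridge's actual code) of the race described above: the client gives up after its timeout, but the work it requested (e.g. `eth_sendTransaction`) still completes on the node side, and the late result is delivered to no one.

```python
# Hypothetical sketch: a client that abandons an RPC call on timeout,
# while the node still finishes the work and produces a response.
import concurrent.futures
import time

def slow_rpc_call():
    """Stands in for Parity processing eth_sendTransaction slowly."""
    time.sleep(0.5)
    return "0x6c881ae9..."  # illustrative tx hash; the real one is in the log

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_rpc_call)
    try:
        # The client only waits 0.1 s, then abandons the request.
        tx_hash = future.result(timeout=0.1)
        client_saw_result = True
    except concurrent.futures.TimeoutError:
        client_saw_result = False

# The transaction was still sent; the client just never recorded it.
late_result = future.result()
```

The `Sending a response to deallocated channel` warning in the log is the node-side view of exactly this situation: a response with a valid tx hash arrives after the bridge has already torn down the request.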
There are perhaps a couple of measures we can take here:
I think the most efficient first step on my end here would be (3). After that, we can do (2).
I completely agree with your thoughts. #1 is too platform-specific and does not provide any guarantee that at some moment the system will not again get into a state where the timeout is too short.
Would you mind trying https://github.com/yrashk/parity-bridge/commit/304a8431cc1dac4904616178cb451259dc5f9152 out on your hardware setup? This is a first draft. This change limits the size of the batch. The integration tests pass. Let me know if this helps or not.
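The idea behind a batch-size limit can be sketched as follows (names and the limit value are illustrative, not taken from the commit): instead of relaying every pending deposit in one burst, cap how many go into a single relay round so the node can answer within the RPC timeout.

```python
# Illustrative sketch of bounded batching; MAX_BATCH_SIZE is an assumed
# value for demonstration, not the commit's actual limit.
MAX_BATCH_SIZE = 500

def into_batches(deposits, max_batch_size=MAX_BATCH_SIZE):
    """Split the pending deposits into batches no larger than max_batch_size."""
    return [deposits[i:i + max_batch_size]
            for i in range(0, len(deposits), max_batch_size)]

# 710 pending deposits (the size that failed in testing) would become
# two bounded batches instead of one oversized burst.
batches = into_batches(list(range(710)))
```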
I have tested the changes with 1K and 2K deposits (transactions). The 1K deposits were transferred successfully; the original issue was reproduced with 2K deposits.
Here is the difference in the bridge logs. 1K deposits:
INFO:bridge::bridge::deposit_relay: got 353 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 353 deposits
INFO:bridge::bridge::deposit_relay: deposit relay completed
INFO:bridge::bridge::deposit_relay: got 300 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 300 deposits
INFO:bridge::bridge::deposit_relay: deposit relay completed
2K deposits:
INFO:bridge::bridge::deposit_relay: got 710 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 710 deposits
WARN:bridge: Bridge is down with Request eth_sendTransaction timed out, attempting to restart
WARN:<unknown>: Sending a response to deallocated channel: Ok([Ok(String("0x6c881ae94549f05a09088e1ed017324eb4909e9957d61f5ad9a4967a2ac04ad3"))])
So, in the second test the bridge combined two sequential blocks into one batch.
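A hypothetical sketch of how that can happen (not the bridge's actual code): if the relay polls for deposit events over a block *range* (from the last processed block up to the latest one), and two blocks arrive between polls, their events are fetched together as one batch.

```python
# Illustrative sketch: polling a block range merges events from all
# unprocessed blocks into a single batch.
def collect_new_deposits(deposits_by_block, last_processed, latest):
    """Gather all deposit events from the unprocessed block range."""
    batch = []
    for block in range(last_processed + 1, latest + 1):
        batch.extend(deposits_by_block.get(block, []))
    return batch

# Illustrative per-block counts (the real split behind the 710-deposit
# batch is not known from the logs): two blocks merge into one batch.
two_blocks = {101: ["d"] * 400, 102: ["d"] * 310}
merged = collect_new_deposits(two_blocks, last_processed=100, latest=102)
```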
This means it makes sense to understand why Parity behaves differently in these two cases; most probably that will point us to a proper fix for the issue.
Do we still experience this issue in any severe form?
I did not test it with the new version of the bridge supporting RPC. Are you able to generate traffic (dozens of transactions in one block) and test it yourself?
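For reference, generating such traffic could look roughly like this (a sketch only, with placeholder addresses; this is not the script mentioned above): build all the `eth_sendTransaction` JSON-RPC payloads up front, then POST them back to back so they land in the same block.

```python
# Sketch: prepare a burst of eth_sendTransaction JSON-RPC request bodies.
# SENDER and HOME_BRIDGE are placeholders, not real deployment addresses.
import json

SENDER = "0x0000000000000000000000000000000000000001"
HOME_BRIDGE = "0x0000000000000000000000000000000000000002"

def deposit_request(request_id, value_wei):
    """One eth_sendTransaction JSON-RPC request body for a deposit."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_sendTransaction",
        "params": [{
            "from": SENDER,
            "to": HOME_BRIDGE,
            "value": hex(value_wei),
        }],
    }

# A dozen deposits, ready to be POSTed in a tight loop so they share a block.
payloads = [json.dumps(deposit_request(i, 10**16)) for i in range(12)]
```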
This issue is the same as https://github.com/paritytech/parity-bridge/issues/149, but the behavior is even worse due to the automatic bridge restart implemented in the POA bridge.
Network setup:
There were 1200 deposit transactions sent successfully to the HomeBridge contract by a special Python script. It took 8 blocks to validate all of them: https://sokol-explorer.poa.network/block/1418808 ... https://sokol-explorer.poa.network/block/1418815. The bridge discovered part of these transactions, tried to relay some of them, lost the connection, and restarted:
The database file was not updated, so the restart of the bridge thread ran into the same error. As a result, the bridge keeps sending transactions forever.
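The missing piece is persisted progress. A minimal sketch (illustrative, not the bridge's actual database format or field names): if the last processed block were written to the database file after each successful relay round, a restart would resume past the already-relayed deposits instead of replaying them.

```python
# Illustrative checkpoint persistence; the file layout and the
# "checked_deposit_relay" key are assumptions for this sketch.
import json
from pathlib import Path

def save_checkpoint(path, last_processed_block):
    """Persist relay progress so a restart can pick it up."""
    Path(path).write_text(
        json.dumps({"checked_deposit_relay": last_processed_block}))

def load_checkpoint(path, default=0):
    """On restart, resume from the saved block instead of a stale one."""
    p = Path(path)
    if not p.exists():
        return default
    return json.loads(p.read_text())["checked_deposit_relay"]
```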
Even if the bridge process is killed manually, a manual modification of the database is necessary, but this causes funds to be locked on the HomeBridge contract side, since an incomplete amount of tokens is transferred by the ForeignBridge contract.