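Calling wdb.rescan over HTTP can sometimes hang the whole node when the wallet runs as a plugin of the node. This happens because wdb.rescan acquires chain.locker and wdb.txLock in the opposite order from the block-processing path: rescan takes wdb.txLock first and then needs chain.locker, while chain.add takes chain.locker first and then needs wdb.txLock. It can happen when the node is adding, removing, or reorging blocks while a rescan is requested. Here is how it can happen, with the locks involved: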
wdb.addBlock can't go forward because wdb.rescan holds wdb.txLock.
chain.add can't finish, and so keeps chain.locker locked, because it's waiting for wdb.addBlock.
wdb.rescan -> chain.scan can't go forward because chain.locker is locked.
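The cycle above can be reproduced in isolation. The following is a minimal sketch, not the node's or wallet's actual code: `Mutex`, `tick`, and `run` are hypothetical helpers, and the two locks only stand in for chain.locker and wdb.txLock. One task takes the locks in the chain.add order, the other in the order given by `rescanOrder`, and a timeout reports whether both tasks finished.

```javascript
'use strict';

// Hypothetical minimal async mutex, standing in for the real locks.
class Mutex {
  constructor() {
    this.tail = Promise.resolve();
  }

  // Resolves with an unlock function once all earlier holders have released.
  lock() {
    let unlock;
    const next = new Promise((resolve) => { unlock = resolve; });
    const ready = this.tail.then(() => unlock);
    this.tail = next;
    return ready;
  }
}

// Yield to the event loop so the other task can take its first lock.
const tick = () => new Promise(resolve => setImmediate(resolve));

// Returns true if both tasks finish, false if they are still stuck after `ms`.
async function run(rescanOrder, ms = 50) {
  const locks = { chain: new Mutex(), tx: new Mutex() };

  async function take(order) {
    const first = await locks[order[0]].lock();
    await tick();
    const second = await locks[order[1]].lock();
    second();
    first();
  }

  const work = Promise.all([
    take(['chain', 'tx']), // models chain.add -> wdb.addBlock
    take(rescanOrder)      // models wdb.rescan -> chain.scan
  ]).then(() => true);

  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(false), ms);
  });

  const finished = await Promise.race([work, timeout]);
  clearTimeout(timer);
  return finished;
}

(async () => {
  console.log('inverted order finishes:', await run(['tx', 'chain']));
  console.log('same order finishes:', await run(['chain', 'tx']));
})();
```

In this toy model the inverted order leaves each task holding one lock and waiting for the other, while a consistent order lets both complete. It only illustrates the lock-order inversion described above and makes no claim about how the actual change is implemented.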
This issue is a side effect of the OOM fix for plugins in https://github.com/bcoin-org/bcoin/pull/932. Previously, the chain would not wait for the wallet to finish addBlock; it could move forward and unlock chain.locker as soon as the chain was done processing that block. Now it waits for the wallet to finish as well. The old behavior, of course, caused an issue when the chain was moving forward much faster than the wallet: the wallet would accumulate a backlog of addBlocks with all relevant block/tx information, eventually causing OOM.
Note that this won't happen if the wallet is running separately as a service. A separate wallet service will experience a backlog instead, because the chain won't wait for HTTP socket events to finish processing. (And maybe it should, but that's a separate issue and a continuation of https://github.com/bcoin-org/bcoin/pull/932.)
Related
Closes #736
Changes
Using wdb.scan directly is deprecated; it no longer sets rescanning to true (it's @private, so no problems there).
wdb.rescan and wdb._rescan (with and without the lock) are the proper places to set rescanning to true.
Initial sync (syncNode) will set rescanning right away, to avoid extra connect block events.
Ahead sync from walletDB (when the wallet was disconnected) will now use ._rescan to ensure .rescanning is set.
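The locked/unlocked split described above can be sketched roughly as follows. This is an illustrative toy, not the wallet's real code: `Lock` and `ToyWalletDB` are hypothetical, the method bodies are assumptions, and resetting the flag in a `finally` block is an assumption as well. It only shows the flag being owned by `rescan`/`_rescan` while `scan` leaves it alone.

```javascript
'use strict';

// Hypothetical tiny lock; lock() resolves with an unlock callback.
class Lock {
  constructor() {
    this.tail = Promise.resolve();
  }

  lock() {
    let unlock;
    const next = new Promise((resolve) => { unlock = resolve; });
    const ready = this.tail.then(() => unlock);
    this.tail = next;
    return ready;
  }
}

// Hypothetical stand-in for the wallet database.
class ToyWalletDB {
  constructor() {
    this.txLock = new Lock();
    this.rescanning = false;
    this.seen = []; // [height, value of `rescanning`] for each scan call
  }

  // Locked entry point: take txLock, then delegate to the lock-free variant.
  async rescan(height) {
    const unlock = await this.txLock.lock();
    try {
      return await this._rescan(height);
    } finally {
      unlock();
    }
  }

  // Lock-free variant for callers that already hold txLock.
  // Assumption: the flag is cleared again once the scan is done.
  async _rescan(height) {
    this.rescanning = true;
    try {
      return await this.scan(height);
    } finally {
      this.rescanning = false;
    }
  }

  // Private helper: does not touch the flag itself.
  async scan(height) {
    this.seen.push([height, this.rescanning]);
    return height;
  }
}

(async () => {
  const wdb = new ToyWalletDB();
  await wdb.rescan(10); // flag is set while scanning
  await wdb.scan(11);   // direct call: flag stays unset
  console.log(wdb.seen);
})();
```

A caller that already holds the lock would use `_rescan`, which still sets the flag, while a direct `scan` call would run without it.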
The diff is messed up because I moved the old wallet-rescan-test to wallet-namestate-rescan-test and used wallet-rescan-test for this one. For a better diffing experience, go through the commits themselves.
coverage: 68.553% (+0.004%) from 68.549% when pulling 29625eb31689b75914d7243664bacfe7a829b540 on nodech:wallet-deadlock-fix into bb7da60ef3ffdce0be6ccb55185cef89268be671 on handshake-org:master.