mempool / electrs

An efficient re-implementation of Electrum Server in Rust
MIT License
74 stars 35 forks source link

Fix overlapping reads issue #19

Closed junderw closed 1 year ago

junderw commented 1 year ago

This fixes the periodical freezing issue on Liquid.

Background

  1. There is a separate thread that loops and takes a write() lock on the AssetRegistry RwLock.
  2. There are various places throughout the REST server code that take read() locks and hold on to them for quite a while.
  3. The way RwLock works in Rust is that as soon as someone tries to take the write lock, anyone who tries to take the read lock after someone tried to take the write lock, those reads will block until the writer got the lock and released the lock.

The issue

  1. There is one spot where a read lock is taken and while that read lock is still held a new read lock is taken recursively.
  2. If the write thread tries to take the write lock between those two recursive reads, the write will wait for the first read to drop, then the second read will wait for the write to drop, but the first read won't drop until the code completes, which it can't because the second read is waiting on the lock....

DEAD LOCK FTW!

It only happens occasionally, since the write lock is only taken every 15 seconds, and the problem only occurs IF the write TAKING is exactly between those two read lock takings, since they are pretty close, it's rare.


Edit: Found a second dead lock (which is probably the real issue for mempool.space, the previous deadlock was only really an issue for blockstream due to their use of the registry endpoint.

The issue

  1. Every time you look up an asset, it calls query.mempool() which returns a RwLockReadGuard. Then later on (within a sub-function called with query passed in) query.mempool() is called again. This causes another possibility for the write() in the main loop (src/bin/electrs.rs) to occur between those two reads, causing the whole main loop to freeze.
  2. The write() in the main loop must coincide with an asset GET request. The write() is called about once every 500 - 1500 ms. (500ms wait for user signal + the time it takes to query bitcoind for new blocks and mempool txs.)
junderw commented 1 year ago

@shesek Might want to take a look at this too.

junderw commented 1 year ago

Tested using nuking script ACK 89e7fbc