namecoin / electrum-nmc

Namecoin port of Electrum Bitcoin client.
https://www.namecoin.org/

Parallel blockchain sync can cause timeouts if network is too slow #268


JeremyRand commented 3 years ago

If the network connection between Tor and the WAN is extremely slow, downloading multiple chunks simultaneously can cause each chunk download operation to time out. This causes the initial block download (IBD) to never advance. Some back-of-napkin math:

  1. Chunk 218 is 3559848 bytes when encoded as uncompressed JSON, according to https://github.com/namecoin/electrum-nmc-compression-test/blob/master/compression_results.csv#L219 .
  2. The default timeout for chunk download when a proxy is enabled seems to be NetworkTimeout.Generic.RELAXED, which is 45 seconds.
  3. Electrum-NMC tries to download chunks from 10 servers simultaneously.
  4. So, if the total network speed is less than 3559848 * 10 / 45 = 791077 B/s = 791 kB/s, and the checkpoint is sufficiently far back that Electrum-NMC needs to download at least 10 chunks, we can expect the chunk downloads to time out.
  5. More generally, chunk downloads will start timing out whenever the available network speed drops below 3559848 / 45 = 79108 B/s = 79.1 kB/s per chunk between the checkpoint height and the tip height.
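
For reference, here is the same arithmetic as a quick Python check (the chunk size, timeout, and server count are the figures from the list above):

```python
CHUNK_SIZE_BYTES = 3559848  # uncompressed JSON size of chunk 218
TIMEOUT_SECONDS = 45        # NetworkTimeout.Generic.RELAXED (proxy enabled)
PARALLEL_CHUNKS = 10        # NUM_TARGET_CONNECTED_SERVERS

# Total bandwidth needed to finish 10 simultaneous chunk downloads
# before the 45-second timeout fires.
total_threshold = CHUNK_SIZE_BYTES * PARALLEL_CHUNKS / TIMEOUT_SECONDS
print(round(total_threshold))      # 791077 B/s, i.e. ~791 kB/s

# Equivalent per-chunk requirement.
per_chunk_threshold = CHUNK_SIZE_BYTES / TIMEOUT_SECONDS
print(round(per_chunk_threshold))  # 79108 B/s, i.e. ~79.1 kB/s
```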

Usually, this limit is not reached, because typically the sum of all Tor circuits' bandwidth limits is much higher than this. E.g. it's pretty common for even a single Tor circuit to be capable of handling over 800 kB/s, and using multiple Tor circuits tends to yield much faster speeds than this. However, this does not seem to hold for severely CPU-constrained systems (where Tor's CPU usage is the limiting factor); nor is it likely to hold for cases where unusually slow Tor bridges are in use (e.g. the public Meek bridges tend to be extremely slow, and all circuits must typically go through a single Meek bridge).

When we do reach this limit, the current behavior is basically the worst possible outcome: IBD does not advance at all, and we also saturate the network continuously. So, here's a very simple algorithm that should at least improve the situation (a rough Python sketch follows the list):

  1. When Electrum-NMC boots, set global PARALLEL_IBD_CHUNKS to NUM_TARGET_CONNECTED_SERVERS (currently 10), and also set global OBSERVED_TIMEOUT_COUNT to 0.
  2. Define a function network.pending_downloading_chunks(), which returns len([index for index in network.pending_chunks if "res" not in network.pending_chunks[index]]). This finds the number of chunks that are actively downloading, while ignoring chunks that have already been downloaded but haven't yet been connected (perhaps because they depend on a lower-indexed chunk that isn't downloaded yet).
  3. Prior to requesting a chunk, set local INITIAL_PENDING_CHUNKS to max(PARALLEL_IBD_CHUNKS, network.pending_downloading_chunks()). (The latter argument will be higher if we recently decreased PARALLEL_IBD_CHUNKS; the former argument will be higher if we haven't yet started all the allowable chunk requests.) If network.pending_downloading_chunks() >= PARALLEL_IBD_CHUNKS, sleep and check again before requesting the chunk.
  4. If a chunk request times out, catch the RequestTimedOut exception, and increment OBSERVED_TIMEOUT_COUNT. If OBSERVED_TIMEOUT_COUNT >= 3, set PARALLEL_IBD_CHUNKS to max(1, INITIAL_PENDING_CHUNKS // 2), and reset OBSERVED_TIMEOUT_COUNT to 0. Then re-raise the caught RequestTimedOut exception.
  5. If a chunk request completes successfully, reset OBSERVED_TIMEOUT_COUNT to 0.
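
Here is a rough, untested Python sketch of the above, assuming it lives somewhere in the Network chunk-download path. The module-level globals, the asyncio sleep interval, the import path for RequestTimedOut, and the interface.request_chunk(index) call are all assumptions about how it would be wired in, not a description of existing code:

```python
import asyncio

# RequestTimedOut is the timeout exception named in step 4; the import path
# here is an assumption.
from electrum_nmc.electrum.interface import RequestTimedOut

NUM_TARGET_CONNECTED_SERVERS = 10

# Step 1: globals set at boot.
PARALLEL_IBD_CHUNKS = NUM_TARGET_CONNECTED_SERVERS
OBSERVED_TIMEOUT_COUNT = 0


def pending_downloading_chunks(network):
    """Step 2: chunks actively downloading, ignoring chunks that are
    downloaded but not yet connected (missing a lower-indexed chunk)."""
    return len([index for index in network.pending_chunks
                if "res" not in network.pending_chunks[index]])


async def request_chunk_throttled(network, interface, index):
    """Steps 3-5: throttle parallel chunk requests and back off on timeouts."""
    global PARALLEL_IBD_CHUNKS, OBSERVED_TIMEOUT_COUNT

    # Step 3: remember how much parallelism was in effect when this request
    # started, then wait until we are below the current parallelism limit.
    initial_pending_chunks = max(PARALLEL_IBD_CHUNKS,
                                 pending_downloading_chunks(network))
    while pending_downloading_chunks(network) >= PARALLEL_IBD_CHUNKS:
        await asyncio.sleep(1)

    try:
        result = await interface.request_chunk(index)  # assumed signature
    except RequestTimedOut:
        # Step 4: three consecutive timeouts halve the parallelism
        # (three, so a single malicious server can't slow IBD down).
        OBSERVED_TIMEOUT_COUNT += 1
        if OBSERVED_TIMEOUT_COUNT >= 3:
            PARALLEL_IBD_CHUNKS = max(1, initial_pending_chunks // 2)
            OBSERVED_TIMEOUT_COUNT = 0
        raise

    # Step 5: a successful download resets the timeout counter.
    OBSERVED_TIMEOUT_COUNT = 0
    return result
```

In the real code these would presumably be attributes on the Network instance rather than module globals, but the control flow is the same.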

The high-level effect of this algorithm is that we start with a parallelization factor of 10, but if 3 consecutive timeouts occur before a single successful chunk download (we pick 3 so that a single malicious server can't trick us into slowing down IBD), then we decrease the parallelization factor to 5. If it happens again, we decrease to 2. If it happens again, we decrease to 1.
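
For illustration, the resulting halving schedule under repeated timeout bursts (assuming the starting factor of 10):

```python
parallelism = 10
while parallelism > 1:
    parallelism = max(1, parallelism // 2)
    print(parallelism)
# prints 5, then 2, then 1
```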

The BandwidthRate config option in Tor seems like a passable way to test this in integration tests. It would certainly be better if we could do it with a local ElectrumX, though; https://github.com/mistakster/throttle-proxy looks like a good way to do that (and it avoids wasting the resources of public Tor relays and public ElectrumX servers).
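
For the Tor-based test, a minimal torrc sketch (the 100 KBytes value is just an example chosen to be well below the ~791 kB/s total threshold computed above):

```
# torrc excerpt for a throttled test instance (values are illustrative)
BandwidthRate 100 KBytes
BandwidthBurst 100 KBytes
```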