lukechampine / us

An alternative interface to Sia
MIT License

Streaming Merkle root computation in the Renter-Host Protocol #29

Open lukechampine opened 4 years ago

lukechampine commented 4 years ago

Currently, we encrypt all messages in the RHP with ChaCha20-Poly1305. This means that sector data is encrypted twice. The encryption itself adds some overhead, but not enough to worry about. However, treating the sector as an opaque message introduces a bottleneck in the protocol: we can't start computing the Merkle root of the sector until we've received and verified the entire message!

This affects both the renter and the host: when uploading, the host needs to wait for the entire sector to arrive before it can compute its Merkle root; when downloading, the same applies to the renter. Imagine you have 1 Gbps of bandwidth, and computing a Merkle root takes 20ms. The total transfer time for a sector is now (4 MiB / 1 Gbps = 33ms) + 20ms = 53ms, which means your throughput is 4 MiB / 53ms = 633 Mbps. But if the Merkle root could be computed in a streaming fashion as the sector arrived, the total transfer time would be much closer to 33ms. So we're looking at a potential 58% speed-up here!
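To make the arithmetic above concrete, here's a small sketch of the throughput model. The function name and structure are illustrative only, not code from the repo:

```go
package main

import "fmt"

// effectiveThroughputMbps models the non-streaming case: the full sector must
// arrive before the Merkle root is computed, so the hash time adds directly
// to the total transfer time. Passing hashMs = 0 approximates the streaming
// case, where hashing fully overlaps the transfer.
func effectiveThroughputMbps(sectorBytes, linkMbps, hashMs float64) float64 {
	sectorMbits := sectorBytes * 8 / 1e6
	transferMs := sectorMbits / linkMbps * 1e3
	return sectorMbits / (transferMs + hashMs) * 1e3
}

func main() {
	const sector = 4 << 20 // 4 MiB
	noStream := effectiveThroughputMbps(sector, 1000, 20) // hash after transfer
	stream := effectiveThroughputMbps(sector, 1000, 0)    // hash overlapped with transfer
	fmt.Printf("non-streaming: %.0f Mbps\n", noStream)
	fmt.Printf("streaming:     %.0f Mbps\n", stream)
	fmt.Printf("speed-up:      %.2fx\n", stream/noStream)
}
```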

But -- how do we accomplish this? At first I thought we would need to add a new RPC that sends the sector un-encrypted. But on further thought, I think we can actually pull it off without changing the protocol at all! Here's why. ChaCha20-Poly1305 just means encrypting with ChaCha20 (a stream cipher) and appending a poly1305 tag. You don't need to verify the tag before decrypting the data; you're just supposed to. In our case, we're not doing anything dangerous with the plaintext; we're just computing its Merkle root. So it's fine if we do a streaming decryption, pipe that into our Merkle root function, and verify the tag in parallel. (Actually, I think verifying the tag may be unnecessary too, since the root itself serves as a tag.)

MeijeSibbel commented 4 years ago

That's a novel idea and definitely worth exploring. Any speed-up during transfers is very welcome.

jkawamoto commented 4 years ago

Looks nice!

I also agree that we don't need to verify the Poly1305 tag as long as verifying the Merkle root succeeds.

lukechampine commented 4 years ago

Some initial results on this:

I implemented "sector streaming" for downloads, since it was slightly easier than uploads. To my dismay, benchmarks showed no significant increase in throughput: when downloading from a single host on the same machine, downloads run at about 800 Mbps with or without streaming.

I dug a little deeper, and realized that, since the benchmark was conducted locally, the effective network bandwidth was 6 Gbps! At that speed, computing the Merkle root in parallel will only save a few milliseconds. So, to more accurately simulate real network conditions, I introduced artificial slowdowns in the connection. I tested artificial speeds of 100 Mbps, 500 Mbps, 1 Gbps, and 2 Gbps. Results are summarized below:

|              | 100 Mbps | 500 Mbps | 1 Gbps   | 2 Gbps   | 6 Gbps   |
|--------------|----------|----------|----------|----------|----------|
| No streaming | 64 Mbps  | 240 Mbps | 384 Mbps | 448 Mbps | 800 Mbps |
| Streaming    | 72 Mbps  | 344 Mbps | 672 Mbps | 752 Mbps | 800 Mbps |

This table is much more gratifying. We can see that sector streaming is about 1.5x faster than non-streaming, except at the extremes. This makes sense: if the network is very slow, then the Merkle root overhead (~20ms) does not add much to the total transfer time. Conversely, if the network is very fast, then the Merkle root begins to dominate the total transfer time, making it irrelevant whether we stream or not.

Next, I will implement sector streaming for uploads. I expect to achieve similar results there; currently, on a 1 Gbps connection, the maximum upload speed is about 330 Mbps (as evidenced by local testing with ghost), so I estimate that we could increase that to about 600-700 Mbps with sector streaming.

lukechampine commented 4 years ago

It occurred to me that with this change, hashing is now the last computational bottleneck remaining. With a faster hash function, we should be able to dramatically increase maximum throughput. So for fun, I swapped out our BLAKE2b Merkle root calculation with calls to XChaCha20, which should be about the same speed as SIMD-optimized BLAKE3. Consequently, I was able to achieve download speeds of 2400 Mbps. 🔥

Unfortunately, we can't switch to BLAKE3 without breaking compatibility, and even if we wanted to, there's no SIMD-optimized implementation for Go. It might, however, be possible to further optimize BLAKE2b for our purposes. It won't be nearly as fast, but I imagine we could get a 2x-4x speedup in hashing, which should be enough to get us over 1 Gbps per host.

MeijeSibbel commented 4 years ago

1 Gbps per host sounds amazing and is probably fast enough, since most hosts don't have home connections faster than 1 Gbps. That said, what has to change on the host side to be able to get that speed? Currently the host code limits throughput significantly (~40 Mbps). I'm also wondering what other latency effects might severely impact this optimization.

lukechampine commented 4 years ago

Hosts are currently limited by their own non-streaming Merkle root computations, as well as the 500ms sync interval that we discovered previously. It will likely be easier to achieve 1 Gbps when downloading, as there are no fsyncs required and (if the renter is requesting a full sector) no Merkle proof needs to be computed. I'll try benchmarking it later.

lukechampine commented 4 years ago

Update: I decided to try my hand at writing optimized assembly to compute Merkle roots. After much wailing and gnashing of teeth, I have emerged from the darkness with an AVX2 implementation that can compute the Merkle root of a sector in just 2 milliseconds -- 10x faster than Sia today.

There's just one catch...my implementation computes the BLAKE3 Merkle root, not the BLAKE2b Merkle root. Sorry. BLAKE3, for various reasons, is easier to write assembly for than BLAKE2b. However, BLAKE3 and BLAKE2b have very similar designs (big surprise), so I now feel more confident that I could write an optimized BLAKE2b Merkle root function. BLAKE2b is slower than BLAKE3, so I probably won't get it down to 2 milliseconds, but it'll be closer to 2 than 20.

lukechampine commented 4 years ago

Success!! I now have an AVX2 BLAKE2b Merkle root implementation that runs in 5ms, a full 4x speedup over the status quo. As expected, not quite 2ms, but close enough. 🔥 🔥 🔥

The crazy thing is that this function actually runs faster than the following code:

```go
blake2b.Sum256(sector[:])
blake2b.Sum256(sector[:])
```

i.e. running the stdlib blake2b function (which is also AVX2-optimized) on the same amount of data that the Merkle root function has to process (2x the tree size). This is possible because when computing Merkle roots, we know that 63 bytes of each 128-byte block will be 0, so we can skip them entirely.

Another nice thing is that my implementation is written to scale with the size of the SIMD vectors. So it would not be too arduous to write an AVX-512 version that yields yet another 2x speedup (i.e. ~2.5ms).

The only bummer is that this is platform-specific code, so it will only benefit users running amd64. People running Sia on ARM or other archs won't see any speedup, at least until someone ports the SIMD code to their platform.

I'll open a PR to merge this sometime next week. 🎉