adlrocha opened this issue 3 years ago
I don't want to detract from your larger point on protocol mismatch, but these numbers must be looked at with some care.
I just tried reading, gzipping, and writing to disk a 20 MB file on my machine (virtualised, under load), and it took 0.08 s for highly redundant data and 0.6 s for random data. This is using -9 to get the worst-case scenario... In reality, you'd probably start with a much lower level for on-the-fly compression.
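For concreteness, a sketch of that kind of quick test (the file path is a placeholder; `gzip.BestCompression` corresponds to `-9`):

```go
package main

import (
	"compress/gzip"
	"fmt"
	"os"
	"time"
)

func main() {
	// Read a ~20 MB test file (placeholder path).
	data, err := os.ReadFile("testfile-20mb.bin")
	if err != nil {
		panic(err)
	}

	out, err := os.Create("testfile-20mb.bin.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	start := time.Now()
	// BestCompression (-9) is the worst case for speed; on-the-fly
	// compression would normally use a much lower level.
	gw, err := gzip.NewWriterLevel(out, gzip.BestCompression)
	if err != nil {
		panic(err)
	}
	if _, err := gw.Write(data); err != nil {
		panic(err)
	}
	gw.Close() // flush the remaining compressed bytes
	fmt.Printf("compress+write took %v\n", time.Since(start))
}
```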
Looking at your plot here, for a ~20 MB file you're reporting an increase of ~20 s in the full-message case and ~6 s in the block-compression case (vs. the uncompressed baseline). The times are not directly comparable, but that's still an order of magnitude difference, and it doesn't seem entirely algorithmic; implementation considerations probably have a huge influence. Since compressed communication is always a trade-off (it spends computational capacity and CPU time to save link capacity and transmission time), it is very sensitive to how fast the implementation is. I don't think a naive implementation allows us to draw definitive conclusions.
Looking into it a little more, it appears the Go standard library gzip is pretty slow (or at least was in 2015). It may be worth quickly benchmarking the gzip implementation in isolation and/or trying alternatives.
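An isolated micro-benchmark would settle it. The sketch below compares the standard library against `github.com/klauspost/compress/gzip`, one commonly cited alternative with an API-compatible writer (named here as an example, not a recommendation):

```go
package gzipbench

import (
	"bytes"
	"compress/gzip" // standard library implementation
	"io"
	"testing"

	kgzip "github.com/klauspost/compress/gzip" // drop-in alternative
)

// benchCompress compresses the same payload once per iteration,
// discarding the output, so only compressor speed is measured.
func benchCompress(b *testing.B, newWriter func(io.Writer) io.WriteCloser) {
	data := bytes.Repeat([]byte("moderately redundant payload "), 1<<15)
	b.SetBytes(int64(len(data)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		w := newWriter(io.Discard)
		if _, err := w.Write(data); err != nil {
			b.Fatal(err)
		}
		w.Close()
	}
}

func BenchmarkStdlibGzip(b *testing.B) {
	benchCompress(b, func(w io.Writer) io.WriteCloser {
		gw, _ := gzip.NewWriterLevel(w, gzip.DefaultCompression)
		return gw
	})
}

func BenchmarkKlauspostGzip(b *testing.B) {
	benchCompress(b, func(w io.Writer) io.WriteCloser {
		gw, _ := kgzip.NewWriterLevel(w, kgzip.DefaultCompression)
		return gw
	})
}
```

Running it with `go test -bench=Gzip -benchmem` reports throughput for each, thanks to `b.SetBytes`.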
The other side of the experiment, of course, is that compression will never look good in an evaluation unless you're measuring send time over a constrained link. I don't know if we have any numbers for average upload speed across IPFS, but using something like 10 Mbps may not be terribly unfair. (I assume this is not currently being done, as the plot says "bandwidth 0", though that doesn't explain why it takes 3 s to transfer a 20 MB file in the baseline case...)
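Back-of-envelope, with every number assumed purely for illustration: at 10 Mbps a 20 MB file costs 16 s on the wire, so even a 2:1 ratio plus the ~0.6 s of CPU measured above comes out far ahead:

```go
package main

import "fmt"

func main() {
	const (
		fileMB   = 20.0 // payload size in MB (assumed)
		linkMbps = 10.0 // upload speed in Mbps (assumed)
		ratio    = 0.5  // compressed/uncompressed size (assumed)
		cpuSec   = 0.6  // compression cost in seconds (rough test above)
	)
	// Ignoring MB-vs-MiB and protocol overhead for a rough estimate.
	plain := fileMB * 8 / linkMbps              // 16.0 s
	gzipped := fileMB*ratio*8/linkMbps + cpuSec // 8.6 s
	fmt.Printf("plain: %.1f s, compressed: %.1f s\n", plain, gzipped)
}
```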
All that being said, there's a good chance even an optimised implementation will not suffice to bridge the gap.
I just realized I left this thread unanswered. A few updates on this compression matter:
- We implemented a new `StreamCompression` strategy in the scope of the RFC that wraps compression into the Bitswap stream, leading to way better performance than the protocol-level `BlockCompression` and `FullCompression` strategies (a rough sketch of the idea follows this list).
- We embedded a `Compression` transport into libp2p (between the `Muxer` and the `Security` layer) so that every stream running over a libp2p node can potentially benefit from the use of compression. This is a non-breaking change, as the `transport-upgrader` has also been updated to enable compression negotiation (so eventually anyone can come up with their own compression and embed it into libp2p seamlessly). Some repos to get started with compression in libp2p:
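To make the stream-level idea concrete, here is an illustrative sketch of wrapping a duplex stream in gzip; it is a simplification under assumed interfaces, not the actual transport code from those repos:

```go
package compstream

import (
	"compress/gzip"
	"io"
)

// compressedStream wraps an existing duplex stream (e.g. a libp2p
// stream) so that writes are gzip-compressed and reads are
// decompressed. Illustrative only.
type compressedStream struct {
	raw io.ReadWriteCloser
	w   *gzip.Writer
	r   *gzip.Reader
}

func newCompressedStream(raw io.ReadWriteCloser) *compressedStream {
	return &compressedStream{raw: raw, w: gzip.NewWriter(raw)}
}

func (c *compressedStream) Write(p []byte) (int, error) {
	n, err := c.w.Write(p)
	if err != nil {
		return n, err
	}
	// Flush so the peer sees data without waiting for Close; this
	// costs some compression ratio but keeps the stream interactive.
	return n, c.w.Flush()
}

func (c *compressedStream) Read(p []byte) (int, error) {
	if c.r == nil {
		// Lazily initialised: gzip.NewReader blocks until the peer
		// has sent the gzip header.
		r, err := gzip.NewReader(c.raw)
		if err != nil {
			return 0, err
		}
		c.r = r
	}
	return c.r.Read(p)
}

func (c *compressedStream) Close() error {
	if err := c.w.Close(); err != nil {
		return err
	}
	return c.raw.Close()
}
```

A likely reason this outperforms the protocol-level strategies is that the compressor sees one long byte stream instead of many small independent messages, so its dictionary stays warm across messages.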
As part of RFC|BB|L203A of the Beyond Bitswap project, we are exploring the use of compression within Bitswap. In our initial explorations we have implemented two main strategies:

- `BlockCompression`: compresses the `RawData` of blocks using GZip.
- `FullCompression`: compresses the full Bitswap message using GZip and adds it to the new `CompressedPayload` field of a new Bitswap message.

Thus, two additional fields have been added to Bitswap's protobuf message: `CompressedPayload`, carrying the compressed payload of the original Bitswap message when `FullCompression` is enabled, and the `CompressionType` flag to signal the compression strategy used.
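In schema terms, the addition looks roughly like this; the field numbers and enum values are made up for illustration and are not the actual ones from the RFC branch:

```proto
syntax = "proto3";

// Sketch: Bitswap message extended with the two new fields.
message Message {
  // ... existing Bitswap fields elided ...

  // Compressed serialisation of the original message; only set
  // when FullCompression is enabled.
  bytes compressedPayload = 6;

  // Signals which compression strategy (if any) was applied, so
  // peers without compression support stay compatible.
  CompressionType compressionType = 7;
}

enum CompressionType {
  NONE = 0;
  BLOCK_COMPRESSION_GZIP = 1;
  FULL_COMPRESSION_GZIP = 2;
}
```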
Initial tests show an approximate 10x overhead from the use of compression compared to Bitswap without compression. We compared our GZip approach to existing GZip HTTP handlers in Go, and the main difference is that the HTTP handlers pipe the compression writer into the HTTP server's response writer, streaming the compressed bytes directly to the client. In our case, we can't use stream compression for several reasons:
- We need to set the `CompressionType` field in the protobuf so we can stay backward compatible. This is not a blocker, because we could easily use a multicodec to signal that the stream is compressed.
- In `go-bitswap`, where the libp2p protocol stream is referenced and written in `message.go`, there is no easy way to pipe the compressed stream to the protocol stream writer (this is probably fixable once we figure out how to avoid the required size prefix; see the sketch below).
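To spell out the size-prefix constraint: the length of a message must go on the wire before the message itself, but the compressed length is only known once compression has finished, which forces full buffering. A std-lib-only sketch (the uvarint framing here mirrors, but is not, go-bitswap's actual framing):

```go
package example

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"io"
)

// writePrefixedCompressed shows why the prefix defeats piping: step 2
// cannot start until step 1 has completely finished.
func writePrefixedCompressed(stream io.Writer, payload []byte) error {
	// 1. Compress fully into memory; we can't touch the stream yet
	//    because we don't know the compressed size.
	var buf bytes.Buffer
	gw := gzip.NewWriter(&buf)
	if _, err := gw.Write(payload); err != nil {
		return err
	}
	if err := gw.Close(); err != nil {
		return err
	}

	// 2. Now the size is known, so the prefix and the buffered body
	//    can finally be written.
	var prefix [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(prefix[:], uint64(buf.Len()))
	if _, err := stream.Write(prefix[:n]); err != nil {
		return err
	}
	_, err := io.Copy(stream, &buf)
	return err
}
```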
As a result of all of these evaluations, we want to open a discussion on the following topics:

- Adding `KeepReading` and `EndOfStream` signals to protocol streams, so we don't have to know message sizes beforehand and can pipe streams (such as what we were trying to do for stream compression).
- Keeping blocks' `RawData` compressed in the datastore, removing the need for "on-the-fly" compression (a sketch follows).
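On that last point, a minimal sketch of what compressed-at-rest storage could look like; the `store` interface here is a hypothetical stand-in, not the real datastore API:

```go
package example

import (
	"bytes"
	"compress/gzip"
	"io"
)

// store is a simplified key-value stand-in for the real datastore
// interface (hypothetical, for illustration only).
type store interface {
	Put(key string, value []byte) error
	Get(key string) ([]byte, error)
}

// gzipStore keeps block RawData gzip-compressed at rest, so blocks
// could be served compressed without any on-the-fly work.
type gzipStore struct {
	inner store
}

func (s *gzipStore) Put(key string, value []byte) error {
	var buf bytes.Buffer
	gw := gzip.NewWriter(&buf)
	if _, err := gw.Write(value); err != nil {
		return err
	}
	if err := gw.Close(); err != nil {
		return err
	}
	return s.inner.Put(key, buf.Bytes())
}

func (s *gzipStore) Get(key string) ([]byte, error) {
	compressed, err := s.inner.Get(key)
	if err != nil {
		return nil, err
	}
	gr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer gr.Close()
	return io.ReadAll(gr) // decompress only when raw bytes are needed
}
```

The interesting variant would be serving the stored compressed bytes directly to peers that negotiated compression, skipping decompression entirely.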