ipfs-shipyard / ipfs-npm-registry-mirror

Clone the NPM registry into IPFS

What would it look like to use rabin encoded tar files instead of tarballs? #6

Open mikeal opened 5 years ago

mikeal commented 5 years ago

I brought this up on the JS Core meeting today.

The problem with tarballs is that, even if the data inside them is similar, it is never de-duplicated. This led me to explore what it might look like to use rabin chunking to store package tar files instead of the compressed tarballs.

https://github.com/mikeal/ipfs-npm-rabin-test

The code here simulates creating both graphs (rabin and tarball) for a single package and then compares them. It uses an implementation of the new unixfsv2, which uses dag-cbor, so it isn't exactly the same as the current IPFS implementation, but it is close enough for investigation.
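For anyone unfamiliar with the idea, here is a minimal sketch of content-defined chunking in Python. It is not the code from the repo above and it is not a real Rabin fingerprint: a simple polynomial rolling hash stands in for it, and `WINDOW`, `MASK`, `BASE`, `MOD`, `chunk` and `digests` are all names and values made up for illustration. It only shows why two nearly identical byte streams can share most of their blocks when boundaries are chosen from the content.

```python
import hashlib
import random

# Illustrative parameters only (assumptions, not IPFS defaults).
WINDOW = 16            # bytes covered by the rolling hash
MASK = (1 << 6) - 1    # cut when the low 6 bits of the hash are zero
BASE = 257
MOD = (1 << 31) - 1


def chunk(data):
    """Split data at boundaries chosen by a rolling hash of the last WINDOW bytes."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= WINDOW:
            # Drop the oldest byte so the hash only covers the current window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        # Enforce a minimum chunk size of WINDOW bytes, then cut on a hash match.
        if i + 1 - start >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


def digests(data):
    """Set of SHA-256 digests of the chunks, standing in for block identifiers."""
    return {hashlib.sha256(c).hexdigest() for c in chunk(data)}


# Two "versions" of the same content: v2 has a small insertion near the end.
rng = random.Random(0)
v1 = bytes(rng.randrange(256) for _ in range(4096))
v2 = v1[:3000] + b"small patch" + v1[3000:]

shared = digests(v1) & digests(v2)
print("chunks in v1:", len(digests(v1)))
print("chunks in v2:", len(digests(v2)))
print("shared chunks:", len(shared))
# Hashing each version as one opaque blob shares nothing between them.
print("whole-blob hashes equal:", hashlib.sha256(v1).digest() == hashlib.sha256(v2).digest())
```

Every chunk that ends before the insertion point is byte-for-byte identical in both versions, so those blocks only need to be stored once. A compressed tarball of each version is a single opaque blob, which is why nothing is shared there.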

In short, using rabin-encoded tar files is mostly slower and larger.

The problem is, whatever you save on deduplication is lost in the lack of compression. I ran a test on my request package to get some preliminary numbers.

The average difference between one version and the next is 104749 bytes in rabin, while the average tarball is only 62525 bytes. On average, there are about 8 more blocks in the rabin encoding as well, which right now would make this much slower.

The total size of the rabin graph is 23677255 compared to only 7815709 for the tarball graph. So, even in the aggregate with all the savings from deduplication in every release, it doesn't make up for the difference in compression.
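To put those numbers in proportion, here is a quick back-of-the-envelope calculation in Python. It uses only the four figures quoted above; the variable names are made up, and the "break-even" line is just the inverse ratio, assuming (hypothetically) that the rabin blocks could be compressed after chunking.

```python
# Figures quoted above for the request package.
rabin_total = 23_677_255      # total size of the rabin graph
tarball_total = 7_815_709     # total size of the tarball graph
rabin_avg_delta = 104_749     # average new bytes per release with rabin
tarball_avg = 62_525          # average tarball size per release

total_ratio = rabin_total / tarball_total
per_release_ratio = rabin_avg_delta / tarball_avg
# Fraction the rabin graph would have to shrink to in order to match the tarballs.
break_even = tarball_total / rabin_total

print(f"rabin graph is {total_ratio:.2f}x the tarball graph")
print(f"rabin adds {per_release_ratio:.2f}x as many bytes per release")
print(f"break-even compression ratio: {break_even:.2f}")
```

So the rabin graph is roughly three times larger overall, and each release costs around two thirds more than a fresh tarball.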

That doesn't mean this is a dead end. It just means that several other things would need to happen in order for this to be better/faster.

All of this is currently under discussion, but it means that a lot of stuff needs to line up in order for this to be worth it.

achingbrain commented 5 years ago

This is great research, thanks for bottoming it out

achingbrain commented 5 years ago

Also completely not what I would have predicted!

achingbrain commented 5 years ago

@mib-kd743naq this is the issue I mentioned at the IPFS meetup at FOSDEM when we were talking about storing npm packages.