ipfs / go-ipfs-chunker

go-ipfs-chunker provides Splitter implementations that chunk data before it is ingested into IPFS
MIT License

buzhash: reduce target size and cutoff size #31

Closed · RubenKelevra closed this 1 year ago

RubenKelevra commented 3 years ago

Fixes https://github.com/ipfs/go-ipfs/issues/7966

welcome[bot] commented 3 years ago

Thank you for submitting this PR! A maintainer will be here shortly to review it. We are super grateful, but we are also overloaded! Help us by making sure that:

Getting other community members to do a review would be great help too on complex PRs (you can ask in the chats/forums). If you are unsure about something, just leave us a comment. Next steps:

We currently aim to provide initial feedback/triaging within two business days. Please keep an eye on any labelling actions, as these will indicate priorities and status of your contribution. We are very grateful for your contribution!

RubenKelevra commented 3 years ago

Hey @Stebalien, the previous settings are way too large to do sensible deduplication.

There's a study linked in https://github.com/ipfs/go-ipfs/issues/7966 which states that 64 KByte is a good choice for deduplication when compression is applied as well.

But that's for highly mixed data and with very limited memory per terabyte of storage.

We don't have such restrictions: users can choose buzhash for chunkable data like ISO images, VM images, and uncompressed tar archives, as well as HTML, CSS, and other text-based files.

With the new buzhash sizes, those should deduplicate very well.

For any compressed or encrypted data format, deduplication isn't possible anyway, so it can be stored in the default 256 K blocks and doesn't need the overhead of smaller blocks.
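For illustration, here is a minimal sketch of how a caller could pick a splitter per data type with this package (the pickSpec helper and its input are hypothetical; the spec strings mirror the ones ipfs add --chunker accepts):

```go
package main

import (
	"fmt"
	"os"

	chunk "github.com/ipfs/go-ipfs-chunker"
)

// pickSpec is a hypothetical helper: already-compressed or encrypted formats
// get plain fixed-size 256 KiB blocks, everything else gets the
// content-defined buzhash chunker so shifted-but-equal regions can deduplicate.
func pickSpec(alreadyCompressed bool) string {
	if alreadyCompressed {
		return "size-262144" // default 256 KiB fixed blocks
	}
	return "buzhash"
}

func main() {
	f, err := os.Open("disk.img") // e.g. an uncompressed VM image
	if err != nil {
		panic(err)
	}
	defer f.Close()

	spl, err := chunk.FromString(f, pickSpec(false))
	if err != nil {
		panic(err)
	}
	for {
		b, err := spl.NextBytes()
		if err != nil { // io.EOF once the input is exhausted
			break
		}
		fmt.Println(len(b)) // one content-defined chunk length per line
	}
}
```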

Stebalien commented 3 years ago

I agree we should change the defaults (although please also consider that we're trading off per-block overhead in terms of access times, network overhead, latencies, etc). However, if we're going to change the defaults, I'd like to let users revert to the old defaults (e.g., to ensure hashes converge with previously imported data).

RubenKelevra commented 3 years ago

I mean, sure, there's some overhead involved - but on small files, this is probably not significant. It's more important that many different versions of SQL dumps, VM images, iso images, etc. properly deduplicate.

Latency on disk shouldn't be an issue, since an average of 8K should still perform pretty well on an SSD. That's what a database usually uses as its block size, so operating systems are already optimized for it. Being able to deduplicate more also means we can cache the same amount of data more efficiently, which might even improve performance, depending on the amount of duplicate data.

We also reduce the transfer size, since data that hasn't changed can be fetched from the local disk.

For large file transfers, we might want to tweak the network settings to think in data size, not individual blocks. This way we would just fetch more blocks containing the same amount of data.
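As a rough sketch of that idea (not existing bitswap code; the byte budget and the helper are assumptions for the example), the request window could be derived from a data-size budget instead of a fixed block count:

```go
package main

import "fmt"

// prefetchWindow is a hypothetical helper: given a byte budget for data in
// flight and the average block size of the file being fetched, it returns how
// many blocks to request ahead, so smaller blocks simply mean a deeper window.
func prefetchWindow(budgetBytes, avgBlockSize int64) int64 {
	if avgBlockSize <= 0 {
		return 1
	}
	n := budgetBytes / avgBlockSize
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println(prefetchWindow(4<<20, 256<<10)) // 16 blocks in flight at 256 KiB
	fmt.Println(prefetchWindow(4<<20, 8<<10))   // 512 blocks in flight at 8 KiB
}
```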


The target size of this new setting is 8K since, as mentioned above, that's what databases use, which means the operating system is optimized for requests of this size.

When the blocks are stored in a database they would be stored sequentially anyway, so there's no real loss in write speed.

In terms of flatfs, we might hit some barriers since we probably do sync after each block has been written. So we might want to recommend turning off sync if there are performance issues.


Regarding old data: Buzhash is fairly new in ipfs and I don't think there are many users, especially since rabin gives a bigger deduplication ratio at the current settings.

I don't think we need backward compatibility here.

Stebalien commented 3 years ago

I mean, sure, there's some overhead involved - but on small files, this is probably not significant. It's more important that many different versions of SQL dumps, VM images, iso images, etc. properly deduplicate.

Probably only matters if you have enough information to make a guess. Please test ipfs cat (downloading from a remote host) on files with small chunks versus files with large chunks over a moderate latency (100ms) connection; you'll notice a large difference.

  1. Deduplication is not a given and killing perf for encrypted files just to help with files that deduplicate well is not an option.
  2. Prefetching is hard. The smaller the blocks, the worse we are at prefetching. ipfs cat currently prefetches 5-15 blocks ahead with bitswap. I welcome patches to improve this.

Setting the target block size to 8KiB will almost certainly reduce throughput on any connection with reasonable latency.
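A back-of-envelope calculation illustrates the point; the 10-block in-flight window and 100 ms RTT below are assumed for the example, not measured values:

```go
package main

import "fmt"

func main() {
	// If a fetcher keeps a fixed number of blocks in flight, throughput is
	// bounded by roughly window * blockSize / RTT, regardless of bandwidth.
	const window = 10 // assumed blocks in flight
	const rtt = 0.1   // seconds (100 ms)
	for _, blockSize := range []float64{8 << 10, 256 << 10} {
		mbps := window * blockSize / rtt / (1 << 20)
		fmt.Printf("%6.0f KiB blocks -> ~%.1f MiB/s ceiling\n", blockSize/(1<<10), mbps)
	}
}
```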

In terms of defaults, people depend on being able to reproduce old hashes/blocks, both to deduplicate against existing content and for security/reproducibility. Given that making this configurable at runtime is so easy, I'm not going to lower the bar here.

I encourage you to read through old issues mentioning "block sizes and chunking" for more context on this issue.

RubenKelevra commented 3 years ago

@Stebalien wrote:

I mean, sure, there's some overhead involved - but on small files, this is probably not significant. It's more important that many different versions of SQL dumps, VM images, iso images, etc. properly deduplicate.

Probably only matters if you have enough information to make a guess. Please test ipfs cat (downloading from a remote host) on files with small chunks versus files with large chunks over a moderate latency (100ms) connection, you'll notice a large difference.

  2. Prefetching is hard. The smaller the blocks, the worse we are at prefetching. ipfs cat currently prefetches 5-15 blocks ahead with bitswap. I welcome patches to improve this. Setting the target block size to 8KiB will almost certainly reduce throughput on any connection with reasonable latency.

Fair enough, but I think we can improve the performance since it's just about latency. It's basically the same issue as TCP with a too-small receive window: we just have to send more data "blindly". There's no reason it has to be (much) slower than larger chunks.

  1. Deduplication is not a given and killing perf for encrypted files just to help with files that deduplicate well is not an option.

True. But there's always the option to use plain 256K blocks (if not even larger) for this type of data. There's basically no point in using a rolling hash for such data to begin with.

So maybe fixed 256K chunks should be the default, and when dealing with files we could switch the chunker based on the file's MIME type if no chunker is set manually by the user?

In terms of defaults, people depend on being able to reproduce old hashes/blocks, both to deduplicate against existing content and for security/reproducibility. Given that making this configurable at runtime is so easy, I'm not going to lower the bar here.

Okay.

I encourage you to read through old issues mentioning "block sizes and chunking" for more context on this issue.

Will do.

RubenKelevra commented 3 years ago

@Stebalien I'll first verify my assumption that such small blocks are necessary for good deduplication with some sample data before I proceed.

I just don't trust this Microsoft paper and still think 4-16 K is pretty much what we need.

But will we actually "snap" to 4K blocks or 512 bytes, or will it output an arbitrary byte length?
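For reference: a content-defined chunker like buzhash doesn't snap to any alignment. It cuts wherever the rolling hash over the last 32 bytes satisfies a mask condition, so chunk lengths are arbitrary byte counts between the min and max bounds. A toy sketch of that decision, with illustrative constants and a stand-in hash rather than this package's actual values:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// Illustrative bounds only; the real chunker has its own min/max/mask values.
const (
	minSize = 4 << 10   // never cut before 4 KiB
	maxSize = 16 << 10  // always cut by 16 KiB
	mask    = 1<<13 - 1 // a 13-bit mask gives a cut roughly every 8 KiB on random data
)

// windowHash stands in for the rolling buzhash: the real chunker updates the
// hash incrementally per byte instead of rehashing the 32-byte window.
func windowHash(w []byte) uint32 {
	h := fnv.New32a()
	h.Write(w)
	return h.Sum32()
}

// nextCut returns the length of the next chunk: the first offset past minSize
// whose window hash has all masked bits zero, capped at maxSize.
func nextCut(data []byte) int {
	n := len(data)
	if n <= minSize {
		return n
	}
	for i := minSize; i < n && i < maxSize; i++ {
		if windowHash(data[i-32:i])&mask == 0 {
			return i // an arbitrary byte offset, not aligned to 512 B or 4 KiB
		}
	}
	if n < maxSize {
		return n
	}
	return maxSize
}

func main() {
	data := make([]byte, 1<<20)
	rand.Read(data)
	fmt.Println("first cut at byte offset", nextCut(data))
}
```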

RubenKelevra commented 3 years ago

@Stebalien I'm finally done - this took longer than expected.

TL;DR:

LMK if there's any use-case I missed that would greatly benefit from deduplication on block-level. :)

ipfs rolling hash test - Sheet1.pdf

/ipfs/bafykbzacedsteoatejnapzdhnhoaiyc2y5wsypl6jsif4pf5uwjsztdsk3owa

RubenKelevra commented 3 years ago

Ah, and fixed-size 8K and 16K chunks are actually pretty great for databases. So in theory we could recommend those for maximum deduplication of database data.

They even outperform rabin-4k-8k-16k, which means knowing the alignment of the data beats simply throwing rolling hashes at everything.
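For example, chunking a database dump with fixed, page-aligned blocks only needs the size splitter from this package; whether 8 KiB actually matches a given database's page size is an assumption here:

```go
package main

import (
	"fmt"
	"os"

	chunk "github.com/ipfs/go-ipfs-chunker"
)

func main() {
	f, err := os.Open("dump.db") // hypothetical file written with 8 KiB pages
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Fixed 8 KiB chunks stay aligned to the database's own page boundaries,
	// so unchanged pages hash to the same block across successive dumps.
	spl := chunk.NewSizeSplitter(f, 8192)
	count := 0
	for {
		if _, err := spl.NextBytes(); err != nil {
			break // io.EOF at end of input
		}
		count++
	}
	fmt.Println("chunks:", count)
}
```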

Stebalien commented 3 years ago

Awesome work! Were you able to test transfer times as well?

RubenKelevra commented 3 years ago

This would have taken too long for all options I've tested. I can run this test, but only for a limited number of cases.

Do you agree with my candidates?

RubenKelevra commented 3 years ago

@Stebalien, I forgot the ping :)

RubenKelevra commented 3 years ago

@Stebalien here's my pin time test:

ipfs rolling hash pin-time - Sheet1.pdf

/ipfs/QmPMsvgjoQHQPvw6vWjgHRvC3YddErgAacg9iGu4YLzQsp

Note that this is the worst-case scenario since this connection is plagued by bufferbloat. Additionally, as the corpus is only one big file, there are only large blocks. In my first test I also had folders with many small files, in which case the results would be closer.

I only tested TCP, to get more reliable results, as UDP is sometimes deprioritized by providers.


Result for me: go with 12k-24k-48k, as it's a fair compromise between small chunks and overhead. The performance penalty for transfers has nothing to do with processing power, just with the missing handling of longer latencies, so it should be fixable.

Stebalien commented 3 years ago
  1. Is that the buzhash with new defaults or buzhash with 256KiB blocks?
  2. What was the latency?
  3. How many blocks?
  4. Which command? ipfs pin add or ipfs cat (both behave differently and will perform differently).

Note: I agree it's "fixable", that's not the problem. The problem is that it needs to be fixed first before we can start using it. Otherwise, we just degrade perf for everyone in the meantime.

Given the numbers I'm seeing, it sounds like 16k-32k-64k is acceptable.

Stebalien commented 3 years ago

(but I'd still need the answers to the rest of those questions to tell) (and this still needs to be configurable)

RubenKelevra commented 3 years ago

@Stebalien wrote:

  1. Is that the buzhash with new defaults or buzhash with 256KiB blocks?

I've tested the latency with git commit which is in the profile of the node on the right.

  2. What was the latency?

The latency of the connection is in the document, once idle and once while downloading (with 256K blocks - which did hit the ceiling on this connection).

  3. How many blocks?

The file size is 742 MB; the average block size determines how many blocks each chunker created, but I can look up the exact numbers if they're important.

  4. Which command? ipfs pin add or ipfs cat (both behave differently and will perform differently).

I've used time ipfs pin add --progress <CID>, then unpinned again and ran an ipfs repo gc between the pin adds.

Note: I agree it's "fixable", that's not the problem. The problem is that it needs to be fixed first before we can start using it. Otherwise, we just degrade perf for everyone in the meantime.

Given the numbers I'm seeing, it sounds like 16k-32k-64k is acceptable

Okay

RubenKelevra commented 3 years ago

(but I'd still need the answers to the rest of those questions to tell) (and this still needs to be configurable)

Well, I think we shouldn't make this configurable. A new and an old buzhash should be enough. If everyone experiments with the settings, we end up with the same situation as with rabin, where basically every dataset has different chunker settings - which doesn't help network-wide deduplication.

I actually think we should try to determine the MIME type of the files added and select static chunks or the smaller buzhash accordingly for the user. Compressed/encrypted or otherwise non-deduplicatable files get large chunks with good performance, and if we encounter HTML/CSS/JS/SQL/XML/etc. we switch to buzhash.

This could be named "adaptive", for example.
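A rough sketch of what such an "adaptive" selection could look like; the adaptiveSpec helper and its MIME groupings are hypothetical, and content sniffing here just uses Go's http.DetectContentType:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// adaptiveSpec is a hypothetical selector: it sniffs the first bytes of a file
// and returns a chunker spec string. Already-compressed media get big fixed
// blocks; text-like content gets the smaller content-defined buzhash chunks.
func adaptiveSpec(head []byte) string {
	mime := http.DetectContentType(head)
	switch {
	case strings.HasPrefix(mime, "text/"),
		strings.Contains(mime, "xml"),
		strings.Contains(mime, "json"):
		return "buzhash"
	case strings.HasPrefix(mime, "image/"),
		strings.HasPrefix(mime, "video/"),
		strings.HasPrefix(mime, "audio/"),
		mime == "application/zip",
		mime == "application/x-gzip":
		return "size-262144" // no point rolling a hash over compressed data
	default:
		return "buzhash"
	}
}

func main() {
	fmt.Println(adaptiveSpec([]byte("<html><body>hello</body></html>"))) // buzhash
	fmt.Println(adaptiveSpec([]byte{0x1f, 0x8b, 0x08}))                  // size-262144 (gzip magic)
}
```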

RubenKelevra commented 3 years ago

To continue this PR I think it makes sense to change buzhash while buzhash-legacy keeps the old chunking parameters. This way users will automatically switch to the new buzhash chunking, while everyone who needs to reproduce a CID from data can use the buzhash-legacy chunker.

How does this sound?
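In code, the proposal amounts to roughly this kind of dispatch (the buzhash-legacy name and the wrapper are only the suggestion from this comment; nothing like it exists in the package today):

```go
package main

import (
	"fmt"
	"io"
	"strings"

	chunk "github.com/ipfs/go-ipfs-chunker"
)

// fromStringWithLegacy sketches the proposed dispatch: "buzhash" would use the
// new, smaller parameters, while "buzhash-legacy" would keep the old chunking
// so existing CIDs stay reproducible. The upstream parser knows nothing about
// "buzhash-legacy", so this only illustrates where the switch would live.
func fromStringWithLegacy(r io.Reader, spec string) (chunk.Splitter, error) {
	switch strings.ToLower(spec) {
	case "buzhash-legacy":
		// would construct buzhash with the pre-change min/target/max here
		return nil, fmt.Errorf("buzhash-legacy: not implemented in this sketch")
	default:
		return chunk.FromString(r, spec)
	}
}

func main() {
	_, err := fromStringWithLegacy(strings.NewReader("hello"), "buzhash-legacy")
	fmt.Println(err)
}
```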

RubenKelevra commented 3 years ago

@Stebalien wrote:

  3. How many blocks?

  • 256k: 3136 blocks, /ipfs/QmQ268dbzaWMLvweiDcRQXmapTUx1PKE5rivPsHPNXvgSY
  • rabin-12k-24k-48k: 30644 blocks, /ipfs/QmefWQeJHY5hN4BSJw7v1gC5m85gwGHAgUMNJ6nJXEFpk2
  • buzhash: 3279 blocks, /ipfs/QmSSrnWmXcnST4jD4qnU9WNKaqEV6WNzKQEKzyYfobe37D
  • rabin-16K-32K-64K: 19805 blocks, /ipfs/QmW7rCHh7QZH8Cqjeoum2f8Gzrh66rKydoT6BgWEhvqHnA

dbaarda commented 2 years ago

Note that I've done extensive testing of chunker sizes here;

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst

And in my tests the optimal min-tgt-max sizes for a given avg chunk size are [avg/2]-[avg/2]-[avg*4]. This is optimal for deduplication with good speed (cut-point skipping). The optimal for speed with still good deduplication is around [avg*2/3]-[avg/3]-[avg*4]. Yes, that is: "min" is greater than "avg" AKA "tgt".

I note you didn't test using settings like this... it would be interesting to see how it compares.

This does NOT match what you are proposing.
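Plugging numbers into that rule of thumb (taking the formulas above at face value):

```go
package main

import "fmt"

func main() {
	// Rule of thumb from above, taken at face value:
	//   dedup-optimal:  min = avg/2,   tgt = avg/2, max = avg*4
	//   speed-optimal:  min = avg*2/3, tgt = avg/3, max = avg*4
	for _, avg := range []int{32 << 10, 64 << 10} {
		fmt.Printf("avg %3dK  dedup: %dK-%dK-%dK   speed: %dK-%dK-%dK\n",
			avg>>10,
			avg/2>>10, avg/2>>10, avg*4>>10,
			avg*2/3>>10, avg/3>>10, avg*4>>10)
	}
}
```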

dbaarda commented 2 years ago

Some observations about the rabin settings you were using in your testing in https://github.com/ipfs/go-ipfs-chunker/files/6819330/ipfs.rolling.hash.test.-.Sheet1.pdf

If you want an average block size of 32K, I'd recommend 16K-16K-128K, and for an average block size of 64K (which is a sweet spot identified by Microsoft in their chunking tests) I'd go with 32K-32K-256K. Though in both cases you can set the max higher if you want. If you want to compare against the default 256K average chunk size, I'd use 128K-128K-1M.

RubenKelevra commented 2 years ago

@dbaarda wrote:

  • Note that buzhash has an interesting corner-case that it always breaks at a run of 32 zeros. Things like tar use runs of zero padding between files so it often nicely breaks at file boundaries within a tar file. However, it also means long runs of zeros, like empty blocks in filesystem images, get turned into lots of min-size blocks.

Okay, but that doesn't really matter, as the CIDs are all the same, right?

So you get like 32391 identical CIDs in a row.

--

Thanks for all the feedback. Def worth going over this again. :)

dbaarda commented 2 years ago

Sorry for the late reply: yes, all the min-size zero blocks have the same CID. The (maybe minor) problem is that this gives you more blocks, and thus the per-block metadata/etc overheads are worse. But that might be insignificant... I dunno.

Also note that buzhash does this not just for zeros. It will return zero for any 32-byte run of any value where the uint8->uint32 map it uses returns a value with an even number of bits set. Given the way that map is generated, I think at least half of all possible byte values will do this.
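The parity argument behind that is easy to check independently of the actual table: over a run of 32 identical bytes, a 32-bit buzhash window degenerates to the XOR of all 32 rotations of that byte's table entry, which is all zeros when the entry has an even popcount and all ones when it is odd. A small sketch:

```go
package main

import (
	"fmt"
	"math/bits"
	"math/rand"
)

// xorOfRotations returns the XOR of all 32 rotations of v, which is what a
// 32-bit buzhash window degenerates to over a run of 32 identical bytes.
func xorOfRotations(v uint32) uint32 {
	var h uint32
	for i := 0; i < 32; i++ {
		h ^= bits.RotateLeft32(v, i)
	}
	return h
}

func main() {
	for i := 0; i < 5; i++ {
		v := rand.Uint32()
		fmt.Printf("popcount %2d -> window hash 0x%08x\n",
			bits.OnesCount32(v), xorOfRotations(v))
		// even popcount prints 0x00000000, odd popcount prints 0xffffffff
	}
}
```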

Kubuxu commented 2 years ago

Also note that buzhash does this for not just zeros. It will return zero for any 32 byte run of any value where the uint8->uint32 map it uses returns a value with an even number of bits set. The way that map is generated I think that means at least half of all possible byte values will do this.

Interesting observation; we could generate a new table for it to avoid that property. The primary property for the generation was keeping a 50/50 split between one and zero bits across the whole table.
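One possible regeneration strategy, sketched under the assumptions that the cut condition is hash & mask == 0 and that forcing every entry to an odd popcount removes the degenerate all-zeros case while keeping the table near a 50/50 bit balance overall:

```go
package main

import (
	"fmt"
	"math/bits"
	"math/rand"
)

// generateTable sketches one possible strategy: draw random 32-bit entries but
// reject any with an even popcount, so no run of identical bytes can collapse
// the window hash to zero. Alternating target popcounts of 15 and 17 keeps the
// table's overall one/zero balance close to 50/50.
func generateTable(seed int64) [256]uint32 {
	rng := rand.New(rand.NewSource(seed))
	var tbl [256]uint32
	for i := range tbl {
		want := 15 + 2*(i%2) // 15 for even indices, 17 for odd
		for {
			v := rng.Uint32()
			if bits.OnesCount32(v) == want {
				tbl[i] = v
				break
			}
		}
	}
	return tbl
}

func main() {
	tbl := generateTable(1)
	total := 0
	for _, v := range tbl {
		total += bits.OnesCount32(v)
	}
	// 256 entries * 16 bits average = 4096 set bits for a perfect 50/50 split.
	fmt.Println("total set bits:", total)
}
```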

hacdias commented 1 year ago

This repository is no longer maintained and has been copied over to Boxo. In an effort to avoid noise and crippling in the Boxo repo from the weight of issues of the past, we are closing most issues and PRs in this repo. Please feel free to open a new issue in Boxo (and reference this issue) if resolving this issue is still critical for unblocking or improving your use case.

You can learn more in the FAQs for the Boxo repo copying/consolidation effort.