Closed: RubenKelevra closed this pull request 1 year ago.
Hey @Stebalien, the previous settings are way too large to do sensible deduplication.
There's a study linked in https://github.com/ipfs/go-ipfs/issues/7966 which states that 64 KiB is a good choice for deduplication when compression is applied as well.
But that's for highly mixed data and with very limited memory per terabyte of storage.
We don't have such restrictions; users can choose buzhash for data like ISO images, VM images, uncompressed tar archives, as well as HTML, CSS, and other text-based files.
With the new buzhash size those should deduplicate very well.
For any compressed or encrypted data format, deduplication isn't possible anyway, so it can be stored in the default 256K blocks without the overhead of being stored with smaller blocks.
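For reference, here's a rough way to reproduce the block-count comparisons discussed later in this thread. This is only a sketch and assumes go-ipfs-chunker's `FromString`/`NextBytes` API; the spec strings are the same ones `ipfs add --chunker=...` accepts:

```go
// Count the blocks each chunker spec produces for a file.
// Usage: go run main.go <file>
package main

import (
	"fmt"
	"io"
	"os"

	chunk "github.com/ipfs/go-ipfs-chunker"
)

func countBlocks(path, spec string) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	s, err := chunk.FromString(f, spec)
	if err != nil {
		return 0, err
	}
	n := 0
	for {
		_, err := s.NextBytes()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return 0, err
		}
		n++
	}
}

func main() {
	for _, spec := range []string{"size-262144", "buzhash", "rabin-16384-32768-65536"} {
		n, err := countBlocks(os.Args[1], spec)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %d blocks\n", spec, n)
	}
}
```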
I agree we should change the defaults (although please also consider that we're trading off per-block overhead in terms of access times, network overhead, latencies, etc). However, if we're going to change the defaults, I'd like to let users revert to the old defaults (e.g., to ensure hashes converge with previously imported data).
I mean, sure, there's some overhead involved - but on small files, this is probably not significant. It's more important that many different versions of SQL dumps, VM images, iso images, etc. properly deduplicate.
Latency on disk shouldn't be an issue, since an average 8K block should still perform well on an SSD. That's what a database usually uses as its block size, so operating systems are already optimized for requests of that size. And since we can do more deduplication, we can cache the same amount of data more efficiently. This might even lead to better performance, depending on the amount of duplicate data.
We also reduce the transfer size, since data that hasn't changed can be fetched from the local disk.
For large file transfers, we might want to tweak the network settings to think in data size, not individual blocks. This way we would just fetch more blocks containing the same amount of data.
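A minimal sketch of that idea, using a hypothetical byte budget instead of a fixed block count (the names and numbers are illustrative, not an existing Bitswap setting):

```go
package main

import "fmt"

// Size the prefetch window by a byte budget instead of a fixed block count,
// so smaller blocks simply mean more blocks in flight.
const prefetchBudget = 2 << 20 // 2 MiB in flight (illustrative number)

func prefetchWindow(avgBlockSize int) int {
	n := prefetchBudget / avgBlockSize
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println(prefetchWindow(256 << 10)) // 8 blocks ahead
	fmt.Println(prefetchWindow(8 << 10))   // 256 blocks ahead
}
```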
The target size of this new setting is 8K since, as mentioned above, that's what databases use, which means the operating system is optimized for requests of this size.
When the blocks are stored in a database they would be stored sequentially anyway, so there's no real loss in write speed.
In terms of flatfs, we might hit some barriers since we probably sync after each block is written. So we might want to recommend turning off sync if there are performance issues.
Regarding old data: Buzhash is fairly new in ipfs and I don't think there are many users, especially since rabin gives a bigger deduplication ratio at the current settings.
I don't think we need backward compatibility here.
I mean, sure, there's some overhead involved - but on small files, this is probably not significant. It's more important that many different versions of SQL dumps, VM images, iso images, etc. properly deduplicate.
Probably only matters if you have enough information to make a guess. Please test `ipfs cat` (downloading from a remote host) on files with small chunks versus files with large chunks over a moderate-latency (100ms) connection; you'll notice a large difference. `ipfs cat` currently prefetches 5-15 blocks ahead with bitswap. I welcome patches to improve this. Setting the target block size to 8KiB will almost certainly reduce throughput on any connection with reasonable latency.
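A rough back-of-the-envelope for that claim, using the prefetch window and latency figures mentioned above (the numbers are illustrative; real Bitswap behaviour is more complicated):

```go
// Upper bound on single-stream throughput when at most `window` blocks are
// requested per round trip: throughput <= window * blockSize / RTT.
package main

import "fmt"

func main() {
	const rtt = 0.1 // seconds (the 100ms connection discussed above)
	for _, bs := range []int{8 << 10, 64 << 10, 256 << 10} {
		for _, window := range []int{5, 15} {
			mbps := float64(window*bs) / rtt / (1 << 20)
			fmt.Printf("block=%dKiB window=%d -> <= %.1f MiB/s\n", bs>>10, window, mbps)
		}
	}
}
```

With 8 KiB blocks and 15 blocks in flight, the ceiling is roughly 1.2 MiB/s over a 100ms link, versus around 37 MiB/s with 256 KiB blocks.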
In terms of defaults, people depend on being able to reproduce old hashes/blocks, both to deduplicate against existing content and for security/reproducibility. Given that making this configurable at runtime is so easy, I'm not going to lower the bar here.
I encourage you to read through old issues mentioning "block sizes and chunking" for more context on this issue.
@Stebalien wrote:
> > I mean, sure, there's some overhead involved - but on small files, this is probably not significant. It's more important that many different versions of SQL dumps, VM images, iso images, etc. properly deduplicate.
>
> Probably only matters if you have enough information to make a guess. Please test `ipfs cat` (downloading from a remote host) on files with small chunks versus files with large chunks over a moderate-latency (100ms) connection; you'll notice a large difference.
>
> - Prefetching is hard. The smaller the blocks, the worse we are at prefetching. `ipfs cat` currently prefetches 5-15 blocks ahead with bitswap. I welcome patches to improve this. Setting the target block size to 8KiB will almost certainly reduce throughput on any connection with reasonable latency.
Fair enough, but I think we can improve the performance, since it's just about latency. It's basically the same issue as TCP with a too-small receive window: we just have to send more data "blindly". There's no reason it has to be (much) slower than larger chunks.
> - Deduplication is not a given and killing perf for encrypted files just to help with files that deduplicate well is not an option.
True. But there's always the option to use 256K blocks (if not even larger ones) for this type of data. There's basically no point in using a rolling hash for it to begin with.
So maybe fixed 256K chunks should be the default, and if we're dealing with files we could switch the chunker based on the MIME type of the file when no chunker is manually set by the user?
> In terms of defaults, people depend on being able to reproduce old hashes/blocks, both to deduplicate against existing content and for security/reproducibility. Given that making this configurable at runtime is so easy, I'm not going to lower the bar here.
Okay.
> I encourage you to read through old issues mentioning "block sizes and chunking" for more context on this issue.
Will do.
@Stebalien I'll first verify my assumption that such small blocks are necessary for good deduplication with some sample data, before I proceed.
I just don't trust this Microsoft paper and still think 4-16 K is pretty much what we need.
But will we actually "snap" to 4K blocks or 512 bytes, or will it output a random byte length?
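As I understand it, a rolling-hash chunker cuts wherever the hash matches a mask, so block lengths land anywhere between the min and max bounds rather than snapping to 4K or 512-byte multiples. A minimal sketch of that cut-point rule (the hash update and mask value are illustrative, not the exact buzhash implementation):

```go
package main

import (
	"fmt"
	"math/rand"
)

// Illustrative content-defined chunking loop: a cut happens wherever the
// rolling hash matches the mask (once past min), so chunk lengths are
// arbitrary byte counts between min and max, not multiples of 4K or 512.
func chunkLengths(data []byte, min, max int, mask uint32) []int {
	var lengths []int
	var h uint32
	start := 0
	for i, b := range data {
		h = h<<1 ^ uint32(b) // stand-in for the real rolling-hash update
		n := i - start + 1
		if (n >= min && h&mask == 0) || n >= max {
			lengths = append(lengths, n)
			start = i + 1
			h = 0
		}
	}
	if start < len(data) {
		lengths = append(lengths, len(data)-start)
	}
	return lengths
}

func main() {
	r := rand.New(rand.NewSource(1))
	data := make([]byte, 1<<20)
	r.Read(data)
	// The printed lengths are arbitrary byte counts between min and max,
	// not round numbers.
	fmt.Println(chunkLengths(data, 4<<10, 64<<10, 1<<13-1)[:5])
}
```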
@Stebalien I'm finally done - this took longer than expected.
TL;DR: my candidates are `12k-24k-48k` and `16k-32k-64k`. LMK if there's any use-case I missed that would greatly benefit from deduplication on the block level. :)
ipfs rolling hash test - Sheet1.pdf
/ipfs/bafykbzacedsteoatejnapzdhnhoaiyc2y5wsypl6jsif4pf5uwjsztdsk3owa
Ah, and fixed-size 8K and 16K chunks are actually pretty great for databases. So in theory we could recommend those for the maximum amount of deduplication of database data.
They even outperform `rabin-4k-8k-16k`, which means knowledge of the alignment of the data is better than simply throwing rolling hash sums at anything.
Awesome work! Were you able to test transfer times as well?
This would have taken too long for all options I've tested. I can run this test, but only for a limited number of cases.
Do you agree with my candidates?
@Stebalien, I forgot the ping :)
@Stebalien here's my pin time test:
ipfs rolling hash pin-time - Sheet1.pdf
/ipfs/QmPMsvgjoQHQPvw6vWjgHRvC3YddErgAacg9iGu4YLzQsp
Note that this is the worst-case scenario, since this connection is plagued with bufferbloat. Additionally, as the corpus is only one big file, there are only large blocks. In my first test I also had folders with many small files, in which case the results would be closer.
I only tested TCP, to get more reliable results, as UDP is sometimes deprioritized by providers.
Result for me: go with `12k-24k-48k`, as it's a fair compromise between small chunks and overhead. The performance penalty for transfers has nothing to do with processing power, just with the missing handling of longer latencies, so it should be fixable.
A few questions:

- Is that the buzhash with new defaults or buzhash with 256KiB blocks?
- What was the latency?
- How many blocks?
- Which command? `ipfs pin add` or `ipfs cat` (both behave differently and will perform differently).

Note: I agree it's "fixable", that's not the problem. The problem is that it needs to be fixed first before we can start using it. Otherwise, we just degrade perf for everyone in the meantime.

Given the numbers I'm seeing, it sounds like `16k-32k-64k` is acceptable (but I'd still need the answers to the rest of those questions to tell) (and this still needs to be configurable).
@Stebalien wrote:
> - Is that the buzhash with new defaults or buzhash with 256KiB blocks?

I've tested with the git commit which is in the profile of the node on the right.

> - What was the latency?

The latency of the connection is in the document, once idle and once while downloading (with 256K blocks - which did hit the ceiling on this connection).

> - How many blocks?

The file size is 742 MB, and the average block size should determine how many blocks each chunker created, but I can look that up if the exact number is important.

> - Which command? `ipfs pin add` or `ipfs cat` (both behave differently and will perform differently).

I've used `time ipfs pin add --progress <CID>`, and then unpinned and ran `ipfs repo gc` between the pin adds.
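For repeatability, the same procedure could be scripted; a rough sketch that shells out to the `ipfs` CLI commands used above (the CID list is a placeholder):

```go
// Time `ipfs pin add` for each candidate CID, unpinning and garbage-collecting
// between runs so each pin starts from an empty local blockstore.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func run(args ...string) error {
	return exec.Command("ipfs", args...).Run()
}

func main() {
	cids := []string{ /* CIDs of the differently-chunked imports */ }
	for _, cid := range cids {
		start := time.Now()
		if err := run("pin", "add", "--progress", cid); err != nil {
			panic(err)
		}
		fmt.Printf("%s: %s\n", cid, time.Since(start))
		run("pin", "rm", cid)
		run("repo", "gc")
	}
}
```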
> Note: I agree it's "fixable", that's not the problem. The problem is that it needs to be fixed first before we can start using it. Otherwise, we just degrade perf for everyone in the meantime.
>
> Given the numbers I'm seeing, it sounds like `16k-32k-64k` is acceptable

Okay

> (but I'd still need the answers to the rest of those questions to tell) (and this still needs to be configurable)
Well, I think we shouldn't make this configurable. A new and an old buzhash should be enough. If everyone experiments with the settings, we end up with the same situation as with rabin, where basically every dataset has different chunker settings - which doesn't help network-wide deduplication.
I actually think we should try to determine the MIME type of the files added, and select static chunks or the smaller buzhash accordingly for the user. So compressed/encrypted or otherwise non-deduplicatable files get large chunks with good performance, and if we encounter HTML/CSS/JS/SQL/XML/etc. we switch to buzhash.
This could be named "adaptive", for example.
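A minimal sketch of that "adaptive" idea, assuming go-ipfs-chunker's `Splitter` interface and its `NewBuzhash`/`NewSizeSplitter` constructors, and using the standard library's `http.DetectContentType` for sniffing (the type list is a placeholder, not an existing option):

```go
// Pick a chunker from a content-type sniff of the file's first bytes:
// compressed/encrypted-looking data gets fixed 256K blocks, text-like data
// gets the rolling-hash chunker.
package main

import (
	"bufio"
	"io"
	"net/http"
	"strings"

	chunk "github.com/ipfs/go-ipfs-chunker"
)

func adaptiveSplitter(r io.Reader) (chunk.Splitter, error) {
	br := bufio.NewReader(r)
	head, _ := br.Peek(512) // http.DetectContentType looks at up to 512 bytes
	ctype := http.DetectContentType(head)

	if strings.HasPrefix(ctype, "text/") ||
		strings.Contains(ctype, "xml") ||
		strings.Contains(ctype, "json") {
		// Diff-friendly data: the (here current) buzhash stands in for the
		// proposed small-block variant.
		return chunk.NewBuzhash(br), nil
	}
	// Everything else (already compressed, encrypted, media, ...) keeps the
	// default 256K fixed-size blocks.
	return chunk.NewSizeSplitter(br, 256*1024), nil
}
```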
To continue this PR I think it makes sense to change `buzhash`, while `buzhash-legacy` keeps the old chunking parameters. This way users will automatically switch to the new buzhash chunking, while everyone who needs to reproduce a CID from existing data can use the `buzhash-legacy` chunker.
How does this sound?
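A sketch of what that could look like in the chunker-spec parsing; the type, function, and all parameter values are hypothetical, not the actual go-ipfs-chunker code or defaults:

```go
package chunk

// Hypothetical mapping from chunker spec strings to buzhash parameter sets:
// "buzhash" would get the new, smaller bounds, while "buzhash-legacy" keeps
// the previous ones so old CIDs stay reproducible.
type buzhashParams struct {
	min, max int
	mask     uint32
}

func buzhashParamsFromSpec(spec string) (buzhashParams, bool) {
	switch spec {
	case "buzhash": // proposed new, smaller bounds (illustrative numbers)
		return buzhashParams{min: 16 << 10, max: 64 << 10, mask: 1<<15 - 1}, true
	case "buzhash-legacy": // current bounds (illustrative numbers)
		return buzhashParams{min: 128 << 10, max: 512 << 10, mask: 1<<17 - 1}, true
	default:
		return buzhashParams{}, false
	}
}
```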
@Stebalien wrote:
> - How many blocks?
- 256k: 3136 blocks - /ipfs/QmQ268dbzaWMLvweiDcRQXmapTUx1PKE5rivPsHPNXvgSY
- rabin-12k-24k-48k: 30644 blocks - /ipfs/QmefWQeJHY5hN4BSJw7v1gC5m85gwGHAgUMNJ6nJXEFpk2
- buzhash: 3279 blocks - /ipfs/QmSSrnWmXcnST4jD4qnU9WNKaqEV6WNzKQEKzyYfobe37D
- rabin-16K-32K-64K: 19805 blocks - /ipfs/QmW7rCHh7QZH8Cqjeoum2f8Gzrh66rKydoT6BgWEhvqHnA
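For context, dividing the 742 MB corpus mentioned earlier by these counts gives the effective average block sizes (assuming 742 MB means 742,000,000 bytes):

```go
// Effective average block size per chunker, from the block counts above.
package main

import "fmt"

func main() {
	const fileSize = 742_000_000 // bytes (the 742 MB test file)
	counts := map[string]int{
		"256k":              3136,
		"rabin-12k-24k-48k": 30644,
		"buzhash":           3279,
		"rabin-16K-32K-64K": 19805,
	}
	for name, n := range counts {
		fmt.Printf("%-18s ~%d KiB/block\n", name, fileSize/n/1024)
	}
}
```

That works out to roughly 231 KiB per block for 256k, 23 KiB for rabin-12k-24k-48k, 220 KiB for buzhash, and 36 KiB for rabin-16K-32K-64K.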
Note that I've done extensive testing of chunker sizes here:
https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst
And in my tests the optimal min-tgt-max sizes for a given avg chunk size are `[avg/2]-[avg/2]-[avg*4]`. This is optimal for deduplication with good speed (cut-point-skipping). The optimal for speed with still good deduplication is around `[avg*2/3]-[avg/3]-[avg*4]`. Yes, that is, "min" is greater than "avg" AKA "tgt".
I note you didn't test using settings like this... it would be interesting to see how it compares.
These do NOT match what you are proposing.
Some observations about the rabin settings you were using in your testing in https://github.com/ipfs/go-ipfs-chunker/files/6819330/ipfs.rolling.hash.test.-.Sheet1.pdf:

- For settings of the form `[min]-[tgt]-[max]`, the average chunk size is actually min+tgt, which is why I prefer using "target" or "tgt" for the middle setting, as using "avg" is misleading.
- You want `max >= 4*tgt + min`, or better `max >= 4*avg = 4*(min+tgt)`. Setting max smaller than that seriously degrades deduplication for no wins (large chunks are less overhead). Even just using max=1M would probably be better. It's this degradation that probably explains why buzhash and rabin performed much worse than the fixed-size chunkers in some cases.
- If you want an average block size of 32K, I'd recommend 16K-16K-128K, and for an average block size of 64K (which is a sweet spot identified by Microsoft in their chunking tests) I'd go with 32K-32K-256K. Though in both cases you can set the max higher if you want. If you want to compare against the default 256K average chunk size, I'd use 128K-128K-1M.
@dbaarda wrote:
> - Note that buzhash has an interesting corner-case that it always breaks at a run of 32 zeros. Things like tar use runs of zero padding between files so it often nicely breaks at file boundaries within a tar file. However, it also means long runs of zeros, like empty blocks in filesystem images, get turned into lots of min-size blocks.
Okay, but that doesn't really matter, as the CIDs are all the same, right?
So you get like 32391 of the same CID in a row.
--
Thanks for all the feedback. Def worth going over this again. :)
Sorry for the late reply: yes, all the min-size zero blocks have the same CID. The (maybe minor) problem is that this gives you more blocks, and thus the per-block metadata/etc overheads are worse. But that might be insignificant... I dunno.
Also note that buzhash does this for not just zeros. It will return zero for any 32 byte run of any value where the uint8->uint32 map it uses returns a value with an even number of bits set. The way that map is generated I think that means at least half of all possible byte values will do this.
@dbaarda wrote:

> Also note that buzhash does this for not just zeros. It will return zero for any 32 byte run of any value where the uint8->uint32 map it uses returns a value with an even number of bits set. The way that map is generated I think that means at least half of all possible byte values will do this.
Interesting observation, we could generate a new table for it to avoid that property. The primary property for the generation was keeping a 50/50 split between one and zero bits in the whole table.
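As I understand the buzhash construction, for a window of 32 identical bytes the hash is the XOR of all 32 rotations of that byte's table entry, which collapses to zero exactly when the entry has an even number of bits set. A small self-contained check of that statement (it doesn't use the real table, just the rotation/XOR structure):

```go
// XOR of all 32 rotations of a 32-bit value is 0xFFFFFFFF when the value has
// an odd popcount and 0 when it has an even popcount - which is why a 32-byte
// run of a byte whose table entry has even popcount hashes to zero.
package main

import (
	"fmt"
	"math/bits"
)

func xorOfAllRotations(v uint32) uint32 {
	var h uint32
	for i := 0; i < 32; i++ {
		h ^= bits.RotateLeft32(v, i)
	}
	return h
}

func main() {
	for _, v := range []uint32{0x0F0F0F0F /* even popcount */, 0x0F0F0F0E /* odd popcount */} {
		fmt.Printf("popcount=%2d -> combined hash %#08x\n", bits.OnesCount32(v), xorOfAllRotations(v))
	}
}
```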
This repository is no longer maintained and has been copied over to Boxo. In an effort to avoid noise and crippling in the Boxo repo from the weight of issues of the past, we are closing most issues and PRs in this repo. Please feel free to open a new issue in Boxo (and reference this issue) if resolving this issue is still critical for unblocking or improving your usecase.
You can learn more in the FAQs for the Boxo repo copying/consolidation effort.
Fixes https://github.com/ipfs/go-ipfs/issues/7966