Are we sure that the chunking algorithm in question is already optimized, and that using SIMD instructions would not give the chunking algorithm any speed-up? (e.g. https://github.com/kevinko/rabin)
The rabin library we're using has terrible performance but that's unrelated to the algorithm or SIMD: it allocates all over the place.
We've added a chunker that uses buzhash and is almost as fast as the default chunker. Given the overhead of generating CIDs, any overhead from chunking using buzhash is now negligible.
Try using the latest go-ipfs master and passing --chunker=buzhash on add.
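For anyone curious what that chunker is doing conceptually: buzhash is a table-driven rotate-and-XOR rolling hash, so sliding a 32-byte window forward is O(1), and the stream is cut wherever the hash matches a mask, subject to min/max size bounds. A rough sketch of that style of chunker follows; the table seeding and size thresholds are placeholders rather than go-ipfs's actual values.

package main

import (
    "fmt"
    "math/bits"
    "math/rand"
)

const (
    window  = 32
    minSize = 128 << 10
    maxSize = 512 << 10
    mask    = (128 << 10) - 1 // 17 mask bits -> roughly 256 KiB average chunks
)

// chunkSizes splits data at positions where the rolling buzhash matches the
// mask, subject to the min/max bounds, and returns the resulting chunk sizes.
func chunkSizes(data []byte, table [256]uint32) []int {
    var sizes []int
    var h uint32
    start := 0
    for i, b := range data {
        // Roll the hash: rotate, mix in the incoming byte, drop the outgoing one.
        h = bits.RotateLeft32(h, 1) ^ table[b]
        if i-start >= window {
            // For a 32-byte window and a 32-bit hash the outgoing byte's
            // contribution has been rotated a full 32 bits, so it XORs out as-is.
            h ^= table[data[i-window]]
        }
        size := i - start + 1
        if (size >= minSize && h&mask == 0) || size >= maxSize {
            sizes = append(sizes, size)
            start, h = i+1, 0
        }
    }
    if start < len(data) {
        sizes = append(sizes, len(data)-start)
    }
    return sizes
}

func main() {
    var table [256]uint32
    rng := rand.New(rand.NewSource(42)) // placeholder table, not the go-ipfs one
    for i := range table {
        table[i] = rng.Uint32()
    }
    data := make([]byte, 8<<20)
    rng.Read(data)
    fmt.Println(chunkSizes(data, table))
}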
Can the deduplication power of the chunking algorithm be demonstrated using a text corpus?
Yes?
Can we set the record straight that the chunking algorithm is not meant for file compression but file deduplication?
Yes?
Can we standardize it such that old methods can be phased out once the min/max bounds are set?
I'm not sure what you're asking.
Are "semantic chunking" even considered at this point, since every file has their own type of "chunk-defining marker"?
Yes. The goal is to have a good general-purpose chunker and a bunch of per-file-type chunkers (tar, media containers, etc.). See:
Is this the best hash algorithm within a basket of alternate chunking algorithms? If so, where is the list of assessments of existing algorithms and library benchmarks? If not, is it possible to start a benchmark?
I'm not sure what you're asking.
Phasing out both the old method of static chunk sizes and the old chunker that allows different minimum and maximum chunk sizes based on parameters, since people can chunk the same file in multiple ways (assuming there is one hashing algorithm but multiple chunking algorithms).
So IPLD will handle different file types? Okay. In that case we need to cover as many "common" file formats as possible, though that might pose a few issues... File obfuscation might make a single file have multiple chunked forms depending on how people interpret the files.
Pretty much having fun, nice to know
Is this the best hash algorithm within a basket of alternate chunking algorithms?
Almost certainly not. It was just a reasonably simple algorithm to get the job done.
@ribasushi will be working on better chunking algorithms.
Phasing out both the old method of static chunk sizes and the old chunker that allows different minimum and maximum chunk sizes based on parameters, since people can chunk the same file in multiple ways (assuming there is one hashing algorithm but multiple chunking algorithms).
We'll eventually be changing the default chunking algorithm but we'll always provide options for users that need them.
So IPLD will handle different file types?
Eventually, yes. However, we have a ways to go before we're even going to consider tackling this.
@ribasushi will be working on better chunking algorithms.
Ask him to skim through all the literature if possible. Better yet, browse repos on GitHub for inspiration.
We'll eventually be changing the default chunking algorithm but we'll always provide options for users that need them.
Standardization is my main worry.
@DonaldTsang this is definitely the plan, worry not. Check back at https://github.com/ipfs/specs/issues/227 for various PR's / changes to the issue text for progress of scoping the work in the next week or so.
Does anyone know what the impact/cost of different chunk sizes on IPFS is? Smaller chunks will give much better de-duplication, but at the cost of adding way more blocks. At some point the de-duplication benefits will be outweighed by the per-block overheads. In the opposite direction, large blocks can also have overheads from having to handle/transfer such large blocks.
Regardless of which chunking algorithm we use, we need to know this to tune the chunkers. And every time the chunkers are tuned to different values, they will break de-duplication against older chunked data, so it would be good to get these right sooner rather than later.
I believe there is a current max block size of 2MB, and the default chunker uses 256KB. This suggests that 256KB is some kind of sweet spot for block sizes assuming a fixed sized chunker. For a variable chunker I would expect the sweet-spot average to be smaller than that to get extra de-duplication benefits.
Currently the buzhash chunker is hard-coded for a min/avg/max chunk size of 128KB/256KB/512KB. I suspect 64KB/128KB/512KB or 64KB/192KB/512KB might be better.
However, it really does depend on how the rest of IPFS scales with blocksize. In some ways an upper blocksize of 2MB is quite small... Google's original GFS cluster-filesystem used 64MB as the chunk size.
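As a back-of-envelope check on how those min/avg/max figures relate to mask-bit style settings: assuming the usual rule of cutting once hash & (2^maskBits - 1) hits its target after min-size (an assumption about the general scheme, not a reading of the go-ipfs source), the distance past min-size is roughly geometric with mean 2^maskBits, so avg comes out near min + 2^maskBits:

package main

import "fmt"

func main() {
    // Back-of-envelope only: assumes the chunker cuts when the masked hash
    // matches its target once past min-size, so avg ≈ min + 2^maskBits.
    configs := []struct {
        name     string
        min      int
        maskBits uint
        max      int
    }{
        {"current buzhash defaults", 128 << 10, 17, 512 << 10},
        {"suggested 64K/128K/512K", 64 << 10, 16, 512 << 10},
        {"suggested 64K/192K/512K", 64 << 10, 17, 512 << 10},
    }
    for _, c := range configs {
        fmt.Printf("%-26s min=%4dK avg≈%4dK max=%4dK\n",
            c.name, c.min>>10, (c.min+(1<<c.maskBits))>>10, c.max>>10)
    }
}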
@dbaarda very soon I will be publishing a call to the community to play exactly with this type of tuning. You happen to ask this question exactly at the end of a journey that started back in February to build a platform for evaluating this very question. The tooling for this is nearly ready (I've been saying this for a while, but this time it's for real 😸 ). You could play with it already, but the repository will be force-pushed a couple of times this week still, you've been warned. Also, there are a number of holes in the helptext (check back in a day), and there is a much more elaborate writeup coming very soon on what is and isn't useful to explore.
You can follow the instructions here: https://github.com/ipfs-shipyard/DAGger#readme An example run:
$ ls -l ~/data/cp/lin/*.zst
-rw-r--r-- 1 admin staff 163725311 Dec 21 11:12 /Users/admin/data/cp/lin/linux-5.4.6.tar.zst
-rw-r--r-- 1 admin staff 163719892 Dec 31 17:12 /Users/admin/data/cp/lin/linux-5.4.7.tar.zst
-rw-r--r-- 1 admin staff 163725207 Jan 4 19:24 /Users/admin/data/cp/lin/linux-5.4.8.tar.zst
$ tmp/maintbin/dezstd ~/data/cp/lin/*.zst | go run ./cmd/stream-dagger/ --multipart --ipfs-add-compatible-command="--cid-version=1"
{"event": "root", "payload": 938209280, "stream": 1, "cid":"bafybeiaechcvaf6fgd7yzwgpmnab3uz2ukfnp5tfcqkg3zpvbsqz2vxqoi", "wiresize": 938389542 }
{"event": "root", "payload": 938260480, "stream": 2, "cid":"bafybeigf6urjkscpvw65gaazjxuqksy7gnydvrei4utg6ih4doe4blajsy", "wiresize": 938440792 }
{"event": "root", "payload": 938280960, "stream": 3, "cid":"bafybeih3naz4652s2pf4bkrszlp5cpdgb56p5oawwdogpvyjodeecnxbqa", "wiresize": 938461272 }
Processing took 6.03 seconds using 2.30 vCPU and 53.39 MiB peak memory
Performing 48,616 system reads using 0.17 vCPU at about 445.38 MiB/s
Ingesting payload of: 2,814,750,720 bytes from 3 substreams
Forming DAG covering: 2,815,291,606 bytes across 10,805 nodes
Dataset deduped into: 2,812,653,568 bytes over 10,731 unique leaf nodes
Linked as streams by: 540,886 bytes over 66 unique DAG-PB nodes
Taking a grand-total: 2,813,194,454 bytes, 99.94% of original, 1.0x smaller
Roots\Counts\Sizes: 3% 10% 25% 50% 95% | Avg
{3} 3 L2: 1,102 1,102 1,102 | 1,102
63 L1: 5,010 8,710 8,710 8,710 8,710 | 8,533
10,731 DB: 262,144 262,144 262,144 262,144 262,144 | 262,105
$ tmp/maintbin/dezstd ~/data/cp/lin/*.zst | go run ./cmd/stream-dagger/ --multipart --ipfs-add-compatible-command="--cid-version=1" --chunkers="pad-finder_max-pad-run=65535_pad-static-hex=00_static-pad-min-repeats=42_static-pad-literal-max=1285__buzhash_hash-table=GoIPFSv0_state-target=0_state-mask-bits=17_min-size=131072_max-size=524288"
{"event": "root", "payload": 938209280, "stream": 1, "cid":"bafybeiejrrulnq75wzpjqtm3lyd6hacafalsyguikrz4teqgi6hdcbud7a", "wiresize": 962730332 }
{"event": "root", "payload": 938260480, "stream": 2, "cid":"bafybeidoq22qgydfh7nugg5zvxvi2jdx77lzc7vcu2yrykyhjpowlllete", "wiresize": 962782580 }
{"event": "root", "payload": 938280960, "stream": 3, "cid":"bafybeifqehqnrfnzhiofeus3nx27crukfunimessrsmtn5n2mlkdhsvgle", "wiresize": 962802415 }
Processing took 22.16 seconds using 2.60 vCPU and 506.74 MiB peak memory
Performing 49,750 system reads using 0.59 vCPU at about 121.15 MiB/s
Ingesting payload of: 2,814,750,720 bytes from 3 substreams
Forming DAG covering: 2,888,315,327 bytes across 1,569,465 nodes
Dataset deduped into: 1,047,604,722 bytes over 443,490 unique leaf nodes
Linked as streams by: 73,564,607 bytes over 9,027 unique DAG-PB nodes
Taking a grand-total: 1,121,169,329 bytes, 39.83% of original, 2.5x smaller
Roots\Counts\Sizes: 3% 10% 25% 50% 95% | Avg
{3} 3 L3: 946 946 946 | 946
54 L2: 1,610 8,710 8,710 8,710 8,734 | 8,317
8,970 L1: 8,123 8,129 8,135 8,143 8,223 | 8,150
474 PB: 56 89 160 279 492 | 283
443,016 DB: 31 38 47 57 10,280 | 2,364
Wow, that's an impressive amount of effort to test chunkers. I probably wouldn't have gone that far :-)
I suspect that the chunking algorithms themselves probably won't make much difference, and will mostly all perform pretty similarly, so you probably just want to choose the simplest/smallest/fastest and then standardize on it. Too many different chunkers will break de-duplication more than a single standard chunker that's only average.
Far more important is the min/avg/max chunksize they produce. You only add and chunk files once, and then they get fetched and read repeatedly. The chunk size affects the amount of de-duplication and impacts the performance and network overheads every time they are read. IMHO its far more important to find the sweet spot chunk size compromise between deduplication and performance than picking the best chunker.
Looking at your example stream-dagger runs above I have a few questions;
- In your first run am I right in assuming that the "10,805 nodes" includes DAG nodes, or are those leaf nodes and the second "10,731 unique leaf nodes" shows you actually had 74 nodes de-duplicated with the standard 256K chunker? I would be pretty surprised if it found any dupes.
- In the second run, are you using 2 chunkers at once? The first looking for runs of at least 42 zero bytes with some kind of upper chunk size of 64K, followed by a buzhash with min/avg/max chunk size of 128K/256K/512K? If so, the buzhash will do nothing, since the first chunker is already breaking it up into smaller pieces than the buzhash would.
- The first chunker looking for zero padding will break the tar file up into files (tar files use zero padding to round each file up to a multiple of 512 bytes). I don't understand all of the static-pad args, but they look like they set an upper limit of 64K chunks. The "2,888,315,327 bytes across 1,569,465 nodes" gives an average node size of less than 1.8KB. That's waaay smaller than the default 256KB for the current default chunker. How happy is IPFS with chunks that small? Chunks that small will do a pretty good job of finding duplicates though.
Having just said I don't think the actual chunker is going to matter much, I was about to contribute a Polynomial rolling hash chunker (sometimes AKA RabinKarp, but NOT the same as Rabin Fingerprint) :-) It's very similar to buzhash AKA CyclicPoly but a little simpler because it doesn't need a byte mapping table, and in my testing for librsync it gave a much better hash. I'm unconvinced that it will make much difference performance-wise for chunking, particularly for the tiny 32-byte rolling window (librsync uses it over whole 1K+ blocks), but it has simplicity on its side.
I was going to submit this as a pull request against ipfs/go-ipfs-chunker, but I see in your new ipfs-shipyard/DAGger you've changed the API. If I was going to submit this, where would you like me to submit it?
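For concreteness, a minimal sketch of the kind of polynomial (RabinKarp-style) rolling hash meant here: one odd multiplier, no byte-mapping table, and the outgoing byte removed via a precomputed mult^window factor. The window size and multiplier below are illustrative placeholders, not values used by librsync or go-ipfs.

package main

import "fmt"

// Minimal polynomial (RabinKarp-style) rolling hash over a fixed window:
// unlike buzhash there is no 256-entry byte-mapping table, just one multiplier.
const (
    window = 32
    mult   = 0x01000193 // FNV-ish odd constant, purely as an example
)

type polyRoll struct {
    hash      uint32
    outFactor uint32 // mult^window, used to drop the byte leaving the window
}

func newPolyRoll() *polyRoll {
    r := &polyRoll{outFactor: 1}
    for i := 0; i < window; i++ {
        r.outFactor *= mult
    }
    return r
}

// roll slides the window one byte forward: multiply in the new byte and
// subtract the contribution of the byte that just left the window.
func (r *polyRoll) roll(in, out byte) uint32 {
    r.hash = r.hash*mult + uint32(in) - r.outFactor*uint32(out)
    return r.hash
}

func main() {
    data := []byte("an example buffer that is comfortably longer than the rolling window size")
    r := newPolyRoll()
    // Prime the first window, treating "outgoing" bytes as zero.
    for i := 0; i < window; i++ {
        r.roll(data[i], 0)
    }
    for i := window; i < len(data); i++ {
        fmt.Printf("offset %2d hash %08x\n", i, r.roll(data[i], data[i-window]))
    }
}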
Wow, that's an impressive amount of effort to test chunkers.
It isn't though. First off this doubles as an exploration of how to arrange stream processing to "abstract away" the overhead of chunking. If you do a few tests with larger entities you will notice how things are notably quicker.
Additionally it is becoming increasingly clear that for the "thin waist IPLD" concept from way way back to really work, there needs to be a "sufficiently universal" chunking algorithm that ~80% of implementations can simply default to. Global emergent convergence is a cool thing to at least try to attain.
I suspect that the chunking algorithms themselves probably won't make much difference
You are correct that standalone algorithms don't provide much value. The key here is that combinations of them provide astonishing results. You already intuitively figured out what the padfinder does in the context of tar files; now stretch this further. This is another reason the framework exists: it had to be validated that recursive chunking can work over streams in limited memory AND that it can work with essentially no overhead ( each subchunk happens in a separate thread ) AND that the theoretical results do indeed show up in the aggregation tables.
IMHO its far more important to find the sweet spot chunk size compromise between deduplication and performance
Bingo, but add a sufficiently advanced chunk-stack to that shopping list ;)
- In your first run am I right in assuming that the "10,805 nodes" includes DAG nodes,
Yes, it's the number of nodes you need to traverse to reassemble the stream. I will change the text to read "logical nodes" and add a newline to separate them visually.
- the buzhash will do nothing, since the first chunker is already breaking it up into smaller pieces than the buzhash would
Not exactly, but this gives me an idea of how to rephrase the option names. TLDR: the first chunker's max-es deal with "pad runs"; anything between those is simply passed on to the next chunker, whatever it may be: https://github.com/ipfs-shipyard/DAGger/blob/824420aeedbc0afbb110ebe179d2532d352f6ebf/internal/dagger/ingest.go#L569-L602
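For illustration only (this is not the code behind that link, just the shape of the idea): the pad-finder claims sufficiently long runs of the pad byte as their own chunks and hands every span in between to whatever chunker comes next in the chain. The 42-byte minimum run mirrors the static-pad-min-repeats=42 option in the run above; the fixed-size inner chunker is a made-up stand-in.

package main

import (
    "bytes"
    "fmt"
)

// fixedCuts is a trivial stand-in for "whatever the next chunker may be":
// it simply splits a span into fixed 256 KiB pieces.
func fixedCuts(data []byte) []int {
    var cuts []int
    for off := 256 << 10; off < len(data); off += 256 << 10 {
        cuts = append(cuts, off)
    }
    if len(data) > 0 {
        cuts = append(cuts, len(data))
    }
    return cuts
}

// chainChunk sketches the two-stage idea: a pad-finder emits boundaries around
// long runs of the pad byte (0x00 here), and every span in between is handed
// unchanged to the next chunker in the chain.
func chainChunk(data []byte, minRepeats int, next func([]byte) []int) []int {
    var cuts []int
    spanStart := 0
    i := 0
    for i < len(data) {
        if data[i] != 0x00 {
            i++
            continue
        }
        runStart := i
        for i < len(data) && data[i] == 0x00 {
            i++
        }
        if i-runStart < minRepeats {
            continue // short zero run: not padding, leave it to the next chunker
        }
        // Everything before the pad run goes to the next chunker...
        for _, c := range next(data[spanStart:runStart]) {
            cuts = append(cuts, spanStart+c)
        }
        // ...and the pad run itself becomes its own, highly dedupable, chunk.
        cuts = append(cuts, i)
        spanStart = i
    }
    if spanStart < len(data) {
        for _, c := range next(data[spanStart:]) {
            cuts = append(cuts, spanStart+c)
        }
    }
    return cuts
}

func main() {
    // A fake "tar-like" stream: two blobs separated by a long zero-pad run.
    data := append(bytes.Repeat([]byte{'a'}, 300<<10), bytes.Repeat([]byte{0x00}, 1024)...)
    data = append(data, bytes.Repeat([]byte{'b'}, 100<<10)...)
    fmt.Println(chainChunk(data, 42, fixedCuts))
}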
- How happy is IPFS with chunks that small?
That's part of this exploration. It's just step+2 (currently in the process of wrapping up the tool so folks like yourself can play with it, and then adding the ingestion framework to go-ipfs to make it work faster on "re-hashing" ops, provided the hypothesized numbers check out in a week or so)
I was about to contribute a Polynomial rolling hash chunker ... I'm unconvinced that it will make much difference performance-wise for chunking, particularly for the tiny 32-byte rolling window (librsync uses it over whole 1K+ blocks), but it has simplicity on its side.
You are the exact customer this project has in mind ;) Unfortunately you are still just a wee bit too early: check back Monday-ish when I will have the repo "open for business". But yes - the chunker interface is what we will end up using in the end, and it is pretty stable at this point. Specifically https://github.com/ipfs-shipyard/DAGger/tree/master/internal/dagger/chunker/fixedsize is all you need to implement and then hook it up here: https://github.com/ipfs-shipyard/DAGger/blob/master/internal/dagger/core.go#L33-L39
Thanks for the questions! More - very soon.
- the buzhash will do nothing, since the first chunker is already breaking it up into smaller pieces than the buzhash would
Not exactly, but this gives me an idea of how to rephrase the option names. TLDR: the first chunker's max-es deal with "pad runs"; anything between those is simply passed on to the next chunker, whatever it may be: https://github.com/ipfs-shipyard/DAGger/blob/824420aeedbc0afbb110ebe179d2532d352f6ebf/internal/dagger/ingest.go#L569-L602
But nearly everything between those "pad runs" is 1.8KB on average, which is about the average file size in the Linux kernel, and the buzhash with those args will not break anything less than 128K into smaller chunks. So there is nothing left big enough after the pad-chunking for the buzhash to chunk any smaller.
Note that you can also make buzhash (or polyhash) chunk on zero runs without needing a pad-chunker. You just need to set state-target to the buzhash of 32 zeros, which interestingly, with the hash table mapping zero to 0x6236e7d5, gives a buzhash of 0 (is that by design?). Normally you would avoid doing that so that long runs of zeros (common for e.g. sparse files or almost-empty filesystem images) don't get turned into minimum-sized chunks, but if your min size is acceptably large it should be fine. So the following would give you a similar result as you got with the pad-chunker, in a single chunker;
buzhash_hash-table=GoIPFSv0_state-target=0_state-mask-bits=17_min-size=512_max-size=524288
That gives a chunk min/avg/max of 512/128K/512K, with the "added bonus" that it will chunk on any run of 32 zeros at least 512 bytes apart, which means it will chunk at (nearly) every file boundary in a tar file.
@dbaarda sorry for the late reply, had to take a pause from all of this.
But nearly everything between those "pad runs" is 1.8KB on average which is about the average file size in the linux kernel
And this is key: remember, you are not storing "tars". You are first and foremost storing individual files which happen to be organized in higher-order structures ( tar files ). Network-wide convergence will take shape naturally IFF we orient towards "nameable units", as opposed to "just data".
So the following would give you a similar result ...
This is why this tool was built - numbers beat a conjecture ;)
Original non-tuned padfinder run (same as earlier):
Taking a grand-total: 1,121,169,329 bytes, 39.83% of original, 2.5x smaller
Your suggestion:
Taking a grand-total: 1,313,530,216 bytes, 46.67% of original, 2.1x smaller
The 40% vs 47% difference may not seem like much, but this is just 3 files.
If we tighten up the pad-tolerance down to something like 8 bytes, we can squeeze another 2% without doing anything else:
Taking a grand-total: 1,074,222,446 bytes, 38.16% of original, 2.6x smaller
Of course this is all super-unscientific and off-the cuff. I am spending the next several days putting together actual rigorous numbers from a number of datasets so I (or others 👀 ) can figure out what works and what doesn't.
I am also almost done adding a better breakdown of what went towards "data" and what is "padding" to get better indication of where "blocks are spent".
@dbaarda sorry for the late reply, had to take a pause from all of this.
No worries... I'd rather you actually do the work than talk about it anyway :-)
But nearly everything between those "pad runs" is 1.8KB on average which is about the average file size in the linux kernel
And this is key: remember, you are not storing "tars". You are first and foremost storing individual files which happen to be organized in higher-order structures ( tar files ). Network-wide convergence will take shape naturally IFF we orient towards "nameable units", as opposed to "just data".
See my summary and thoughts related to that here;
https://discuss.ipfs.io/t/draft-common-bytes-standard-for-data-deduplication/6813/10?u=dbaarda
TLDR; if you care about identifying/deduping the contents of container-files like tar, perhaps the best solution is to make a file-type aware layer on top that extracts/transcodes/re-packs the contents... but there are reasons why you might not want to do that.
This is why this tool was built - numbers beat a conjecture ;)
Original non-tuned padfinder run (same as earlier):
Taking a grand-total: 1,121,169,329 bytes, 39.83% of original, 2.5x smaller
...
Forming DAG covering: 2,888,315,327 bytes of 1,569,465 logical nodes
Dataset deduped into: 1,047,604,722 bytes over 443,490 unique leaf nodes
Linked as streams by: 73,564,607 bytes over 9,027 unique DAG-PB nodes
Taking a grand-total: 1,121,169,329 bytes, 39.83% of original, 2.5x smaller
Your suggestion:
Taking a grand-total: 1,313,530,216 bytes, 46.67% of original, 2.1x smaller
...
Forming DAG covering: 2,867,767,609 bytes of 1,101,572 logical nodes
Dataset deduped into: 1,260,530,049 bytes over 481,927 unique leaf nodes
Linked as streams by: 53,000,167 bytes over 6,337 unique DAG-PB nodes
Taking a grand-total: 1,313,530,216 bytes, 46.67% of original, 2.1x smaller
The 40% vs 47% difference may not seem like much, but this is just 3 files.
Well... as you pointed out earlier, it's actually more like 500,000 files packed into 3 container files :-)
That 7% size saving was at the cost of nearly 50% more logical nodes and unique DAG-PB nodes. If IPFS operations scale O(N^2) on the number of logical nodes, this is not a win. It did reduce the number of unique leaf nodes by about 7% also though. This is why we need to know the impact of block size and block numbers on IPFS to assess the value of these.
If we tighten up the pad-tolerance down to something like 8 bytes, we can squeeze another 2% without doing anything else:
Taking a grand-total: 1,074,222,446 bytes, 38.16% of original, 2.6x smaller
...
Forming DAG covering: 2,931,471,303 bytes of 2,507,471 logical nodes
Dataset deduped into: 957,501,863 bytes over 481,140 unique leaf nodes
Linked as streams by: 116,720,583 bytes over 14,416 unique DAG-PB nodes
Taking a grand-total: 1,074,222,446 bytes, 38.16% of original, 2.6x smaller
And this 2% saving was at the cost of increasing the logical nodes by nearly 130%.
Of course this is all super-unscientific and off-the cuff. I am spending the next several days putting together actual rigorous numbers from a number of datasets so I (or others ) can figure out what works and what doesn't.
I am also almost done adding a better breakdown of what went towards "data" and what is "padding" to get better indication of where "blocks are spent".
I worry that too much complexity in the chunkers is going to evolve into a hand-tuned expert system trying to get the best result for each and every different file type. In my experience these evolve to a point where people become too scared to change them, since every attempt to produce better results for one corner-case ends up breaking several unrelated others. The end result is barely better than a simple generic solution, and usually ends up with some degenerate corner-cases left where it's worse. It's also waaay more complicated.
Having said that, you've sold me on the idea of a zero-padding chunker together with some kind of buzhash/polyhash chunker. There are lots of cases where zero-padding is used and would make good chunking points, like sparse files, filesystem snapshots, etc. For file types without zero-padding it will not make any difference, and it's very low overhead to include it.
Are you ready for me to send a pull request for a polyhash chunker? What should I send it against?
FTR, I've done some testing and analysis of chunker algorithms after reading the FastCDC paper here;
https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst
The results were a bit surprising to me. TLDR; FastCDC is not faster and has worse deduplication than a standard exponential chunker for the same avg/min/max chunk size and rollsum. The optimal target/min/max settings for a standard chunker for speed and deduplication are very different from the generally recommended values. The best settings are tgt_len=min_len, max_len>=4*tgt_len+min_len, for an average length of min_len+tgt_len. For speed, min_len can be increased as high as 2x tgt_len for a small sacrifice in deduplication (but still better than FastCDC).
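To make the "average length of min_len+tgt_len" claim concrete: once past min_len, a cut at any given byte is (for a good rollsum over random data) roughly an independent 1-in-tgt_len event, so the extra length is geometric with mean tgt_len. A throwaway simulation with small, arbitrary sizes lands just under min_len+tgt_len (the shortfall comes from the max_len truncation):

package main

import (
    "fmt"
    "math/rand"
)

func main() {
    // Toy model of a "standard exponential chunker": after skipping min_len
    // bytes, every byte is a cut point with probability 1/tgt_len (which is
    // what a masked-hash judgement gives you on random data), capped at max_len.
    const (
        minLen = 1024
        tgtLen = 1024
        maxLen = 4*tgtLen + minLen
        chunks = 20000
    )
    rng := rand.New(rand.NewSource(1))
    total := 0
    for i := 0; i < chunks; i++ {
        n := minLen
        for n < maxLen && rng.Intn(tgtLen) != 0 {
            n++
        }
        total += n
    }
    fmt.Printf("mean chunk length ≈ %d, min_len+tgt_len = %d\n", total/chunks, minLen+tgtLen)
}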
[...]
Note that you can also make buzhash (or polyhash) chunk on zero runs without needing a pad-chunker. You just need to set state-target to the buzhash of 32 zeros, which interestingly, with the hash table mapping zero to 0x6236e7d5, gives a buzhash of 0 (is that by design?).
Note that buzhash using a 32-byte sliding window will ALWAYS return either 0 or 0xffffffff for a 32-byte run of any character, depending on whether the hash-table mapping for that character has an even or odd number of 1's. This is one of the buzhash corner-cases that makes it weaker than e.g. rabin-fingerprints or PolyHash (AKA RabinKarp). The hash-table mapping used by IPFS seems to try to have an even distribution of 1's and 0's, so most (all?) mapping values have an even number of 1's and will give a zero result for a 32-byte run of most (all?) byte values. This corner-case can be avoided by using a sliding window size != 32. Alternatively, you can take advantage of it to always or never have break points after runs of characters.
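To spell out why: with the usual rotate-left-by-1 formulation, the hash of a 32-byte window is the XOR of table[b] rotated by 31, 30, ..., 0 bits, so when all 32 bytes are the same value b, every output bit ends up being the parity of popcount(table[b]): all zeros if even, all ones if odd. A tiny check (the table below is made up, apart from the 0x6236e7d5 zero mapping quoted above):

package main

import (
    "fmt"
    "math/bits"
)

// buzhash32 computes a 32-bit buzhash (rotate-left-and-XOR) over the window.
func buzhash32(window []byte, table [256]uint32) uint32 {
    var h uint32
    for _, b := range window {
        h = bits.RotateLeft32(h, 1) ^ table[b]
    }
    return h
}

func main() {
    var table [256]uint32
    table[0x00] = 0x6236e7d5 // the zero mapping quoted above (popcount 18, even)
    table[0x01] = 0x00000007 // made-up entry with an odd popcount (3)

    zeros := make([]byte, 32)
    ones := make([]byte, 32)
    for i := range ones {
        ones[i] = 0x01
    }

    fmt.Printf("popcount(table[0x00])=%d hash=%08x\n",
        bits.OnesCount32(table[0x00]), buzhash32(zeros, table))
    fmt.Printf("popcount(table[0x01])=%d hash=%08x\n",
        bits.OnesCount32(table[0x01]), buzhash32(ones, table))
}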
I have also done some comparisons of rollsums including using 32 byte windows here;
https://github.com/dbaarda/rollsum-tests/blob/master/RESULTS.rst
I want to leave a reference to Microsoft's work from 2012 (yes pretty old), which is the basis for the Data Deduplication feature in Windows Server 2016. This feature is pretty successful as a general purpose deduplication:
PS C:\Users\Administrator> Get-DedupStatus|Select-Object
FreeSpace SavedSpace OptimizedFiles InPolicyFiles Volume
--------- ---------- -------------- ------------- ------
251.9 GB 153.7 GB 22 23 E:
313.01 GB 134.42 GB 17 17 F:
917.06 GB 588.17 GB 7 7 V:
878.46 GB 569.42 GB 7 7 W:
672.68 GB 3.81 GB 41 41 H:
I want to leave a reference to Microsoft's work from 2012 (yes pretty old), which is the basis for the Data Deduplication feature in Windows Server 2016. This feature is pretty successful as a general purpose deduplication:
I did some analysis/comparison of that work against various alternatives here;
https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst
IMHO that paper has 2 interesting contributions;
This repository is no longer maintained and has been copied over to Boxo. In an effort to avoid noise and crippling in the Boxo repo from the weight of issues of the past, we are closing most issues and PRs in this repo. Please feel free to open a new issue in Boxo (and reference this issue) if resolving this issue is still critical for unblocking or improving your usecase.
You can learn more in the FAQs for the Boxo repo copying/consolidation effort.