File changes without pandemic rehashing

Gozala commented 5 years ago

Hi folks,

Recently I've learned about jump-rope data structure which is similar to merkle trees but with addition that it avoid pandemic rehashing if e.g. one byte is inserted in the middle of the file. It had being presented at the the strange loop.

I am bringing this up here because it seem like a very handy property to have in a filesystem.

aschmahmann commented 5 years ago

Looks interesting. If I understand correctly, the pandemic rehashing experienced with UnixFS is dependent on what type of chunker you use for a given file. For example, while a naive "split every X bytes" chunker will lead to lots of rehashing on an insert/delete, I don't think the Rabin chunker should experience the same type of issues as it uses the same rolling hash strategy as the jump-rope.

Hopefully other people here will be able to give a better understanding, but it seems to me that even with the current spec proposal the jump-rope could probably be implemented as just a chunker that creates the UnixFS object.

mikeal commented 5 years ago

For example, while a naive "split every X bytes" chunker will lead to lots of rehashing on an insert/delete, I don't think the Rabin chunker should experience the same type of issues as it uses the same rolling hash strategy as the jump-rope.

Rabin is, overall, more expensive during import and re-import than a fixed chunker. Rabin only gives us the block boundaries, we still have to hash every block in order to see if it has changed or not. Since rabin is much slower than a fixed length chunker due to its rolling fingerprint the whole process is slower, and it will usually produce more blocks than a fixed length chunker so that means more hashing of the blocks.

Haven’t looked at “jump-rope” data structures before, but I have looked at “rope” data structures which I assume these are based on. Rope data structures are much nicer when editing, appending and slicing, but we actually don’t do that too much.

The workflow w/ IPFS tends to be:

Import a file.
Import the file again to see if it has changed.

Since we can’t inject code into the filesystem to handle changes, we don’t really know if a file has changed or not and need to re-import it in order to tell.

With the Rabin chunker we limit the number of blocks that change in our data structure when minor changes happen in the file, but we still end up chunking/hashing the whole file in order to tell which part has changed. We don’t store any data from the rolling fingerprint that would help during another import in order to make this faster, and I’m not sure you can with Rabin (I haven’t seen an example of this but maybe it’s possible). This constant re-import is the only option we have when we’re looking at files on disc and comparing them, which is the primary workflow, although we do have an API for in-place writes if you write directly to IPFS.

The only API we have for in-place mutations is the write API which takes an optional offset. This API is the only one that could potentially benefit from a rope-based data structure, but we already have pretty good data structures for this, especially if you’re using the Rabin chunker, and we already don’t have to re-hash all the chunks in the file, only the chunk(s) you specifically want to write (+ up to 2 chunks on either side of the write if your new write splits previously written chunks on either side). Now, I don’t know how efficient our currently implementations are, but the data structures are there for a pretty optimal edit, particularly if you use Rabin.

mikeal commented 5 years ago

Also, I should mention that one hazard with editing this way is that the file hash is unlikely to match if you do a full re-import of the file again, even though the underlying data is the same, because the chunk boundaries are unlikely to be identical.

warpfork commented 5 years ago

I want to reiterate, just in case this gets lost somehow, that the main reason folks who work on IPFS tend to think Rabin-chunking is slow... is because our implementation, currently, is bad. It is not because Rabin-chunking is inherently slow.

Our current implementation of a Rabin-chunker in go-ipfs causes about two mallocs per byte. This is clears throat not good. It is also not hard to improve upon. Dramatically.

It just requires someone to sit down and pay full attention to it for long enough to land a change.

aschmahmann commented 5 years ago

Since we can’t inject code into the filesystem to handle changes, we don’t really know if a file has changed or not and need to re-import it in order to tell.

That's not really true. With mount we can certainly inject code into the filesystem. Additionally, while it's nice to support timestamps to help with things like rsync, realistically the whole "reimport the whole thing every time" strategy isn't particularly brilliant and we should probably prepare for people who want to be a bit smarter at the application level.

the file hash is unlikely to match if you do a full re-import of the file again

Well the import functions are all deterministic so this shouldn't be an issue. If the concern is configurable parameters then we already have that problem and should put together solutions for it. For example, if the same data is imported but with a different multicodec, CID version, unixFS version, ...

mikeal commented 5 years ago

I want to reiterate, just in case this gets lost somehow, that the main reason folks who work on IPFS tend to think Rabin-chunking is slow... is because our implementation, currently, is bad. It is not because Rabin-chunking is inherently slow.

Our implementation may be poor but the current algorithm does have some performance constraints. According to @mafintosh there’s a paper that details a new method that should be much faster and I believe he even has someone working on an implementation. Regardless, it is a basic fact that doing anything smart will always be slower than fixed length chunking because it bears practically zero computation cost 🤓 It’s a worthy tradeoff for de-duplication but it is a tradeoff.

The perf difference between chunkers should be quite negligible when editing though, it’s most visible when initially importing a large amount of data.

Well the import functions are all deterministic so this shouldn't be an issue

I should have been clearer in my statement. If you use write to mutate part of a file and we edit it in-place instead of re-importing the full file the resulting file hash is unlikely to match the hash of the same data when doing a full re-import because it’s unlikely you would get the exact same chunk boundaries from the chunker as your edits happened to give you.

That's not really true. With mount we can certainly inject code into the filesystem.

Good point. If we connect mutation operations up to the write API and it’s implemented efficiently we’ll avoid full re-imports when files are edited.

Gozala commented 5 years ago

I may be misunderstanding something, but I believe “jump rope” unlike rabin-chunker will end up mostly same chunks if modified file is just imported. That is because chinking algorithm is deterministic.

From what I got jump-rope is inspired by rsync and skip-list

Stebalien commented 5 years ago

@Gozala

The only way to avoid hashing blocks is to edit in-place. This doesn't require any special data structures.

However, a content-based chunker would allow us to edit in-place and maintain convergent chunk boundaries. That is, the version edited in-place would have the same final CID as a re-import of the same file (assuming we re-chunked properly when editing in-place).

@Gozala

WRT ropes: we definitely don't want a rope. A rope has pointers from each chunk to the next chunk. Given that we use content addressing, a change in a chunk would change every chunk directly or indirectly pointing to that chunk.

@mikeal

Rabin is, overall, more expensive during import and re-import than a fixed chunker. Rabin only gives us the block boundaries, we still have to hash every block in order to see if it has changed or not. Since rabin is much slower than a fixed length chunker due to its rolling fingerprint the whole process is slower, and it will usually produce more blocks than a fixed length chunker so that means more hashing of the blocks.

This is misleading.

Since rabin is much slower than a fixed length chunker due to its rolling fingerprint the whole process is slower.

The cost of sha256 should dwarf the cost of rabbin (just not our implementation). Yes, it won't be free. But it shouldn't even be on the list of things we need to care about optimizing.

and it will usually produce more blocks than a fixed length chunker so that means more hashing of the blocks.

Not necessarily. You can set the min, avg, and max block sizes such that you get the same number of blocks on average.

Also, as long as the blocks are bigger than 32 bytes, more blocks do not mean more hashing (mostly). sha256 is a streaming hash function so the amount of hashing one does is purely a function of the amount of input data, not how that data is chunked.

@mikeal

The workflow w/ IPFS tends to be:

Currently. As noted by @aschmahmann, that's mostly because go-ipfs's mount support has always been a bit broken. Once we fix that, I expect this to change. The primary reason we're working on improving mount support is to allow users to update files in-place without having to re-import everything.

Also, I should mention that one hazard with editing this way is that the file hash is unlikely to match if you do a full re-import of the file again, even though the underlying data is the same, because the chunk boundaries are unlikely to be identical.

Again, with rabin, they should be identical (as long as we check the edited chunks for changed chunk boundaries).

Stebalien commented 5 years ago

(ok, to be accurate, minimum/maximum block sizes mean that we won't converge in all case, even if we use rabin)

Gozala commented 5 years ago

The only way to avoid hashing blocks is to edit in-place. This doesn't require any special data structures.

However, a content-based chunker would allow us to edit in-place and maintain convergent chunk boundaries. That is, the version edited in-place would have the same final CID as a re-import of the same file (assuming we re-chunked properly when editing in-place).

I was not trying to imply jump-rope avoided hashing blocks, just that due to content derived chunking algorithm all unmodified ranges would end up with chunks / hashes as pre modified version. Which seems like a great property in terms of deduplication / sharing.

WRT ropes: we definitely don't want a rope. A rope has pointers from each chunk to the next chunk. Given that we use content addressing, a change in a chunk would change every chunk directly or indirectly pointing to that chunk.

Well the naming jump rope is unfortunate as it seems to mislead into thinking it’s a rope. I think it’s worth looking into even if ultimately deciding that it provides little to no advantage over rabin chunking.

If I find spare cycles I’ll try to take another deeper look to write up concise summary of it here

Stebalien commented 5 years ago

Hm, actually, it appears that I misunderstood what a rope is. It looks like we're talking about the same thing (approximately): using the content to figure out how to turn a file into a tree.

rvagg commented 1 year ago

closing for archival

ipld / legacy-unixfs-v2

File changes without pandemic rehashing #30