ipfs / notes

IPFS Collaborative Notebook for Research

pinning index (as ipfs objects) and cdb discussion #4

Open jbenet opened 9 years ago

jbenet commented 9 years ago

this is a transplant from: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/ipfs-users/BI-P0H41-2E/d4Gd0akbYPoJ

jbenet commented 9 years ago

@tv42 said:

Hi. I'm looking at switching pinning state from datastore to IPFS objects. The datastore will just point to the root of "the pinning tree". Direct and recursive pins are essentially sets, and indirect pins are currently a map {key: refcount}, but I think a better model for those is going to be a multi-valued key-value store, with {key1: pinner1, key1: pinner2, key2: ...}, etc.

Pin data changes rarely, but needs reasonably fast lookups.

We need to just get something done fairly quickly, so simple is better than state of the art, right now. Here's my current draft. Feedback is welcome.

Key-value store using IPFS objects

IPFS objects have schema (Links, Data) where Links is a list of (Name, Size, Hash). Name must be UTF-8 and human friendly [pending https://github.com/ipfs/go-ipfs/issues/1172 ], so we choose to leave it empty. Link Hash cannot be empty.

Each object Data is a CDB file:

There are three kinds of objects:

Empty

The well-known hash value QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n has no Links and no Data. We will use it to signal empty.

Small

An object with no Links is small.

Its Data is a CDB file, containing keys and values.

When adding key K with value V, nodes are kept small if size + 24 + len(K) + len(V) + 4 <= maxBytes (for explanation of the 4, see large objects).
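
A minimal sketch of this check in Go; the helper name and the size/maxBytes parameters are illustrative, not an existing API:

// fitsSmall reports whether adding key K with value V keeps the object small,
// per the rule above: size + 24 + len(K) + len(V) + 4 <= maxBytes. The extra
// 4 accounts for the seed bytes a large object appends to its Data.
func fitsSmall(size, maxBytes int, k, v []byte) bool {
    return size+24+len(k)+len(v)+4 <= maxBytes
}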

If removing a key results in a small object becoming empty, its hash in the parent can be changed to empty.

Large

An object with Links is large. Its Data is a CDB file, just as with small objects, with an extra 4 random bytes at the end. The extra bytes are ignored by CDB tools.

A large object stays large as long as it has even one Link that is not Empty. If a large object is made small, its Links are removed, and the last 4 bytes of Data are trimmed.

When a small object is made large, it has fanout Links added, each set to Empty, and 4 random bytes appended to Data. fanout may have different values at different versions of the software. While an object is large, the number of Links does not change (to preserve partitioning).
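
A hedged sketch of that conversion in Go, with hypothetical object/link types (the real go-ipfs types differ); it only illustrates the two steps named above:

import "crypto/rand"

// Hypothetical stand-ins for the merkledag object schema.
type link struct {
    Name string
    Size uint64
    Hash string
}

type object struct {
    Links []link
    Data  []byte
}

// The well-known empty object used to fill unused fanout slots.
const emptyHash = "QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n"

// makeLarge converts a small object into a large one: add fanout Links, each
// pointing at the empty object, and append a 4-byte random seed to Data.
func makeLarge(o *object, fanout int) error {
    for i := 0; i < fanout; i++ {
        o.Links = append(o.Links, link{Hash: emptyHash})
    }
    seed := make([]byte, 4)
    if _, err := rand.Read(seed); err != nil {
        return err
    }
    o.Data = append(o.Data, seed...) // ignored by cdb tools, read back as the hash seed
    return nil
}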

If a key is not in Data, the Link to look it up in is the one at index h(seed || key) % len(Links) where h is a 32-bit FNV-1a hash, seed is the last 4 bytes of Data, and || is a string concatenation operation.

Removing keys may make a large object's Data smaller than maxBytes. This is valid. Adding keys adds them to Data if there is room.

If a key exists in Data, it will not be looked up in Links. This means removing a key must recursively remove the key from the relevant child in Links, to avoid shadowed old values from becoming visible.

jbenet commented 9 years ago

@wking said

On Thu, Apr 30, 2015 at 07:42:20PM -0700, Tommi Virtanen wrote:

Direct and recursive pins are essentially sets, and indirect pins are currently a map {key: refcount}, but I think a better model for those is going to be a multi-valued key-value store, with {key1: pinner1, key1: pinner2, key2: ...}, etc.

If a key is not in Data, the Link to look it up in is the one at index h(seed || key) % len(Links) where h is a 32-bit FNV-1a hash, seed is the last 4 bytes of Data, and || is a string concatenation operation.

Why hash the keys with a seed, since the keys are already multihashes (the pinned node's ID)? I asked this on IRC 1, but I'm still not clear on what non-multihash keys you intend to store in this structure. Hashing keys a second time isn't going to break anything, but it does sound like it will make lookup slower, especially with a different random seed for each object in the pin tree.

Other than that, this looks reasonable to me. But, as you may have gathered from our IRC discussion, I'm not really a low-level guy, so take my opinions here with a grain of salt ;).

Cheers, Trevor

jbenet commented 9 years ago

I'll think about this more and reply with something more solid tomorrow, but basic feedback:

jbenet commented 9 years ago

also tagging (for keeping in the loop + surfacing thoughts if relevant): @whyrusleeping @cryptix @krl

tv42 commented 9 years ago

Without something in the x % len(Links) changing, the tree is worthless beyond the first level: more levels wouldn't splay out the keys any more.

If I put the pins in the Links.Hash, things change:

  1. quick lookups no longer exist (have to scan the slice). I could, however, build an index in Data, and get us back to this point (except without being able to reuse cdb code).
  2. multi-level becomes harder; how will I know which Links point to pinned objects, which ones just point to more pins. I guess I could put a bitmap in Data, too. Or abuse Link.Name for that, but that feels like abuse.
tv42 commented 9 years ago

Thinking out loud: mph/cdb-style hash table in Data, result is 0<=i<=n, answer is Links[i] if Links[i].Hash == key. Links > n are inner nodes.

wking commented 9 years ago

On Fri, May 01, 2015 at 07:47:05AM -0700, Tv wrote:

Without something in the x % len(Links) changing, the tree is worthless beyond the first level: more levels wouldn't splay out the keys any more.

Ah, right. So you'd want to trie the crypto hash and shard QmSomePinnedHash into sh/Ha/QmSomePinned (or use the prefixes after stripping off the non-crypto bits, or whatever, see jbenet/multihash#1). That means that you'd have to know your depth to figure out which part of the multihash to use, but tracking depth while you traverse the tree sounds cheaper than re-hashing with a random seed for each object.

If I put the pins in the Links.Hash, things change:

  1. quick lookups no longer exist (have to scan the slice).

I don't understand “scan the slice”; with sorted link lists (ipfs/go-ipfs#915) you can do a binary lookup to find your target key quickly. If you were looking for a particular pin (e.g. for insertion or removal) and using links (instead of cdb), I'd suggest logic like:

  1. Look for a key matching your full multihash in the link list.
     a. If you find it, that's your entry. For inserts, there's nothing to do. For deletes, remove the entry, and possibly collapse the object into its parent if the removal shrinks it below your size threshold.
     b. If you don't find it, get the depth-appropriate slice from your crypto hash, and look up that key in the link list.
        i. If you find it, enter that node, increment the depth, and go back to (1).
        ii. If you don't find it, your hash isn't listed. For deletes, there's nothing to do. For inserts, if the addition will keep you under your size threshold, add the multihash as a link; if it will bump you over your size threshold, create a new child node for the key, and push all the full-multihash link keys with a matching depth-appropriate slice down into the new child node.

  2. multi-level becomes harder; how will I know which Links point to pinned objects, which ones just point to more pins. I guess I could put a bitmap in Data, too. Or abuse Link.Name for that, but that feels like abuse.

Just use an invalid multihash prefix for the depth-appropriate slice entries. The multihash spec reserves 0x00-0x0f for application-specific functions, so the base-whatever encoded version of 0x00 0x00 … should be a reasonable prefix. As a bonus, this sorts your link name space into {all the sub-object links} and {all the full multi-hash links} sets, and you can cache that split boundary from the (1) search to make the (b) search faster.

As an aside, I'm not sure how, given a byte string that might be a base-whatever-encoded multihash, you should determine what base the encoding is in. Are multihashes in link values required to be in base-58? Are multihashes in the link values just raw binary multihashes? If we want to use keys in the link names we should probably either pick a standard encoding, record the chosen encoding information somewhere (type? ipfs/ipfs#36), or allow binary link names (ipfs/go-ipfs#1172). Of those choices, I like storing the multihash-to-UTF-8 encoding in a type reference best.

Anyhow, I'd be surprised if keeping everything in links is going to perform as well as the cdb approach, but it sounds more native and like less work, so it may be easier to implement and fast enough. I don't mind if we use the cdb approach or not, but I think both approaches are workable. I do think depth-based slicing is better than random-seed rehashes though.

jbenet commented 9 years ago

Without something in the x % len(Links) changing, the tree is worthless beyond the first level: more levels wouldn't splay out the keys any more.

We can use different substrings of the hash. It is a cryptographic hash -- which makes it appear (to a bounded adversary, etc) uniformly distributed at any bit length. So grabbing different substrings helps -- I think @wking suggests the same above.

multi-level becomes harder; how will I know which Links point to pinned objects, which ones just point to more pins. I guess I could put a bitmap in Data, too. Or abuse Link.Name for that, but that feels like abuse.

unixfs dirs extend each of the links in the data section. We could do this, or extend the link protobufs themselves (as has been discussed elsewhere). Agreed about not abusing the name.

tv42 commented 9 years ago

The problem with using the height as input to the hash function is that now debugging by staring at a single object is harder; I'd be inclined to put the level in the object, and at that point, I might as well put in something else. And without a randomized hash, the data structure isn't very generic; it can't be safely & robustly used to store non-cryptohash, user-chosen keys.

wking commented 9 years ago

On Wed, May 06, 2015 at 02:11:08PM -0700, Tv wrote:

The problem with using the height as input to the hash function is that now debugging by staring at a single object is harder…

How frequently will you need to do this? It doesn't seem all that much harder (one extra parameter to log vs. calculating a new hash for each object you're looking at during your debug pass).

And without a randomized hash, the data structure isn't very generic; it can't be safely & robustly used to store non-cryptohash, user-chosen, keys.

If we're looking at a generic fanout index structure, then this is a reasonable argument. Rehashing the key you're searching for at every object is not that steep a cost. But pushing the link payloads into a cdb payload structure seems like too much of a stretch for this approach to be a generic solution. If we keep the links and their target hashes in the Links section 1, then we get interoperability with the usual IPFS linking, but then the cdb is less useful…

I dunno. I think it's hard to get efficient fanout and name-based lookup with the dual constraints of the base IPFS object spec and protobuf's variable-length fields (for more thoughts on indexing vs. protobufs, see ipfs/archive-format#6).

tv42 commented 9 years ago

Next draft. I did not change the hash function yet, but now that the keys are actually stored in Links.Hash, they have to be IPFS Keys anyway, so assuming uniformity is more acceptable.

Set/Multiset of Keys as IPFS object

IPFS objects have schema (Links, Data) where Links is a list of (Name, Size, Hash). Name must be UTF-8 and human friendly, so we choose to leave it empty. Link Hash cannot be empty.

We implement a Set and Multiset of IPFS objects, using IPFS objects for storage.

For our purposes, we divide Links into two groups:

  • the first fanout Links point to subtrees
  • the remaining n Links (the items) are the stored Keys themselves

fanout changes rarely if ever, while n grows and shrinks dynamically.

The object always has at least fanout Links (see below for where that value is stored). Those links may point to the well-known hash value QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n that has no Links and no Data. We will use it to signal empty.

For Set: Keys in items are distinct. As a write optimization, a key in items may also have a copy in the correct subtree.

For Multiset: items may contain the same key multiple times, and the correct subtree may contain even more entries for the key. The number of times a Key has been added to the Multiset is the sum of all these reference counts.

The object Data consists of two sections:

  • a protobuf message, described below
  • an array of fixed-size refcounts, one per item Link (used by Multiset)

The protobuf message is described by:

message Set {
    // 0 for now, library will refuse to handle entries with version greater than recognized.
    required uint32 version = 1;
    // how many of the links are subtrees
    required uint32 fanout = 2;
    // hash seed for subtree selection, a random number
    required fixed32 seed = 3;
}
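
The spec above does not say how the two Data sections are delimited; as a hedged sketch, here is one plausible layout (a uvarint length prefix before the protobuf header), which is an assumption rather than part of the format:

import (
    "encoding/binary"
    "errors"
)

// splitData separates Data into the marshaled Set header and the trailing
// refcount bytes. The uvarint length prefix is an assumed convention.
func splitData(data []byte) (hdr, refcounts []byte, err error) {
    n, w := binary.Uvarint(data)
    if w <= 0 || uint64(len(data)-w) < n {
        return nil, nil, errors.New("malformed Set header length")
    }
    hdr = data[w : w+int(n)]    // protobuf Set message
    refcounts = data[w+int(n):] // one uint8 per item Link (Multiset)
    return hdr, refcounts, nil
}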

Refcounts

To be useful in an array, we need fixed size refcounts. However, there is no limit to how many times an IPFS object may be referenced. To support this, we allow Multisets to contain multiple instances of the item, both in the same node and in the correct subtree.

Version 0 of the format uses uint8 refcounts.

fanout changes

(This can be left unimplemented for now.)

We can avoid wasting space for fanout empty Links in leaf nodes by making new, small nodes have fanout 0.

fanout changes would change the hash decisions of all items in subtrees, but we can choose to only make such changes at times when that cost is non-existent.

Leaf to node: Changing fanout from 0 requires inserting fanout empty items at the beginning of Links. This can be done at the time a leaf node becomes too big, and is split into an inner node with multiple leaves.

Node to leaf: Changing fanout to 0 can happen at a time when all fanout Links are empty. This is to preserve subtree partitioning.

The exact fanout value may differ between versions of the software.

Lookups

If a key is not in the n directly stored keys, the Link to look it up in is the one at index h(seed || key) % n where seed is the bytes of the uint32 in little-endian order, h is a 32-bit FNV-1a hash, and || is a string concatenation operation.
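
A small Go sketch of that index computation (hash/fnv provides 32-bit FNV-1a); the function name is illustrative:

import (
    "encoding/binary"
    "hash/fnv"
)

// childIndex picks which of the fanout subtree Links to descend into for a
// key that is not among the directly stored items: h(seed || key) % fanout,
// with seed serialized as a little-endian uint32.
func childIndex(seed uint32, key []byte, fanout uint32) uint32 {
    var s [4]byte
    binary.LittleEndian.PutUint32(s[:], seed)
    h := fnv.New32a()
    h.Write(s[:]) // seed bytes first
    h.Write(key)  // then the key
    return h.Sum32() % fanout
}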

Removing keys may decrease n, all the way down to 0. Adding keys adds them to Links as long as n is small enough (details are implementation specific and not to be relied upon).

Optimization: decreasing write amplification

To allow for cheap updates, Add operation can just insert keys in the root node, without first checking whether the key is present in the subtree. This can lead to multiple instances of the key being present in the tree.

Thus, removing a key must recursively remove the key from the relevant child in Links, to avoid shadowed old instances from becoming visible.

Multiset refcount decrement can decrement the first instance it finds, stopping there. Increment can opportunistically increment an existing entry, adding a new entry if one is not found. Refcounts can be merged at any time when they do not overflow.
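
For illustration, a hedged sketch of the merge rule with uint8 refcounts (names are hypothetical):

// mergeRefcounts combines two refcounts for the same key only when the sum
// still fits in a uint8; otherwise the two instances stay separate.
func mergeRefcounts(a, b uint8) (uint8, bool) {
    sum := uint16(a) + uint16(b)
    if sum > 0xff {
        return 0, false
    }
    return uint8(sum), true
}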

wking commented 9 years ago

On Fri, May 08, 2015 at 09:58:31AM -0700, Tv wrote:

Next draft.

I'm liking this more :). Hopefully it's performant enough that we don't end up getting pushed back to in-data cdbs, because then I'd feel pretty silly :p.

I did not change the hash function yet…

Yeah, this is a minor issue for me too. Especially before we have something available to benchmark.

// 0 for now, library will refuse to handle entries with version greater than recognized.
required uint32 version = 1;

I'm fine with this, and it's the traditional approach. But for the archive format I'm hoping to use a multihash version to get the self-describing features discussed in ipfs/ipfs#36. In that case, I'd just store the accepted version of a spec like this in a unixfs file, and use its multihash as the version for pinning nodes following that spec. Pros and cons:

On the balance, I think the pros outweigh the cons, especially for specs that we expect to be fairly stable. But again, I'm fine if this all sounds like crazy-talk and you'd rather store the version information in an incremented integer.

If a key is not in the n directly stored keys, the Link to look it up in is the one at index h(seed || key) % n

‘% n’ should be ‘% fanout’.

To allow for cheap updates, Add operation can just insert keys in the root node, without first checking whether the key is present in the subtree. This can lead to multiple instances of the key being present in the tree.

Thus, removing a key must recursively remove the key from the relevant child in Links, to avoid shadowed old instances from becoming visible.

Works for me. I don't mind who pays the subtree-drilling cost. One issue to consider is atomicity, but in either case I expect we'd only update the root-pin-object hash after completing the operation, so we'd keep a consistent state regardless of power-outages, etc. The more expensive operation just runs a higher risk of failure, and catastrophic failures like that should be rare enough that it doesn't matter who we pick.

Multiset refcount decrement can decrement the first instance it finds, stopping there. Increment can opportunistically increment an existing entry, adding a new entry if one is not found. Refcounts can be merged at any time when they do not overflow.

This seems like a useful thing to do during the garbage-collection sweeps, and should be fairly efficient if the traversal algorithm is depth-first through the subtrees in parallel with a walk across the items. That's going to give you one item-walker per level, but this tree shouldn't be so deep that that's a prohibitive cost. And doing something that periodically refactors this tree during low-use periods would help make subsequent operations more efficient.

tv42 commented 9 years ago

@wking

Especially before we have something available to benchmark.

FNV-1a of an IPFS key takes about half the time of a single RAM cache miss.

But for the archive format I'm hoping to use a multihash version

That pretty quickly leads to a world where objects know their own type, and at that point that shouldn't live in Data, but alongside Links and Data. I'm not quite convinced it's worth storing that in every object. In any case, this is easy to convert to that, once the infrastructure exists. For more inspirational reading, see Ethos ETypes.

‘% n’ should be ‘% fanout’.

Yes. Thank you.

One issue to consider is atomicity,

It's a merkle tree; all atomicity comes from flipping the root hash, nothing else is possible.

wking commented 9 years ago

On Fri, May 08, 2015 at 11:56:54AM -0700, Tv wrote:

Especially before we have something available to benchmark.

FNV-1a of an IPFS key takes about half the time of a single RAM cache miss.

In that case we might as well use it, since the person writing the code should get to pick their favorite shed color ;).

But for the archive format I'm hoping to use a multihash version

That pretty quickly leads to a world where objects know their own type, and at that point that shouldn't live in Data, but alongside Links and Data.

I think the “where should this type information live” is what's happening in ipfs/ipfs#36.

In any case, this is easy to convert to that, once the infrastructure exists.

Well, modulo some version-dependent migration, since it won't fit into your version scheme here. But yeah.

tv42 commented 9 years ago

@wking I can still change the "current version" to 1, new versions can leave that tag unused in the protobuf, it'll get the value 0.

wking commented 9 years ago

On Fri, May 08, 2015 at 05:02:35PM -0700, Tv wrote:

I can still change the "current version" to 1, new versions can leave that tag unused in the protobuf, it'll get the value 0.

Oh, right :p.

I think that covers all my concerns with the v2 spec :).

jbenet commented 9 years ago

I like v2 very much!! :+1: :+1: :+1: cool construction


I've some questions which may be obviously stupid:

1, a few times it is mentioned that the key may be stored multiple times. why not store the keys once, and use an array of numbers, like:

links: [ l1, l2, l2, l3, l3 ]
data: { ... }
--- # vs
links: [ l1, l2, l3 ]
data: { counts: [ 1, 2, 2 ], ... }

2, is it possible to also use the names to signal fanout vs n?

links: [ { hash: k1, name: "fanout", ... }, ... ]
data: { ... }

3, we could bite the bullet and figure out how the hell to make links nicely extensible. (for example, maybe we can designate proto tag numbers above 50 to be for the user), so we could make the link:

message SetLink {
  // standard merkledag part
  optional bytes Hash = 1;
  optional string Name = 2;
  optional uint64 Tsize = 3;

  // ours
  optional bool fanout = 51;
  optional uint64 count = 52;
}

It may be a bit tricky to parse out, but we may be able to make a nice set of interfaces (a bit capnp inspired) which expose things like:

// in merkledag
type Link struct {
  rawData []byte
}

func (ln *Link) Data() []byte {
  return ln.rawData
}

// in SetLink
type SetLink struct {
  data []byte
}

func CastLink(ln *dag.Link) *SetLink {
  return NewSetLink(ln.Data())
}

func NewSetLink(data []byte) *SetLink {
  // validates data, checks it's a conforming protobuf
  setLinkCheckProtobuf(data)
  return &SetLink{data}
}

(sorry, don't mean to co-opt this thread. we can discuss this complicated question elsewhere. Oh and, maybe I'm stupid but so far I find the Any type really cumbersome to use, more than "casting")


I agree with @wking here:

I'm hoping to use a multihash version to get the self-describing features discussed in ipfs/ipfs#36. In that case, I'd just store the accepted version of a spec like this in a unixfs file, and use its multihash as the version for pinning nodes following that spec. ... On the balance, I think the pros outweigh the cons, especially for specs that we expect to be fairly stable.

I think we

at that point that shouldn't live in Data, but alongside Links and Data

As described in ipfs/ipfs#36, Links and Data give us enough-- meaning that we can use a link and name it @type as in JSON-LD. So we'd have:

links: [ { name: "@type", hash: <hash-to-spec>, ... }, l2, l3, ... ]
data: { ... }

or @context

links: [ { name: "@context", hash: <hash-to-spec>, ... }, l2, l3, ... ]
data: { ... }

The importance of doing this in the links is that:

  • we don't doom ourselves to buy into one (sure to seem short-sighted and problematic in 5 years) type-structure.
  • all the ipfs link semantics apply (reference + backup specs, can traverse them in paths: .../myobj/@type, etc.)

For more inspirational reading, see Ethos ETypes.

I'll read on Ethos ETypes, thanks for posting. May want to check out PADS too (links in ipfs/ipfs#36) -- that's a fantastic piece of work -- even though it's more Type-Theory specific, and takes a long grokking time before the benefits are apparent. And to my knowledge there are no developer-sensible approaches yet. (i.e. no "JSON-LD" to the RDF.)


FNV-1a of an IPFS key takes about half the time of a single RAM cache miss.

I still don't understand why, if we have uniformly distributed data, we need to rehash it? what am i missing?

Like -- depending on how big the seed is, and whether it's properly random -- why can't we just:

combine(seed, key) % fanout

where combine could be:

xor(seed, key[:len(seed)])
(seed || key)
(key[:len(seed)] ** seed)

In that case we might as well use it, since the person writing the code should get to pick their favorite shed color ;).

Yeah, sounds good to me. go for it :) -- just, i don't understand why it's needed yet.

jbenet commented 9 years ago

anyway all this is minor, this SGTM

tv42 commented 9 years ago

On Sat, May 09, 2015 at 03:52:47AM -0700, Juan Batiz-Benet wrote:

1, a few times it is mentioned that the key may be stored multiple times. why not store the keys once, and use an array of numbers, like:

links: [ l1, l2, l2, l3, l3 ]
data: { ... }
--- # vs
links: [ l1, l2, l3 ]
data: { counts: [ 1, 2, 2 ], ... }

With the current design, the Links indexes map directly to indexes in Data. If the items in Data are variable length, that'll no longer hold. That complicates a lot of things, including e.g. sorting the item Links by Hash.

2, it is possible to also use the names to signal fanout vs n ?

links: [ { hash: k1, name: "fanout", ... }, ... ]
data: { ... }

Yes, it is. But I didn't want to repeat the word "fanout" about 256 times in every object, and couldn't come up with a way to make that cheaper.

Also, if the fanout number is not easily accessible, we need to walk the links and look for where the name changes.

3, we could bite the bullet and figure out how the hell to make links nicely extensible.

Plenty of things there that could be done.

It's worth noting that protobuf as a whole is switching away from that sort of "extensions using numbers" design, into using the Any type with URLy identifiers; the stated reason was that they had too much confusion from ID collisions. But then again, that's wasteful.

A top-level (sibling of Links and Data) protobuf Any would let us include these things in a more structured manner, yet pay the string cost only once per object. The real difficulty is if you want to support GC holding on to links that are only in these extensions. I can explain how to make that work, but it's a bit more complex. It'd essentially look at the types of all fields, named Links or not, and use that as the criterion for whether something is a link or not.

Venti solves a similar problem by using two streams for one kind of data, one with well-known semantics and the other one with type-specific semantics. The Multiset datatype could just keep another array or such under the Any variable, for the refcount use or its edge cases.

Now, I'd rather get this thing out there in a simple form, first, because the above will have avalanche effects all over the system.

Let's please make a separate ticket for "extensible objects"? If and when that stuff gets actually programmed, we'll migrate this over. Unless you really want me to put this work on hold until that's done.

I agree with @wking here:

I'm hoping to use a multihash version to get the self-describing features discussed in ipfs/ipfs#36. In that case, I'd just store the accepted version of a spec like this in a unixfs file, and use its multihash as the version for pinning nodes following that spec. ... On the balance, I think the pros outweigh the cons, especially for specs that we expect to be fairly stable.

I think we

at that point that shouldn't live in Data, but alongside Links and Data

As described in ipfs/ipfs#36, Links and Data give us enough-- meaning that we can use a link and name it @type as in JSON-LD. So we'd have:

links: [ { name: "@type", hash: <hash-to-spec>, ... }, l2, l3, ... ]
data: { ... }

or @context

links: [ { name: "@context", hash: <hash-to-spec>, ... }, l2, l3, ... ]
data: { ... }

The importance of doing this in the links is that:

  • we don't doom ourselves to buy into one (sure to seem short-sighted and problematic in 5 years) type-structure.
  • all the ipfs link semantics apply (reference + backup specs, can traverse them in paths: .../myobj/@type, etc.)

The part that's hurting this particular use is that if Links can have all kinds of extra entries, we can no longer safely do things like just compute a link index and follow that as a subtree.

That also conflicts with the idea that Links are unixfs filenames. You really don't want to make "mkdir @context" an invalid operation.

I feel like the idea has merit, but slapping everything into Links is going to be overly problematic. Why not use the protobuf? If you want a type field, why not add a type field.

This conversation doesn't really belong in this ticket. Only one thing matters here: do we put this work on hold waiting for the extensibility discussion to result in working code, or not.

FNV-1a of an IPFS key takes about half the time of a single RAM cache miss.

I still don't understand why, if we have uniformly distributed data, we need to rehash it? what am i missing?

Like -- depending on how big the seed is, and whether it's properly random -- why can't we just:

Because it's hard to convince me the data is always uniformly distributed.

If I publish an object, it contains links that I have carefully crafted. If I can make you pin it, I get to control what goes in your indirect pins. I can very easily generate junk objects until I get ones with keys that e.g. all have 0x00 in the same location. I can even hide them in a subfolder you're unlikely to look in. And that's assuming the links even have to really point to existing objects to trigger the attack, which I'm not even sure is needed.

(While doing this work, I'm starting to have opinions about how the garbage collection could avoid all this.. but I really don't want to delay this even further.)

:(){ :|:&};:

jbenet commented 9 years ago

Now, I'd rather get this thing out there in a simple form, first, because the above will have avalanche effects all over the system.

Yeah, that sounds good to me :+1:

Because it's hard to convince me the data is always uniformly distributed.

Ah! that's a good point. yep, we do need to hash :+1:

tv42 commented 9 years ago

Final version, needs to be archived somewhere, probably in this repo. Note that it's not specific to pinning.

Set/Multiset of Keys as IPFS object

IPFS objects have schema (Links, Data) where Links is a list of (Name, Size, Hash). Name must be UTF-8 and human friendly, so we choose to leave it empty. Link Hash cannot be empty.

We implement a Set and Multiset of IPFS objects, using IPFS objects for storage.

For our purposes, we divide Links into two groups:

  • the first fanout Links point to subtrees
  • the remaining n Links (the items) are the stored Keys themselves

fanout changes rarely if ever, while n grows and shrinks dynamically.

The object always has at least fanout Links (see below for where that value is stored). Those links may point to the well-known hash value QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n that has no Links and no Data. We will use it to signal empty.

For Set: Keys in items are distinct. As a write optimization, a key in items may also have a copy in the correct subtree.

For Multiset: items may contain the same key multiple times, and the correct subtree may contain even more entries for the key. The number of times a Key has been added to the Multiset is the sum of all these reference counts.

The object Data consists of two sections:

  • a protobuf message, described below
  • an array of fixed-size refcounts, one per item Link (used by Multiset)

The protobuf message is described by:

message Set {
    // 1 for now, library will refuse to handle entries with version different from what's recognized.
    required uint32 version = 1;
    // how many of the links are subtrees
    required uint32 fanout = 2;
    // hash seed for subtree selection, a random number
    required fixed32 seed = 3;
}

Refcounts

To be useful in an array, we need fixed size refcounts. However, there is no limit to how many times an IPFS object may be referenced. To support this, we allow Multisets to contain multiple instances of the item, both in the same node and in the correct subtree.

Version 1 of the format uses uint8 refcounts.

fanout changes

(This can be left unimplemented for now.)

We can avoid wasting space for fanout empty Links in leaf nodes by making new, small nodes have fanout 0.

fanout changes would change the hash decisions of all items in subtrees, but we can choose to only make such changes at times when that cost is non-existent.

Leaf to node: Changing fanout from 0 requires inserting fanout empty items at the beginning of Links. This can be done at the time a leaf node becomes too big, and is split into an inner node with multiple leaves.

Node to leaf: Changing fanout to 0 can happen at a time when all fanout Links are empty. This is to preserve subtree partitioning.

The exact fanout value may differ between versions of the software.

Lookups

If a key is not in the n directly stored keys, the Link to look it up in is the one at index h(seed || key) % fanout where seed is the bytes of the uint32 in little-endian order, h is a 32-bit FNV-1a hash, and || is a string concatenation operation.

Removing keys may decrease n, all the way down to 0. Adding keys adds them to Links as long as n is small enough (details are implementation specific and not to be relied upon).
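
To tie the lookup rules together, here is a hedged sketch of a recursive membership check; the parsed-node shape and the fetch function are hypothetical stand-ins, not part of the spec:

import (
    "encoding/binary"
    "hash/fnv"
)

// Hypothetical, already-parsed view of one Set object.
type setNode struct {
    Fanout uint32   // from the Set header in Data
    Seed   uint32   // from the Set header in Data
    Links  []string // first Fanout hashes are subtrees, the rest are items
}

// fetch resolves a hash to a parsed node; assumed to be supplied by the caller.
var fetch func(hash string) (*setNode, error)

const emptySet = "QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n"

func contains(root, key string) (bool, error) {
    n, err := fetch(root)
    if err != nil {
        return false, err
    }
    // Check the n directly stored keys first.
    for _, h := range n.Links[n.Fanout:] {
        if h == key {
            return true, nil
        }
    }
    if n.Fanout == 0 {
        return false, nil // leaf node, no subtrees to search
    }
    // Otherwise descend into the subtree at index h(seed || key) % fanout.
    var s [4]byte
    binary.LittleEndian.PutUint32(s[:], n.Seed)
    hs := fnv.New32a()
    hs.Write(s[:])
    hs.Write([]byte(key))
    child := n.Links[hs.Sum32()%n.Fanout]
    if child == emptySet {
        return false, nil
    }
    return contains(child, key)
}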

Optimization: decreasing write amplification

To allow for cheap updates, Add operation can just insert keys in the root node, without first checking whether the key is present in the subtree. This can lead to multiple instances of the key being present in the tree.

Thus, removing a key must recursively remove the key from the relevant child in Links, to avoid shadowed old instances from becoming visible.

Multiset refcount decrement can decrement the first instance it finds, stopping there. Increment can opportunistically increment an existing entry, adding a new entry if one is not found. Refcounts can be merged at any time when they do not overflow.

sivachandran commented 8 years ago

Is there anything implemented related to this? Is there a way to store key-value pair and retrieve/lookup the value by key?