ipfs / notes

IPFS Collaborative Notebook for Research

Sharding IPLD objects #76

Open jbenet opened 8 years ago

jbenet commented 8 years ago

This note gathers the constraints + will drive toward a design of object sharding in IPFS and IPLD. Object sharding is the algorithms and formats used to represent a single (virtual) large object out of many smaller ones. Think of this like the way large directories are represented in modern filesystems (RB-Trees, B-Trees, HTrees, etc).

Sharding IPLD objects in general is a useful thing. Instead of implementing it for unixfs and other datastructs each time, we could implement it once. It could be a datastruct the others employ, or maybe -- if it is simple enough -- it belongs as part of IPLD itself.

Constraints to support:

For large fanouts, look at


case for supporting it on-top of IPLD


cc @whyrusleeping @lgierth @diasdavid @cryptix @ion1 @mildred @tv42 @wking

ion1 commented 8 years ago

It might be useful to take advantage of rolling-hash chunking for large flat data structures that receive small arbitrary changes, such as a directory.
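
As a rough illustration of content-defined chunking (a sketch only; the window size, base, and boundary mask below are arbitrary values for illustration, not the parameters of any actual IPFS chunker):

```go
package main

import (
	"fmt"
	"math/rand"
)

// chunk splits data at content-defined boundaries using a simple
// Rabin-Karp style rolling hash over a fixed window. Window, base,
// and mask are arbitrary illustration values.
func chunk(data []byte) [][]byte {
	const (
		window = 16
		base   = 257
		mask   = 0x0FFF // ~1/4096 cut probability, so ~4 KiB average chunks
	)

	// base^window, for removing the byte that falls out of the window
	var pow uint64 = 1
	for i := 0; i < window; i++ {
		pow *= base
	}

	var chunks [][]byte
	var h uint64
	start := 0
	for i, b := range data {
		h = h*base + uint64(b)
		if i >= window {
			h -= uint64(data[i-window]) * pow
		}
		// cut where the low bits of the rolling hash match a fixed pattern;
		// boundaries depend only on local content, so a small edit
		// only changes the chunks around it
		if i >= window && h&mask == mask {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	data := make([]byte, 1<<16)
	rand.Read(data) // random stand-in for directory/file bytes
	for _, c := range chunk(data) {
		fmt.Println(len(c))
	}
}
```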

davidar commented 8 years ago

Radix trees have the benefit of being deterministic (invariant to the order of insertions), which helps with deduplication.

Edit: also see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.5452

jbenet commented 8 years ago

cc @whyrusleeping comment here.

jbenet commented 8 years ago

From the IPLD spec, one very trivial but extensible thing to do all-IPLD-object-wide is something like this:

{
  "@extends": [
    "/ipfs/<hash1>",
    "/ipfs/<hash2>",
    "/ipfs/<hash3>",
  ],
  ... // more properties
}

which would merge the objects like so:

o1 = get(/ipfs/<hash1>)
o2 = get(/ipfs/<hash2>)
o3 = get(/ipfs/<hash3>)
o4 = get(/ipfs/<hash4>) // our object above

merged = _.extends(o1, o2, o3, o4)

Meaning that we start with the first and patch in order, so a property in o4 replaces those preceding it. This is very simple and can likely be used to support the desires mentioned earlier in this issue (more complicated structures).
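
A minimal sketch of those merge semantics in Go (plain last-writer-wins over top-level keys; resolving real `@extends` links would of course fetch the referenced IPLD objects first):

```go
package main

import "fmt"

// extend merges objects left to right: a key in a later object
// replaces the same key from any earlier one, mirroring the
// "start with the first and patch in order" semantics above.
func extend(objects ...map[string]interface{}) map[string]interface{} {
	merged := map[string]interface{}{}
	for _, obj := range objects {
		for k, v := range obj {
			merged[k] = v
		}
	}
	return merged
}

func main() {
	o1 := map[string]interface{}{"name": "a", "size": 1}
	o4 := map[string]interface{}{"size": 42} // our object; overrides o1's "size"
	fmt.Println(extend(o1, o4))              // map[name:a size:42]
}
```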


It would be useful to have examples of directory sharding with:

expressed as the corresponding JSON or YML syntax for IPLD. (I use JSON above because it tends to be clearer, but I prefer YML.) That way we can see what it takes to support it natively in IPLD.

davidar commented 8 years ago

Personally I'd like it if this was a core part of IPLD, so that large objects were transparently sharded without the user having to explicitly do anything (much like the transparent merkle-linking). Here's an example of a manual radix tree though:

r:
  om:
    an:
      e: 1
      us: 2
    ulus: 3
  ub:
    e:
      ns: 4
      r: 5
    ic:
      on: 6
      undus: 7

Sub-trees can be split into separate (merkle-linked) objects, analogously to what we already do with chunking linear streams.

jbenet commented 8 years ago
mildred commented 8 years ago

I think this should not be implemented below the basic IPLD data model: you want, depending on the application, to control where to split objects. For example, metadata should be easily accessible while the complete file data could take a little more time to load.

I think this should be an extension of IPLD. And you could have the choice to:

It might be necessary to think about exactly how we are going to reassemble the objects, and which algorithms in particular, especially considering that it's not just about tree merging but also data-string merging.

For example, an object might be composed of a huge binary data string (for example the object representing a huge file). It should be possible to split this data string among different physical IPLD objects.

The way I would implement it is to have index objects that link to sub-parts, and specify with the link the way they are merged. Something like what @jbenet suggested.

@davidar I disagree with you when you say that it should be completely transparent. The application should have control over the splitting of files. We could have a library that does a good job of splitting when the application author doesn't want to bother, but we should be able to have full control.

jbenet commented 8 years ago

Agree in full with @mildred

jbenet commented 8 years ago

Some links for reference.

A tricky problem here is tuning the trade-off between "good datastructures that create very small nodes" and "good-sized objects that minimize hashing costs, cache misses, and network ops". Basically, there's sharding, which represents one object with many small ones, and then there's aggregation, which represents many small objects with a big one. Most good approaches to sharding from the data-structure literature yield excellent datastructs with many small objects, which we'd want to aggregate. Aggregation is tricky, because making it stable (convergent, for hashing) and efficient (minimizing update costs) is challenging. Aggregation is not straightforward, because "bundling many objects in the underlying transport instead of in the data model" is not enough, as that does not minimize hashing and random-access advertisements.

jbenet commented 8 years ago

More on HAMTs

whyrusleeping commented 8 years ago

@jbenet Patricia trees would be cool. But figuring out how to aggregate them in a way that is stable across random insertion orders is going to be very tricky.

The HAMT sounds a lot like what you were hacking on in SF, except for the part about counting bits for non-nil pointers. With that trick, we avoid having hugely bloated nodes which was my primary complaint with your previous idea.
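
A sketch of the bitmap/popcount trick being discussed, assuming a 32-way fanout and illustrative field names (not the eventual unixfs/IPLD layout): each node keeps a bitmap of which slots are occupied and a dense slice of only the occupied children, and the popcount of the bits below a slot gives that child's index in the slice.

```go
package main

import "fmt"

// node is a toy HAMT node: a 32-bit bitmap marks which of the 32
// logical slots are occupied, and children holds only the occupied
// ones, in slot order. Field names and 5-bit fanout are illustrative.
type node struct {
	bitmap   uint32
	children []interface{}
}

// popcount counts the set bits in x (what math/bits.OnesCount32 does).
func popcount(x uint32) int {
	n := 0
	for ; x != 0; x &= x - 1 {
		n++
	}
	return n
}

// lookup finds the child for a 5-bit hash fragment, if present.
func (n *node) lookup(slot uint) (interface{}, bool) {
	bit := uint32(1) << slot
	if n.bitmap&bit == 0 {
		return nil, false // slot empty; no space wasted on it
	}
	// index into the dense children slice = number of occupied slots below this one
	idx := popcount(n.bitmap & (bit - 1))
	return n.children[idx], true
}

func main() {
	n := &node{bitmap: 0b10110, children: []interface{}{"a", "b", "c"}} // slots 1, 2, 4
	fmt.Println(n.lookup(2)) // b true
	fmt.Println(n.lookup(3)) // <nil> false
}
```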

whyrusleeping commented 8 years ago

Looking at implementation details for the HAMT, I found this absurd 'population count' implementation: https://play.golang.org/p/U7SogJ7psJ

jbenet commented 8 years ago

Yeah, HAMT is the way to go. I love it when we find the trails of other researchers and realize they've been paved pretty well, and that they were after exactly the same thing.

jbenet commented 8 years ago

BTW, I pushed the JS thing I was making: https://github.com/jbenet/ipld-ht-experiment

Here were some interesting questions raised:

Parameters

Paging

Writes

Hashing

use non-cryptographic hash functions

use seeds in children/intermediate HT nodes, deterministic to their position in the tree/branch (see the sketch after this list).

Differentiating Leaves
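
A minimal sketch of what position-deterministic seeds might look like, assuming a non-cryptographic hash (FNV-32a here) and a seed derived from the node's path from the root; the hash choice and seed derivation are illustrative assumptions, not anything decided in this thread.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// slotFor returns the child slot for a key at a given node, using a
// non-cryptographic hash seeded by that node's position in the tree.
// The seed derivation (prefixing the path from the root) only
// illustrates "deterministic to their position"; it is not a spec.
func slotFor(key string, pathFromRoot []byte, fanout uint32) uint32 {
	h := fnv.New32a()
	h.Write(pathFromRoot) // the seed: every writer at this position hashes identically
	h.Write([]byte(key))
	return h.Sum32() % fanout
}

func main() {
	// the same key lands in different slots at different levels,
	// because each level re-hashes with its own positional seed
	fmt.Println(slotFor("ActualDirName", []byte{}, 32))
	fmt.Println(slotFor("ActualDirName", []byte{0x12}, 32))
}
```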

whyrusleeping commented 8 years ago

@jbenet should the names on links pointing to the intermediate nodes just be their index? I think that makes the most sense.

jbenet commented 8 years ago

@whyrusleeping Not sure what the index buys you, but sure, that works for the protobuf version, though not the IPLD version.

I think actually that we should make ONE version (IPLD, naturally) and project it into protobuf, so that things are interoperable with IPLD from the get-go and we don't have two different HAMT implementations.

jbenet commented 8 years ago

Great article on map data structures for external memory — i.e. ways of storing indexed data that are optimized for storage on disk (or networks :) ).

whyrusleeping commented 8 years ago

@jbenet How does a sharded directory work with paths? Should paths be resolved in the context of unixfs? or strictly with merkledag links?

i.e. a sharded node has links:

{
  Links: {
    "AB" : { 'child shard' },
    "FE" : { 'child shard' },
    "12" : { // this is a child shard too, just showing it for the example
      "Links" : {
         "D4ActualDirName" : "dirhash",
      }
    }
  }
}

This would be a sharded directory that contains (among other things) a directory named "ActualDirName". To access this, I'd expect to be able to do /ipfs/<shardhash>/ActualDirName, but /ipfs/<shardhash>/12/D4ActualDirName is the actual path through the raw DAG.
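
A rough sketch of that lookup, under the purely illustrative assumptions that link names are two-hex-character hash fragments (bare for child shards, fragment-plus-name for actual entries) and that FNV-32a stands in for whatever hash function unixfs ends up using:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shard is a toy sharded directory node: link names are either a bare
// two-hex-char hash fragment pointing at a child shard, or
// fragment+name pointing at the actual entry, mirroring the example
// above. The real layout and hash function are not decided here.
type shard struct {
	links map[string]interface{} // value: *shard or a hash string
}

// fragment returns the two-hex-char hash fragment used at the given
// depth (illustrative: FNV-32a digest, one byte per level, four levels max).
func fragment(name string, depth int) string {
	h := fnv.New32a()
	h.Write([]byte(name))
	return fmt.Sprintf("%02X", byte(h.Sum32()>>(24-8*uint(depth))))
}

// resolve walks the virtual directory: the caller asks for the plain
// entry name; the hash fragments pick the path through the shards.
func (s *shard) resolve(name string, depth int) (string, bool) {
	frag := fragment(name, depth)
	if child, ok := s.links[frag]; ok { // bare fragment: descend into a child shard
		return child.(*shard).resolve(name, depth+1)
	}
	for link, v := range s.links { // fragment + real name: the entry itself
		if hash, ok := v.(string); ok && link == frag+name {
			return hash, true
		}
	}
	return "", false
}

func main() {
	root := &shard{links: map[string]interface{}{
		fragment("ActualDirName", 0): &shard{links: map[string]interface{}{
			fragment("ActualDirName", 1) + "ActualDirName": "dirhash",
		}},
	}}
	// the user path /ipfs/<shardhash>/ActualDirName resolves through the shards
	fmt.Println(root.resolve("ActualDirName", 0)) // dirhash true
}
```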

jbenet commented 8 years ago

unixfs should allow resolving the virtual dag transparently.

jbenet commented 8 years ago

https://github.com/krl/ipfs-crdts

daviddias commented 7 years ago

An update on efficient immutable collections that takes on HAMT and CHAMT:

https://michael.steindorfer.name/publications/phd-thesis-efficient-immutable-collections.pdf

HN discussion

Kubuxu commented 7 years ago

Getting HAMT into IPLD would be a major step in increasing its usability. It can be done in two ways: by introducing it at the resolver layer (it would be transparent), or by writing out a clear spec and a few reference implementations at the application layer.

I was helping @Magik6k work on Wikipedia Search using IPLD; he had to build his own hashmap structure (which was static, needed a full rebuild of everything, and had parameters that weren't autotuned). This significantly increased the time needed for the project.

Kubuxu commented 7 years ago

Also, AFAIK we already partially use the principles of CHAMT, as we don't allocate the full array, and we use bitmap masking and hashing.

daviddias commented 7 years ago

@Kubuxu Agreed. I think the direction everyone wants to go in is to have two layers of IPLD: the first being the one we implemented (the resolver), and the second one that understands things like sharding and other data structs for custom path handling. I know that a few people put a lot of thought into this and we ended up deciding to implement the base layer first; now seems the right time to start figuring out that second layer.