ipfs / boxo

A set of reference libraries for building IPFS applications and implementations in Go.
https://github.com/ipfs/boxo#readme
Other
213 stars 95 forks source link

ipld/unixfs/hamt: Review and comment the HAMT package #387

Open schomatis opened 6 years ago

schomatis commented 6 years ago

HATM description

Brief description of what we have at the moment.

The Hash Array Mapped Trie (HAMT, superficially explained in https://en.wikipedia.org/wiki/Hash_array_mapped_trie) is at its heart a tree of DAG nodes that together represent a directory. Instead of storing all of the entries of a directory as DAG links in a single node (BasicDirectory implementation) the HAMTDirectory distributes the entries across the nodes of the tree.

The entries are indexed based on the hash of its name (to allow for an even distribution), this hash shouldn't be confused with the hash inside the CID that addresses the contents of the entry (node) we're pointing to, the HAMTDirectory cares only about the names of the links. Two entries could point to the same content (file) with different names and they would be saved in different positions in this tree.

This tree is actually a prefix tree (trie) so each node has only a part of the hash of the entry name, all of its children share the same prefix (sub-hash) of the parent node, with each children having a different hash continuation to split different entries.

From the DAG layer perspective a big graph representing a directory and the files it contains can actually be divided in first a sub-graph (starting at the root) that represent the directory and from that other subgraphs representing the files it points to.

* `[`: node belonging to the `HAMTDirectory`.
* `{`: node belonging to an entry (any UnixFS type of file,
       can be another `HAMTDirectory`) pointed to by the directory.

[ Root (no content) ]
   |
   |
   |  0xAF (prefix)
   |------------------> [0xAF]
   |                      |  
   |                      |  
   |                      |  0x14
   |                      |----------> [0xAF14] ---> {file with hashed name
   |                      |                            starting with 0xAF14}
   |                      |  
   |                      |  
   |                      |  0x65
   |                      |----------> [0xAF65] ---> {hashed name: 0xAF656E8F...}
   |
   |  0x3C
   |------------------> [0x3C] -----> {0x3C246AF3E...}
   |
   |
   |  0x12
   |------------------> [0x12]
   |                      |  
   |                      |  0x65
   |                      |----------> [0x1265] ---> {0x12653A5F1D....}
   |                      |  
   |                      |  
   |                      |  0x33
   |                      |----------> [0x1233]
   |                                      |  
   |                                      |  
   |                                      |  0xAF
   |                                      |----------> [0x1233AF] ----> {0x1233AF...}
   |                                      |  
   |                                      |  
   |                                      |  0x1D
   |                                      |----------> [0x12331D] ----> {0x12331D...}

Implementation

Notes about the implementation (focusing on the more obscure points that need to be refactored).

In a normal trie we could have values at any node in the tree (besides the root). In this trie only leaf nodes have values since we're using a hash function (of fixed length) to index them so an internal node with a value would mean an entry with a shorter hash (the node's prefix) which is not possible. In this particular implementation the leaf nodes (shardValue) are encoded inside the parent to avoid making an extra request at the DAG level (improving performance). The result is that a normal node (Shard) can have many values (entries in the directory) since it's encoding many leaf nodes internally. This is all hidden behind the child interface which obscures the code considerably, as a result we have a Label method only implemented for the shardValue (the Shard returns an empty string) and the Link method being used for very different purposes in the Shard and shardValue (one is the link to another node in the HAMTDirectory and the other is actually the value stored in the leaf node which extends beyond the directory).

Another ramification of this design decision is that to distinguish at the DAG level which link points to another HAMTDirectory node or which points to a value (an entry in the directory) the names of the links are overloaded with a hexadecimal string prefix (corresponding to the new part of the hash prefix that is added in the new edge of the trie). The maxpadlen attribute encodes the length of that string prefix in the tree and names bigger than that are links pointing to entries (where the actual name of the entry is the sub-string after the hash prefix, e.g., A3file with maxpadlen of 2) and links with names equal to maxpadlen just point to another internal node in the directory (that will have a prefix that includes this added sub-hash). Example: https://github.com/ipfs/go-unixfs/blob/master/hamt/hamt.go#L278. The result is the Shard holding an attribute (prefixPadStr) which just encodes a string used for a printf call which is not easy to interpret correctly (https://github.com/ipfs/go-unixfs/blob/master/hamt/hamt.go#L87).

Since this is already encoded in existing sharded directories already deployed I don't think that it will be possible to be modified (extracting this information from the DAG into the UnixFS layer) but should be clearly documented and encapsulated in a separate function.

There are two layers interacting here, DAG and UnixFS. When a directory is being manipulated (adding/removing entries) the information is encoded in the UnixFS layer (through the children attribute) and only flushed to the DAG layer (the node containing all this information) when Node() is called, but at times there is a decoupling which is to be expected but not clearly documented, and without a clear API of how to interact with the Shard in that state, e.g., how to enumerate its entries: should the links of the DAG node be checked or the children slice? At times there is only one of those that have the accurate information, e.g., when loading a Shard from a DAG node children is empty (the DAG links should be checked) and when modifying the Shard the information is encoded in the children but not passed to the DAG node. As an example of what would be needed see the FSNodeOverDag which defines a specific API to interact with a mutable object between the two layers.

The internal ProtoNode of the Shard (nd) isn't really used except as a link slice so if the interaction previously described is clearly defined (and correctly decoupled) there is no need to keep this cache node (which isn't even updated when Node is called), only the links need to be kept structuring them as just a performance improvement: instead of loading all the child nodes immediately we keep the links and request them on an as-needed basis (but they shouldn't be considered to hold the actual state of the node). This may not be possible since the new hash part is encoded in the link name (instead of inside the Shard object of the UnixFS layer where it belongs).

===

Sub issues:

schomatis commented 6 years ago

@overbool you up for some real challenge? :grin: I may throw some tasks at you to help me understand what's going on there.

overbool commented 6 years ago

@schomatis 👌, I am willing to accept the challenge, come on.

zot commented 4 years ago

Is this experimental? I see UseHAMTSharding = false and sourcegraph can't find any references to the variable outside the project...

aschmahmann commented 4 years ago

@zot I don't know how good sourcegraph is at finding things, but here are a few references https://grep.app/search?q=UseHAMTSharding

HAMT Sharding is experimental within go-ipfs https://github.com/ipfs/go-ipfs/blob/master/docs/experimental-features.md#directory-sharding--hamt