ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

What does it mean to be a directory in the UnixFS layer? #5157

Closed schomatis closed 6 years ago

schomatis commented 6 years ago

The Directory type of a UnixFS object (contrary to its name) doesn't actually indicate a directory, or at least it's not the only type that does: the HAMTShard type also indicates a directory, leading to confusing comparisons like:

https://github.com/ipfs/go-ipfs/blob/ecf7d157a6d5e525b122079367f4b6c2ba25e951/mfs/dir.go#L159-L161

So Directory is only the "plain implementation" type of a directory; it's not the directory type (a very similar confusion arises with the File and Raw types). What are the requirements of a UnixFS directory? There seems to be no interface that defines it; at this point the closest thing that can be found is the Directory from the unixfs/io package,

https://github.com/ipfs/go-ipfs/blob/ecf7d157a6d5e525b122079367f4b6c2ba25e951/unixfs/io/dirbuilder.go#L28-L36

but this structure emerged just to accommodate the HAMT implementation (https://github.com/ipfs/go-ipfs/pull/3042, it evolved from a previous directoryBuilder structure), not as a clear and documented definition of a directory.

The biggest consequence of all this (IMO) is that a new reader needs to go through the *hamt.Shard structure pointer to understand what's happening when content is added to a directory (with the major danger of diving into the really complex code of the hamt package).

So the main objective is to provide the user with a clear explanation of what they can expect from a directory, possibly presenting a documented interface, while also using it to hide the HAMT directory variant as much as possible, to avoid confusing code lines when the user is following the execution path of how files are added to a directory in the MFS hierarchy,

https://github.com/ipfs/go-ipfs/blob/ecf7d157a6d5e525b122079367f4b6c2ba25e951/unixfs/io/dirbuilder.go#L101-L107

For example, looking for common characteristics of the plain and the HAMT implementations (which I mostly do not understand), I'm observing that in both cases the children are still referenced as DAG links (the HAMT keeps a map for quick access to them, but the links remain at the DAG layer); in contrast, the plain implementation does not hold information in the Data field, whereas the HAMT stores the bitfield there.

Stebalien commented 6 years ago

So Directory is only the "plain implementation" type of a directory; it's not the directory type (a very similar confusion arises with the File and Raw types)

Yes. Sharded directories came later so they got a different type.

For example, looking for common characteristics of the plain and the HAMT implementations (which I mostly do not understand), I'm observing that in both cases the children are still referenced as DAG links (the HAMT keeps a map for quick access to them, but the links remain at the DAG layer).

So, all links must go in the Links section of the PBNode. This is just a restriction of the PBNode IPLD format (and the primary motivation to move away from it).

However, sharded directories span multiple nodes. Therefore, for better performance, we cache these shards when we load them. That's why we have the children map.
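
For readers following along, here is a rough Go sketch of the dag-pb node shape being discussed; the struct names only mirror the wire format for illustration (the real definitions live in the merkledag protobuf code). The point is that the only place a traversable link can live is the Links list, while UnixFS metadata (including the HAMT bitfield) is carried in Data.

```go
// Illustrative only: a minimal mirror of the dag-pb wire format,
// not the actual go-ipfs types.
package dagpbsketch // hypothetical package name

type PBLink struct {
	Hash  []byte // CID of the child; this is the only part tools like pin/refs can follow
	Name  string // entry name (for HAMT shards, a hash-prefix plus the entry name)
	Tsize uint64 // cumulative size of the linked subtree
}

type PBNode struct {
	Links []PBLink // all outgoing links must live here
	Data  []byte   // UnixFS envelope: node type, and the bitfield for HAMT shards
}
```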

schomatis commented 6 years ago

However, sharded directories span multiple nodes.

Oh, could you point me to that part of the code? I thought a shard was encoded in the Data field of a single DAG node,

https://github.com/ipfs/go-ipfs/blob/ecf7d157a6d5e525b122079367f4b6c2ba25e951/unixfs/hamt/hamt.go#L142-L143

schomatis commented 6 years ago

So, while working on https://github.com/ipfs/go-ipfs/pull/5160 I realized that MFS is the single consumer of the UnixFS directory abstraction; the definition of the interface is basically what the MFS layer needs from UnixFS.

schomatis commented 6 years ago

@Stebalien

So, all links must go in the Links section of the PBNode. This is just a restriction of the PBNode IPLD format

I'm not sure what kind of restriction that is. I could just save the links in the Data field; as long as the UnixFS methods called by MFS know how to operate on it, I don't need to use the links of the DAG layer. I should, of course, but as the API is just that set of functions defined in the dirbuilder.go file, I don't see any requirement (and there is also no spec) that actually indicates that. That's what I mean with the question of this issue: it seems to me that a UnixFS directory is whatever UnixFS wants it to be as long as it delivers to MFS, and that mental model is hard to transmit to a new user. That's why I want to have a concrete list of "things a UnixFS directory should comply with" (besides a set of methods), to be able to provide more information to the new user.

Stebalien commented 6 years ago

Oh, could you point me to that part of the code? I thought a shard was encoded in the Data field of a single DAG node,

Take a look at getChild and loadChild (on Shard).

I'm not sure what kind of restriction that is. I could just save the links in the Data field; as long as the UnixFS methods called by MFS know how to operate on it, I don't need to use the links of the DAG layer. I should, of course, but as the API is just that set of functions defined in the dirbuilder.go file, I don't see any requirement (and there is also no spec) that actually indicates that.

You could. However, tools like pin, refs, etc. wouldn't work properly.

That's what I mean with the question of this issue: it seems to me that a UnixFS directory is whatever UnixFS wants it to be as long as it delivers to MFS, and that mental model is hard to transmit to a new user. That's why I want to have a concrete list of "things a UnixFS directory should comply with" (besides a set of methods), to be able to provide more information to the new user.

So, there are multiple types of unixfs directories. At the end of the day, they should all expose a standard interface (really, iterate and lookup). However, we're always going to have multiple very different representations. Currently, we have:

  1. The Directory type for small directories.
  2. The HAMTShard type for shards of larger directories.

We actually have the same split in files. We have files that are raw bytes and sharded files.
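
As a rough sketch of what that standard interface could look like (hypothetical names, not the actual go-ipfs API; `ipld` stands in for the go-ipld-format node and link types):

```go
package unixfsdir // hypothetical package, for illustration only

import (
	"context"

	ipld "github.com/ipfs/go-ipld-format"
)

// Directory sketches the contract both representations (plain and
// HAMT-sharded) could satisfy: add, iterate, lookup, and a way to get
// the underlying DAG node back.
type Directory interface {
	// AddChild links a child node under the given name.
	AddChild(ctx context.Context, name string, nd ipld.Node) error
	// ForEachLink iterates over all entries, regardless of sharding.
	ForEachLink(ctx context.Context, f func(*ipld.Link) error) error
	// Find looks up a single entry by name.
	Find(ctx context.Context, name string) (ipld.Node, error)
	// GetNode serializes the directory back into its root DAG node.
	GetNode() (ipld.Node, error)
}
```

MFS could then program against something like this and never care whether the concrete value behind it is a plain Directory or a HAMTShard.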


So, for context, HAMT stands for hash array mapped trie. To insert a key/value (name/file) into a HAMT, you first hash the key. Then, roughly (see the sketch after these steps),

  1. Pop off a prefix from the hash of the key (we take 1 byte giving us 256 entries per shard) and use this prefix as an index into the children array of the current node.
  2. Lookup the child node at that index.
  3. If it's a value:
    1. If the keys are equal, replace it, return.
    2. Otherwise, replace it with a new shard and continue to step 4 (inserting both the displaced value and the new value).
  4. If it's a shard, recurse back to step 1 (popping a new byte off the hash of the key).
  5. If there's nothing at that index, insert the value.
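
A minimal in-memory sketch of those steps, assuming a 256-way fanout and SHA-256 purely for illustration (the real hamt package has its own hashing, lazy shard loading and dag-pb encoding; all names below are hypothetical):

```go
package main

import "crypto/sha256"

const fanout = 256 // one byte of the hash per level, as described above

type entry struct{ key, value string }

type shard struct {
	children [fanout]interface{} // each slot is nil, *entry, or *shard
}

// insert consumes one byte of the key's hash per level (steps 1-5 above).
func insert(s *shard, hash []byte, depth int, key, value string) {
	idx := hash[depth] // step 1: pop a byte off the hash, use it as an index
	switch child := s.children[idx].(type) {
	case nil: // step 5: empty slot, store the value here
		s.children[idx] = &entry{key, value}
	case *entry: // step 3: the slot already holds a value
		if child.key == key { // 3.1: same key, replace it
			child.value = value
			return
		}
		// 3.2: different key, replace the value with a new shard and
		// reinsert both the displaced entry and the new one, one level down
		ns := &shard{}
		oldHash := sha256.Sum256([]byte(child.key))
		insert(ns, oldHash[:], depth+1, child.key, child.value)
		insert(ns, hash, depth+1, key, value)
		s.children[idx] = ns
	case *shard: // step 4: already a shard, recurse with the next hash byte
		insert(child, hash, depth+1, key, value)
	}
}

func main() {
	root := &shard{}
	for _, e := range []entry{{"a.txt", "QmA..."}, {"b.txt", "QmB..."}} {
		h := sha256.Sum256([]byte(e.key))
		insert(root, h[:], 0, e.key, e.value)
	}
}
```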

However, instead of storing sparsely populated arrays as suggested by this algorithm, we use a bitfield. To look up a value in the array, you (as sketched below):

  1. Lookup the corresponding bit in the bitfield. If it's 1, there's a value at this position.
  2. Count the number of 1 bits before that bit. This count is the index in the Links array.
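
A sketch of that lookup, keeping the 256-bit bitfield in a math/big.Int purely for illustration (the real hamt package has its own bitfield type; the names here are hypothetical):

```go
package main

import (
	"fmt"
	"math/big"
)

// linkIndex reports whether slot pos of a shard is occupied and, if it is,
// the index of the corresponding entry in the packed Links array.
func linkIndex(bitfield *big.Int, pos int) (present bool, index int) {
	if bitfield.Bit(pos) == 0 {
		return false, 0 // step 1: the bit is 0, nothing stored at this position
	}
	// step 2: count the 1 bits before pos; that count is the Links index
	n := 0
	for i := 0; i < pos; i++ {
		if bitfield.Bit(i) == 1 {
			n++
		}
	}
	return true, n
}

func main() {
	bf := new(big.Int)
	bf.SetBit(bf, 3, 1) // pretend slots 3, 7 and 42 are occupied
	bf.SetBit(bf, 7, 1)
	bf.SetBit(bf, 42, 1)
	fmt.Println(linkIndex(bf, 7))  // true 1: one set bit (3) precedes position 7
	fmt.Println(linkIndex(bf, 10)) // false 0: slot 10 is empty
}
```
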
schomatis commented 6 years ago

Extra :+1: for the detailed HAMT explanation.

schomatis commented 6 years ago

This issue has been addressed in a couple of PRs now (the last one being https://github.com/ipfs/go-unixfs/pull/16), so we can close it.

Frando commented 2 years ago

It would be great to add this information on how the UnixFS HAMT works to the UnixFS spec, which currently doesn't have any info on how the HAMTShards are actually encoded and supposed to be used. This thread is the best explanation I found when googling for a spec of the UnixFS HAMT.