Sharding - unixfs support for large directories

willglynn commented 9 years ago

My original post to ipfs/go-ipfs#1720:

This might be an extreme case, but consider someone wishing to make Wikipedia available on IPFS:

$ zcat dumps.wikimedia.org/enwiki/20150901/enwiki-20150901-all-titles-in-ns0.gz | wc -l
 11944438

It seems like a bad idea to have a merkledag node with 12M links, but that would be the representation using unixfs Data_Directory. It seems like a similarly bad idea to have a merkledag node past even 1k links, and directories with a thousand files occur more commonly in practice.

Alternate directory representations (Data_PrefixTreeDirectory and/or Data_HashTableDirectory) might be a solution, using intermediate merkledag nodes to ultimately reference all of a large directory's children in a way that permits efficient traversal and selective retrieval. The distinction between directory representations could be transparent to users, with implementations free to choose whichever data structure it deems suitable.

Going back to the example: the list of Wikipedia pages is 67 MB gzipped, before adding hashes. A user shouldn't have to download ~400 MB of protobufs just to find the hash of a single article page.

What's a sensible upper limit for a merkledag node's link count or encoded size in bytes? Is there precedent for reducing merkledag fan out? What other components would need to know about a new unixfs DataType?

@whyrusleeping replied:

We've been thinking about doing fanout on unixfs directories for some time now. But the 'best' way to do it isnt immediately apparent. There are a few different strategies,

the first is to just add all entries past the first N to a specially named link. And go down that way until we have stored all the entries. This is really convenient for listing directories, as you can just keep loading more until you hit the end, providing output the whole way. The downside is that searching for a certain directory becomes.... hard.

The second way is to shard them git style, and have K links at each level, each with log(k) characters in their name, and recurse that downward until each node has a sane number of links. This method works a lot better for lookups on large amounts of links, but kinda sucks for listing directories.

1k links is totally fine, our max block size is 256k (soft limit), and links are 47 bytes, with that we can store around 5500 links in each node and still be under 256k.

My reply:

COW filesystems are also worth exploring.

ZFS implements POSIX directories on top of "ZAP", the ZFS Attribute Processor, which provides a generalized key-value data structure on top of the underlying COW storage layer. ZAP objects, and thus POSIX directories, are encoded into COW objects for storage using two distinct formats.

The smaller format ("microzap") lumps an entire directory together into a single COW object containing a big boring hash table. It's used if the resulting object would fit into a single block (<= 128 KB) and the keys/filenames would all fit into fixed-sized structs (64 bytes each, so filenames must be <= 50 bytes). In practice this means most directories under 2000 files can be stored trivially, and every directory modification requires writing an entire replacement directory object to the storage layer. This seems analogous to Data_Directory, though the microzap structure uses defined sorting to optimize lookups.

The larger format ("fatzap") is a hash table guaranteed to span at least two objects. The top object is a pointer table, which references 2^N leaf objects, each acting as a hash table bucket. Leaf objects contain a header describing their contents ("I store objects whose hash starts with 10111") and the corresponding hash table, which in the case of a POSIX directory includes filenames pointing to file objects. There are a couple other less common bits of indirection too (leaf chaining for pointer table collisions and external pointer tables past 2^13 leaves), but those are details. The key feature is locality: ZFS can find what it's looking for, either for retrieval or modification, without touching too many blocks or even too many bits inside a block.

ZFS directories work well in practice having these two formats. POSIX readdir() does not guarantee that its results have any particular sorting, and hash ordering on-disk has better properties for lookups than lexical ordering, so that's what ZFS uses.

One other design note: ZAP objects/ZFS directories use a salted hash, meaning that multiple ZFS directories containing the same files will be stored and returned in different orders. I don't know the exact rationale behind this ZFS feature, but Go maps were changed from "undefined" to random iteration because an undefined yet stable ordering allowed people to depend on it.

@jbenet replied:

Thanks @willglynn these are great suggestions!

Maybe we should probably move this discussion into the https://github.com/ipfs/specs repo? it may be good to move discussion about general system changes to that repo. I'm sorry, i know it's very far from complete, but maybe with some help we can get it there!

there's one tricky thing to solve around this: how to discern whether a request should access the underlying ipfs objects, or get the "virtual" unioned directory instead. today we achieve this by separating ipfs {add, cat, ls} (unixfs) from ipfs object {get, put, data, links} (merkledag). in the future, i'd like to find some resolution to this in a way that doesn't create N+1 interfaces (where N is the number of "datastructures layered on ipfs" or "applications"), and instead just 2. (( tricky indeed )).

willglynn commented 9 years ago

how to discern whether a request should access the underlying ipfs objects, or get the "virtual" unioned directory instead

Resolving ambiguity between the merkledag layer and the unixfs layer seems relevant to ipfs/go-ipfs#1710. Maybe merkledag paths really should be [][]bytes…

jbenet commented 9 years ago

well, i want paths to be human readable in the merkledag layer. the unixfs layer isn't per-se a layer, as other datastructures (like tar, bitcoin, git, keychain, etc) will coexist at that level.

willglynn commented 9 years ago

Okay, I think I've collected that thought.

The use I had in mind for a []byte link name was for data structures containing many links (like hash tables) that someone might want to put in the merkledag. These links need to be attached to the merkledag nodes in a standard fashion so that generic tools can retrieve associated nodes without needing to understand the data structure itself. At the same time, encoding a 2^13 hash table as a series of Link structures named 0, 1, 2, … 1fff is just waste; a link named 0x1fff at least has the benefit of being slightly more compact.

In this scenario, nodes would have links that are a) referenced positionally and b) present in tremendous quantities, making explicit human-accessible names both unnecessary and undesirable. If that use case turns out to be significant, then there's an even better alternative:

Extend merkledag nodes to have an optional LinkTable, using repeated … [packed=true] and other tricks to encode link hashes with maximum efficiency, omitting explicit link names.
Extend the merkledag API to decode LinkTable into Links for naïve consumers, synthesizing names sequentially based on position, and returning links from the LinkTable alongside any other Links present on the node. They're all links; LinkTable is just an alternate encoding.
Allow sophisticated consumers (e.g. a unixfs directory hash table implementation) to access the LinkTable by position, permitting direct lookups without parsing/instantiating/garbage collecting thousands of unnecessary Links.

So, nevermind. Nodes could contain links named 0, 1, 2, …1fff without being awful – or at least it could look like they do. Human-accessible merkledag paths it is :-)

daviddias commented 8 years ago

I would like to call attention to this issue again. In order to get the npm over IPFS fully working, we need to make a call and pick a direction when it comes to directory sharding.

davidar commented 8 years ago

I vote for encoding large maps/directories as radix trees. Git-style sharding is essentially just a (less adaptive) approximation to this. It makes directory listing a little more difficult, but not intractably so (just a DFS traversal).

A naive implementation (a merkle-link for each node in the tree) would generate a lot of tiny blocks, so we'd probably want to perform inlining of nodes to fill up blocks to a reasonable size.

The use I had in mind for a []byte link name was for data structures containing many links (like hash tables) that someone might want to put in the merkledag.

I don't think hashtables make much sense in the context of IPFS — immutable/persistent datastructures are a lot better aligned with the strengths and weaknesses of IPFS. Radix trees can fulfil the same function as hashtables, but have better update and deduplication properties in this context.

encoding a 2^13 hash table as a series of Link structures named 0, 1, 2, … 1fff is just waste

If we're using IPLD lists/arrays, encoded with CBOR, I think it should be fairly efficient for cases like that. For large arrays, we can just chunk them the same as files/bytestrings.

jbenet commented 8 years ago

More discussion on sharding: https://github.com/ipfs/notes/issues/76

ipfs / specs

Sharding - unixfs support for large directories #32